Module leapyear.analytics

Statistics and machine learning algorithms.

LeapYear analyses are functions that are executed by the server to compute statistics or to perform machine learning tasks on DataSets. These functions return an Analysis type, which is executed on the server by calling the run() method.

For simple statistics, such as count() or mean(), the values can be extracted using the following pattern:

>>> from leapyear import Client, DataSet
>>> from leapyear.analytics import count_rows, mean
>>> client = Client(url='http://ly-server:4401', username='admin', password='password')
>>> dataset = DataSet.from_table('db.table')
>>> dataset_rows_analysis = count_rows(dataset)
>>> n_rows = dataset_rows_analysis.run()
>>> print(n_rows)
10473
>>> dataset_mean_x_analysis = mean('x0', dataset)
>>> mean_x = dataset_mean_x_analysis.run()
>>> print(mean_x)
5.234212346345

The computation of all univariate statistics follows the pattern for mean(). For more complicated machine learning tasks, multiple columns must be specified, depending on the task.

Unsupervised learning tasks (like clustering) will generally require the specification of which features in the DataSet to use. Supervised learning tasks (like regression) will additionally require the specification of a target variable.

For example, we can train a linear regression model as follows:

>>> from leapyear.analytics import generalized_linreg
>>> regression = generalized_linreg(['x0', 'x1'], 'y', dataset, affine=True, l2reg=1.0)
>>> model = regression.run()

Helper routines are available for performing cross-validation (see cross_val_score_linreg()). Note that, unlike other analyses, they are immediately executed (without calling run()):

>>> from leapyear.analytics import cross_val_score_linreg
>>> cross_val_score = cross_val_score_linreg(
...     ['x0', 'x1'], 'y', dataset, cv=3,
...     affine=True, l1reg=0.1, l2reg=1.0, scorer='mse'
... )

Data Analysis

leapyear.analytics.count(attr, dataset=None, drop_nulls=False, target_relative_error=None, max_budget=None)

Analysis: Count the elements of an attribute.

This analysis can be executed using the run method to compute the approximate count of elements, including NULL values.

The user can request additional information about the computation with run(rich_result=True). In this case, a RandomizationInterval object will be generated, likely including the precise value of the computation on the data sample.

Parameters
  • attr (Union[Attribute, str]) – The attribute to compute the count of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

  • dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

  • drop_nulls (bool) – Whether to ignore NULL values. Default: False.

  • target_relative_error (Optional[float]) – A float value between 0 and 1 indicating the level of relative error that should be targeted for this computation. If specified, the system will attempt to ensure that the absolute value of the relative error between the randomized result and the true count is roughly target_relative_error. If this is not possible due to budget constraints set by the admin, the system will return a randomized result with the smallest randomization effect allowed. If not specified, a default value specified by the admin is used. This can only be used if the admin has turned on the adaptive count feature.

  • max_budget (Optional[float]) – The maximum amount of budget that the system should spend while trying to achieve target_relative_error. If specified, the absolute amount of budget spent will not exceed max_budget. If the user specifies a value greater than the maximum budget for the computation set by the admin, the system will use the admin-set maximum. If the system can achieve target_relative_error while spending less than max_budget, it will do so.

Returns

Analysis object that can be executed using the run method.

Return type

CountAnalysisWithRI
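The two quantities described by target_relative_error and max_budget can be pictured with a few lines of plain Python. This is only an illustrative sketch: relative_error and effective_budget are hypothetical helper names, not part of leapyear.analytics, and admin_max stands in for whatever maximum the admin has configured.

```python
def relative_error(randomized: float, true_value: float) -> float:
    """Absolute value of the relative error between a randomized result and the true value."""
    if true_value == 0:
        raise ValueError("relative error is undefined for a true value of 0")
    return abs(randomized - true_value) / abs(true_value)


def effective_budget(max_budget: float, admin_max: float) -> float:
    """Cap applied to the spend: the user's max_budget, never above the admin-set maximum."""
    return min(max_budget, admin_max)


# A randomized count of 10473 compared with a (hypothetical) true count of 10500.
print(relative_error(10473, 10500))
# A user cap above the admin-set maximum is reduced to that maximum.
print(effective_budget(5.0, admin_max=2.0))
```

A target_relative_error of 0.01 would therefore ask for a randomized count within roughly 1% of the true count, budget permitting.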

leapyear.analytics.count_rows(dataset, target_relative_error=None, max_budget=None)

Analysis: Count the number of rows in a dataset.

This analysis can be executed using the run method to compute the approximate number of rows in the dataset.

The user can request additional information about the computation with run(rich_result=True). In this case, an object of RandomizationInterval will be generated, likely including the precise value of the computation on the data sample.

Parameters
  • dataset (DataSet) – The input dataset.

  • target_relative_error (Optional[float]) – A float value between 0 and 1 indicating the level of relative error that should be targeted for this computation. If specified, the system will attempt to ensure that the absolute value of the relative error between the randomized result and the true count is roughly target_relative_error. If this is not possible due to budget constraints set by the admin, the system will return a randomized result with the smallest randomization effect allowed. If not specified, a default value specified by the admin is used. This can only be used if the admin has turned on the adaptive count feature.

  • max_budget (Optional[float]) – The maximum amount of budget that the system should spend while trying to achieve target_relative_error. If specified, the absolute amount of budget spent will not exceed max_budget. If the user specifies a value greater than the maximum budget for the computation set by the admin, the system will use the admin-set maximum. If the system can achieve target_relative_error while spending less than max_budget, it will do so.

Returns

Analysis object that can be executed using the run method.

Return type

CountAnalysisWithRI

leapyear.analytics.count_distinct(attr, dataset=None, drop_nulls=False, target_relative_error=None, max_budget=None)

Analysis: Count the unique elements of an attribute.

Parameters
  • attr (Union[Attribute, str, Sequence[Union[Attribute, str]]]) – The attribute or attributes to compute the distinct count of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

  • dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

  • drop_nulls (bool) – Remove any records with null. Unique values associated with records containing nulls are not included in the count.

  • target_relative_error (Optional[float]) – A float value between 0 and 1 indicating the level of relative error that should be targeted for this computation. If specified, the system will attempt to ensure that the absolute value of the relative error between the randomized result and the true count is roughly target_relative_error. If this is not possible due to budget constraints set by the admin, the system will return a randomized result with the smallest randomization effect allowed. If not specified, a default value specified by the admin is used. This can only be used if the admin has turned on the adaptive count feature.

  • max_budget (Optional[float]) – The maximum amount of budget that the system should spend while trying to achieve target_relative_error. If specified, the absolute amount of budget spent will not exceed max_budget. If the user specifies a value greater than the maximum budget for the computation set by the admin, the system will use the admin-set maximum. If the system can achieve target_relative_error while spending less than max_budget, it will do so.

Returns

Prepared analysis of the count.

Return type

Analysis

leapyear.analytics.count_distinct_rows(dataset, target_relative_error=None, max_budget=None)

Analysis: Count the number of distinct rows in a dataset.

Parameters
  • dataset (DataSet) – The input dataset.

  • target_relative_error (Optional[float]) – A float value between 0 and 1 indicating the level of relative error that should be targeted for this computation. If specified, the system will attempt to ensure that the absolute value of the relative error between the randomized result and the true count is roughly target_relative_error. If this is not possible due to budget constraints set by the admin, the system will return a randomized result with the smallest randomization effect allowed. If not specified, a default value specified by the admin is used. This can only be used if the admin has turned on the adaptive count feature.

  • max_budget (Optional[float]) – The maximum amount of budget that the system should spend while trying to achieve target_relative_error. If specified, the absolute amount of budget spent will not exceed max_budget. If the user specifies a value greater than the maximum budget for the computation set by the admin, the system will use the admin-set maximum. If the system can achieve target_relative_error while spending less than max_budget, it will do so.

Returns

Analysis for counting the number of distinct rows.

Return type

ScalarAnalysis

leapyear.analytics.mean(attr, dataset=None, drop_nulls=False)

Analysis: Compute the mean of an attribute.

This analysis can be executed using the run method to compute the approximate mean of the attribute.

The user can request additional information about the computation with run(rich_result=True). In this case, the RandomizationInterval object will be generated, likely including the precise value of the computation on the data sample.

Note: If the attribute is nullable, setting drop_nulls=True is necessary for the computation to go through.

Parameters
  • attr (Union[Attribute, str]) – The attribute to compute the mean of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

  • dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

  • drop_nulls (bool) – Whether to allow running on a nullable column, ignoring NULL values.

Returns

Analysis object that can be executed using the run method.

Return type

ScalarAnalysisWithCI

leapyear.analytics.sum(attr, dataset=None, drop_nulls=False)

Analysis: Compute the sum of a numeric attribute.

This analysis can be executed using the run method to compute the approximate sum of the attribute.

The user can request additional information about the computation with run(rich_result=True). In this case, the RandomizationInterval object will be generated, likely including the precise value of the computation on the data sample.

Note: If the attribute is nullable, setting drop_nulls=True is necessary for the computation to go through.

Parameters
  • attr (Union[Attribute, str]) – The attribute to compute the sum of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

  • dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

  • drop_nulls (bool) – Whether to allow running on a nullable column, ignoring NULL values.

Returns

Analysis object that can be executed using the run method.

Return type

ScalarAnalysisWithRI

leapyear.analytics.variance(attr, dataset=None, drop_nulls=False)

Analysis: Compute the variance of an attribute.

This analysis can be executed using the run method to compute the approximate variance of the attribute.

The user can request additional information about the computation with run(rich_result=True). In this case, the RandomizationInterval object will be generated, likely including the precise value of the computation on the data sample.

Note: If the attribute is nullable, setting drop_nulls=True is necessary for the computation to go through.

Parameters
  • attr (Union[Attribute, str]) – The attribute to compute the variance of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

  • dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

  • drop_nulls (bool) – Whether to allow running on a nullable column, ignoring NULL values.

Returns

Analysis object that can be executed using the run method.

Return type

ScalarAnalysisWithCI

leapyear.analytics.min(attr, dataset=None, drop_nulls=False)

Analysis: Compute the minimum value of an attribute.

Parameters
  • attr (Union[Attribute, str]) – The attribute to compute the min of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

  • dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

  • drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the min.

Return type

ScalarAnalysis

Note

The minimum reported is the 1/1000 quantile of the attribute.

When the attribute being analyzed has a very narrow range of possible values, the minimum returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than 0.01, the minimum returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by 10/width, compute the minimum, and rescale the returned value by width/10.

The result of this analysis may be very different from the true minimum of the data sample in the following two scenarios:

1. When the underlying attribute distribution has significant outliers (e.g. a very long tail) - this is because the minimum computed is the 1/1000 quantile of the attribute, and

2. When the public lower bound is very different from the true minimum of the data sample - this is because differential privacy is aiming to minimize the effect of individual records on the output.
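The rescaling workaround for narrow-range attributes is plain arithmetic, sketched below. The quantile here is an ordinary non-private nearest-rank quantile used as a stand-in for the analysis; empirical_quantile and rescaled_statistic are hypothetical names introduced for illustration, and the attribute bounds are assumed to be known.

```python
import math
from typing import Callable, List, Sequence


def empirical_quantile(values: Sequence[float], q: float) -> float:
    """Non-private nearest-rank quantile, standing in for the real analysis."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def rescaled_statistic(
    values: Sequence[float],
    lower: float,
    upper: float,
    statistic: Callable[[List[float]], float],
) -> float:
    """Scale by 10/width, compute the statistic, then rescale the result by width/10."""
    width = upper - lower
    scaled = [v * (10.0 / width) for v in values]
    return statistic(scaled) * (width / 10.0)


# An attribute confined to [0.100, 0.105], i.e. a width below 0.01.
values = [0.100 + 0.005 * i / 1999 for i in range(2000)]
approx_min = rescaled_statistic(
    values, 0.100, 0.105, lambda xs: empirical_quantile(xs, 0.001)
)
print(approx_min)
```

Without randomization the round trip returns the same value as computing the quantile directly; the point of the scaling is only to widen the interval that the differentially private computation works on.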

leapyear.analytics.max(attr, dataset=None, drop_nulls=False)

Analysis: Compute the maximum value of an attribute.

Parameters
  • attr (Union[Attribute, str]) – The attribute to compute the max of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

  • dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

  • drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the max.

Return type

Analysis

Note

The maximum reported is the 999/1000 quantile of the attribute.

When the attribute being analyzed has a very narrow range of possible values, the maximum returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than 0.01, the maximum returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by 10/width, compute the maximum, and rescale the returned value by width/10.

The result of this analysis may be very different from the true maximum of the data sample in the following two scenarios:

1. When the underlying attribute distribution has significant outliers (e.g. a very long tail) - this is because the maximum computed is the 999/1000 quantile of the attribute, and

2. When the public upper bound is very different from the true maximum of the data sample - this is because differential privacy is aiming to minimize the effect of individual records on the output.

leapyear.analytics.median(attr, dataset=None, drop_nulls=False)

Analysis: Compute the median value of an attribute.

Parameters
  • attr (Union[Attribute, str]) – The attribute to compute the median of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

  • dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

  • drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the median.

Return type

Analysis

Note

When the attribute being analyzed has a very narrow range of possible values, the median returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than 0.01, the median returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by 10/width, compute the median, and rescale the returned value by width/10.

leapyear.analytics.quantile(q, attr, dataset=None, drop_nulls=False)

Analysis: Compute a certain quantile q of an attribute.

Parameters
  • q (float) – Quantile to compute, which must be between 0 and 1 inclusive.

  • attr (Union[Attribute, str]) – The attribute to compute the quantile of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

  • dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

  • drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the quantile.

Return type

Analysis

Note

When the attribute being analyzed has a very narrow range of possible values, the quantile returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than 0.01, the quantile returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by 10/width, compute the quantile, and rescale the returned value by width/10.

leapyear.analytics.skewness(attr, dataset=None, drop_nulls=False)

Analysis: Compute the skewness of an attribute.

Parameters
  • attr (Union[Attribute, str]) – The attribute to compute the skewness of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

  • dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

  • drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the skewness.

Return type

Analysis

leapyear.analytics.kurtosis(attr, dataset=None, drop_nulls=False)

Analysis: Compute the excess kurtosis of an attribute.

Parameters
  • attr (Union[Attribute, str]) – The attribute to compute the kurtosis of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

  • dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

  • drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the kurtosis.

Return type

Analysis

leapyear.analytics.iqr(attr, dataset=None, drop_nulls=False)

Analysis: Compute the interquartile range of an attribute.

Parameters
  • attr (Union[Attribute, str]) – The attribute to compute the interquartile range of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

  • dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

  • drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the iqr.

Return type

Analysis

leapyear.analytics.histogram(attr, dataset=None, bins=10, interval=None)

Analysis: Compute the histogram of the attribute in the dataset.

Parameters
  • attr (Union[Attribute, str]) – The attribute to compute the histogram of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

  • dataset (Optional[DataSet]) – The dataset to use when attr is a string.

  • bins (int) – Number of bins between the bounds. (default=10)

  • interval (Optional[Tuple[float, float]]) – The lower and upper bound of the histogram. Defaults to attribute bounds if None.

Returns

Prepared analysis of the histogram.

Return type

Analysis
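How bins and interval interact can be sketched with a non-private, pure-Python histogram. The sketch assumes equal-width bins and assumes that values outside interval are ignored; neither detail is confirmed by this reference, and bin_counts is a hypothetical helper rather than a LeapYear function.

```python
from typing import List, Sequence, Tuple


def bin_counts(
    values: Sequence[float], bins: int, interval: Tuple[float, float]
) -> List[int]:
    """Count values in `bins` equal-width bins spanning `interval` (no randomization)."""
    lower, upper = interval
    width = (upper - lower) / bins
    counts = [0] * bins
    for v in values:
        if v < lower or v > upper:
            continue  # assumption: out-of-interval values are dropped
        # A value equal to the upper bound is placed in the last bin.
        index = min(int((v - lower) / width), bins - 1)
        counts[index] += 1
    return counts


print(bin_counts([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], bins=5, interval=(0, 10)))
```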

leapyear.analytics.histogram2d(x_attr, y_attr, dataset=None, x_bins=10, y_bins=10, x_range=None, y_range=None)

Analysis: Compute the 2D histogram of two attributes in the dataset.

Parameters
  • x_attr (Union[Attribute, str]) – The attribute to use to compute the first dimension of the histogram. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

  • y_attr (Union[Attribute, str]) – The attribute to use to compute the second dimension of the histogram. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

  • dataset (Optional[DataSet]) – The dataset to use when x_attr or y_attr are strings.

  • x_bins (int) – Number of bins between the bounds in the first attribute.

  • y_bins (int) – Number of bins between the bounds in the second attribute.

  • x_range (Optional[Tuple[float, float]]) – The lower and upper bound of the first attribute for the histogram.

  • y_range (Optional[Tuple[float, float]]) – The lower and upper bound of the second attribute for the histogram.

Returns

Prepared analysis of the histogram.

Return type

Analysis

leapyear.analytics.correlation_matrix(xs, dataset, *, center=True, scale=True, **kwargs)

Analysis: Compute the correlation matrix of the set of attributes.

NOTE: This analysis does not require run().

Parameters
  • xs (Sequence[str]) – A list of attribute names to compute correlation matrix for.

  • dataset (DataSet) – The DataSet containing these attributes.

  • center (bool) – Whether to center the columns before computing correlation matrix. If False, proceed assuming the columns are already centered.

  • scale (bool) – Whether to divide covariance matrix by number of rows. If False, do not divide.

  • max_timeout_sec – Specifies the maximum amount of time (in seconds) the user is willing to wait for a response. If set to None, this function will poll the server indefinitely. If it is run with scale or center set to True, the timeout will be multiplied. Defaults to waiting forever.

Returns

The correlation matrix.

Return type

np.ndarray

leapyear.analytics.covariance_matrix(xs, dataset, *, center=True, scale=True, **kwargs)

Analysis: Compute the covariance matrix of the set of attributes.

NOTE: This analysis does not require run().

Parameters
  • xs (Sequence[str]) – A list of attribute names that are the features.

  • dataset (DataSet) – The DataSet of the attributes.

  • center (bool) – Whether to center the columns before computing the covariance matrix. If False, assume the columns are centered.

  • scale (bool) – Whether to divide the matrix by number of rows. If False, do not divide.

  • max_timeout_sec – Specifies the maximum amount of time (in seconds) the user is willing to wait for a response. If set to None, this function will poll the server indefinitely. If it is run with scale or center set to True, the timeout will be multiplied. Defaults to waiting forever.

Returns

The covariance matrix.

Return type

np.ndarray
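The effect of the center and scale flags can be illustrated with a non-private, pure-Python computation. This is a sketch of the arithmetic described above (subtract column means, then divide by the number of rows), not LeapYear's differentially private implementation; covariance_sketch is a hypothetical name.

```python
from typing import List, Sequence


def covariance_sketch(
    rows: Sequence[Sequence[float]], center: bool = True, scale: bool = True
) -> List[List[float]]:
    """Compute X^T X, optionally centering the columns and dividing by the row count."""
    n = len(rows)
    d = len(rows[0])
    data = [list(r) for r in rows]
    if center:
        means = [sum(r[j] for r in data) / n for j in range(d)]
        data = [[r[j] - means[j] for j in range(d)] for r in data]
    matrix = [[sum(r[i] * r[j] for r in data) for j in range(d)] for i in range(d)]
    if scale:
        matrix = [[value / n for value in row] for row in matrix]
    return matrix


rows = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
print(covariance_sketch(rows))
print(covariance_sketch(rows, center=False, scale=False))
```

With center=False and scale=False the result is simply the matrix of raw cross-products, which is only a covariance if the columns were already centered and the caller divides by the row count afterwards.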

leapyear.analytics.describe(dataset, attributes=None)

Describe the columns of the dataset for use in data exploration.

The describe function provides a way for an analyst to perform initial rough data exploration on a dataset. To get more accurate statistics, the individual functions mean(), count(), et cetera, are recommended. This function does not use the analysis cache of the other statistics functions.

Numeric columns are described by their count, mean, standard deviation, minimum, maximum and the quartiles. Categorical columns (factors and booleans) are described by their count, distinct count and frequency of the most frequent element.

Parameters
  • dataset (DataSet) – The DataSet to be described

  • attributes (Union[None, Attribute, str, Sequence[Union[Attribute, str]]]) – The attributes to describe. If a value is not provided, or None, describe all attributes.

Returns

Prepared analysis for describing the dataset. Execute the analysis using the run() method.

Return type

DescribeAnalysis

leapyear.analytics.groupby_agg_view(dataset, attrs, agg_attr=None, agg_type=<GroupByAggType.COUNT: 1>, *, max_groupby_agg_keys=100000000, size_threshold=None, agg_attr_and_type=None)

Compute aggregate statistic within each group and output aggregate results.

Only groups with estimated size larger than minimum_dataset_size will be returned. This parameter can be set in run.

Parameters
  • dataset (DataSet) – The DataSet to perform groupby and aggregation on

  • attrs (Sequence[Union[Attribute, str]]) – List of attributes to perform groupby.

  • agg_attr (Optional[str]) – Compute aggregate statistics on this column within each group

  • agg_type (Union[GroupByAggType, str]) – Aggregate type. ‘count’, ‘mean’ or ‘sum’.

  • max_groupby_agg_keys (int) – This value prevents submitting computations that have a very large number of groupby keys. By default, it raises GroupbyAggTooManyKeysError if the number of groups exceeds 100000000.

  • size_threshold (Optional[int]) – Deprecated: see minimum_dataset_size in the run method.

  • agg_attr_and_type (Union[Tuple[Union[GroupByAggType, str], Optional[str]], Sequence[Tuple[Union[GroupByAggType, str], Optional[str]]], None]) – List of tuples (agg_type, agg_attr). Compute aggregate statistics defined by the agg_type on the column within each group. agg_type can be ‘count’, ‘mean’ or ‘sum’.

Returns

Analysis object that can be executed using the run method to return aggregation results. The results can be accessed as a pandas dataframe using .to_dataframe().

Return type

GroupbyAggAnalysis

Note: privacy exposure estimate for this analysis is not supported.

Example

For each age group and gender, compute the mean income.

>>> groupby_agg_view(ds, ["AGE", "GENDER"], "INCOME", "mean").run(minimum_dataset_size=1000)

For each week, compute the mean and total transaction amount.

>>> groupby_agg_view(ds, ["WEEK"], agg_attr_and_type=[("mean", "AMOUNT"), ("sum", "AMOUNT")]).run()

Look at Randomization Intervals for each group (only for ‘count’ and ‘sum’).

>>> rr = groupby_agg_view(ds, ["WEEK"], "AMOUNT", "sum").run(rich_result=True)
>>> ri_dict = rr.metadata
>>> ri_dict
{
    (1, ): RandomizationInterval(...),
    (2, ): RandomizationInterval(...),
    ...
}
>>> ri_dict[(1, )]
RandomizationInterval(...)

Look at Randomization Interval for multiple aggregate results.

>>> rr = groupby_agg_view(ds, ["YEAR", "WEEK"], agg_attr_and_type=[("mean", "AMOUNT"), ("sum", "AMOUNT")]).run(rich_result=True)
>>> ri_dict = rr.metadata
>>> ri_dict[(2020, 1)][0]
RandomizationInterval(...)
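The group-then-filter behaviour described for groupby_agg_view can be sketched without any randomization. The function below is a hypothetical stand-in written in plain Python: it groups rows by the given keys, computes an exact mean per group, and keeps only groups whose size is larger than minimum_dataset_size. The real analysis works with estimated sizes and randomized aggregates.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def groupby_mean(
    rows: Sequence[dict],
    keys: Sequence[str],
    agg_attr: str,
    minimum_dataset_size: int,
) -> Dict[Tuple, float]:
    """Exact per-group mean, dropping groups that are not larger than the threshold."""
    groups: Dict[Tuple, List[float]] = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row[agg_attr])
    return {
        key: sum(values) / len(values)
        for key, values in groups.items()
        if len(values) > minimum_dataset_size
    }


rows = [
    {"WEEK": 1, "AMOUNT": 10.0},
    {"WEEK": 1, "AMOUNT": 20.0},
    {"WEEK": 1, "AMOUNT": 30.0},
    {"WEEK": 2, "AMOUNT": 99.0},
]
print(groupby_mean(rows, ["WEEK"], "AMOUNT", minimum_dataset_size=2))
```

Keys are tuples even for a single group-by attribute, mirroring the (1, ) style keys shown in the metadata example above.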

Machine Learning

Unsupervised learning

leapyear.analytics.kmeans(xs, dataset, n_iters=10, n_clusters=3)

Analysis: K-means clustering.

Identifies centers of clusters for a set of data points, by

  1. Randomly initializing a chosen number of cluster centers (centroids) in the feature space

  2. Associating each data point with the nearest centroid

  3. Iteratively adjusting centroids to locations based on differentially private computation of the mean for each feature
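The three steps above can be sketched in plain Python. This is a non-private illustration only: step 3 uses an exact mean where LeapYear uses a differentially private one, centroids are initialized uniformly inside the bounding box of the data (an assumption), and kmeans_sketch is a hypothetical name.

```python
import random
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_sketch(
    points: Sequence[Sequence[float]],
    n_clusters: int = 3,
    n_iters: int = 10,
    seed: int = 0,
) -> List[List[float]]:
    rng = random.Random(seed)
    dims = len(points[0])
    lows = [min(p[j] for p in points) for j in range(dims)]
    highs = [max(p[j] for p in points) for j in range(dims)]
    # Step 1: random initial centroids inside the bounding box of the data.
    centroids = [
        [rng.uniform(lows[j], highs[j]) for j in range(dims)]
        for _ in range(n_clusters)
    ]
    for _ in range(n_iters):
        # Step 2: associate each point with its nearest centroid.
        clusters: List[List[Sequence[float]]] = [[] for _ in range(n_clusters)]
        for p in points:
            nearest = min(
                range(n_clusters), key=lambda k: squared_distance(p, centroids[k])
            )
            clusters[nearest].append(p)
        # Step 3: move each centroid to the per-feature mean of its members
        # (exact mean here; an empty cluster keeps its previous centroid).
        for k, members in enumerate(clusters):
            if members:
                centroids[k] = [
                    sum(p[j] for p in members) / len(members) for j in range(dims)
                ]
    return centroids


points = [(0.0, 0.0), (0.5, 0.2), (9.8, 10.0), (10.0, 9.5)]
print(kmeans_sketch(points, n_clusters=2, n_iters=10))
```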

Parameters
  • xs (List[str]) – A list of attribute names that are the features.

  • dataset (DataSet) – The DataSet of the attributes.

  • n_iters (int) – Number of iterations to run k-means for

  • n_clusters (int) – Number of clusters to generate

Returns

Analysis object that can be executed using the run() method. Once executed, it outputs clustering analysis results, such as centroids.

Return type

ClusteringAnalysis

leapyear.analytics.eval_kmeans(centroids, xs, dataset)

Analysis: Evaluate the K-means model.

Evaluate the clustering model by computing the Normalized Intra Cluster Variance (NICV).

Parameters
  • centroids (ClusterModel) – The model (generated using kmeans) to evaluate

  • xs (List[str]) – A list of attribute names that are the features.

  • dataset (DataSet) – The DataSet of the attributes.

Returns

Analysis representing evaluation of a clustering model. It can be executed using the run() method to output the value of the evaluation metric.

Return type

ScalarAnalysis
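This reference does not spell out the NICV formula. A minimal sketch, assuming the commonly used definition (the mean, over all points, of the squared distance from each point to its nearest centroid), is shown below. The function name nicv is hypothetical, the computation is exact rather than differentially private, and LeapYear's normalization may differ.

```python
from typing import Sequence


def nicv(
    points: Sequence[Sequence[float]], centroids: Sequence[Sequence[float]]
) -> float:
    """Mean squared distance from each point to its nearest centroid (assumed definition)."""
    total = 0.0
    for p in points:
        total += min(
            sum((x - c) ** 2 for x, c in zip(p, centroid)) for centroid in centroids
        )
    return total / len(points)


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
print(nicv(points, centroids=[(1.0, 0.0), (11.0, 0.0)]))
```

Under this definition, lower values indicate tighter clusters around the given centroids.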

leapyear.analytics.pca(xs, dataset, **kwargs)

Principal Component Analysis.

Compute the Principal Component Analysis (PCA) of the set of attributes using a differentially private algorithm.

NOTE: This analysis does not require run().

Parameters
  • xs (List[str]) – A list of attribute names representing features to be considered for this analysis.

  • dataset (DataSet) – DataSet that includes these attributes.

  • max_timeout_sec – Specifies the maximum amount of time (in seconds) the user is willing to wait for a response. If set to None, this function will poll the server indefinitely. If it is run with scale or center set to True, the timeout will be multiplied. Defaults to waiting forever.

Return type

Tuple[ndarray, ndarray]

Returns

  • explained_variances – Variance explained by each of the principal components - in other words, variance of each principal component coordinate when considered as feature on the input dataset.

  • pca_matrix – Transformation matrix that can be used to translate original features to principal component coordinates. If all principal components are included, this becomes a square matrix corresponding to an orthogonal transformation (e.g. reflection).

    This matrix can be used to generate principal component features using leapyear.dataset.DataSet.transform() operation, as in:

    tfds = ds.transform(x_vars, pca_matrix, 'pca')

    NOTE: Signs may not match PCA transformation matrix computed by scikit-learn.

Supervised learning

leapyear.analytics.basic_linreg(xs, y, dataset, *, affine=True, l1reg=0.0, l2reg=1.0, parameter_bounds=None)

Analysis: Linear regression.

Implements a differentially private algorithm to represent outcome (target) variable as a linear combination of selected features.

Note

To help ensure that the differentially private training process can effectively optimize regression coefficients, it’s important to re-scale features (both dependent/target and independent/explanatory) to a similar domain (e.g. [0,1]). This can be done using leapyear.feature.BoundsScaler and will help ensure that the domain being searched for the coefficient will include the optimal model. See LeapYear guides for this and other recommendations on training accurate regressions.
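The kind of rescaling recommended here is ordinary min-max scaling against known bounds, sketched below in plain Python. This does not show the interface of leapyear.feature.BoundsScaler; scale_to_unit_interval is a hypothetical helper that only illustrates the arithmetic.

```python
from typing import List, Sequence


def scale_to_unit_interval(
    values: Sequence[float], lower: float, upper: float
) -> List[float]:
    """Map values from the known bounds [lower, upper] onto [0, 1]."""
    if upper <= lower:
        raise ValueError("upper bound must be greater than lower bound")
    return [(v - lower) / (upper - lower) for v in values]


# An income-like feature with public bounds of 0 and 100000.
print(scale_to_unit_interval([0.0, 25000.0, 100000.0], 0.0, 100000.0))
```

Applying the same transformation to the target and to every feature puts them on a comparable domain, which is what the default parameter bounds of -10.0 to 10.0 implicitly assume.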

Note

Differentially private regressions aim to optimize models for predictive tasks, while protecting sensitive information from being learned from the trained model. They are not guaranteed to result in coefficients close to those that a non-differentially private algorithm would learn. In other words, while regressions trained with differential privacy may excel at predictive tasks, keep in mind that these were not designed for inference.

Please see the guides for more details on using regressions in LeapYear.

Parameters
  • xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.

  • y (Union[Attribute, str]) – The attribute name that is the outcome.

  • dataset (DataSet) – The DataSet of the attributes.

  • affine (bool) – If True, fit an intercept term.

  • l1reg (float) – The L1 regularization. Default value: 0.0.

  • l2reg (float) – The L2 regularization. Default value: 1.0. Must be at least 0.0001 to limit the randomization effect for models optimized via objective perturbation.

  • parameter_bounds (Optional[List[Tuple[float, float]]]) – Restriction on the model parameters, including the intercept. Required for differential privacy. Default value: -10.0 to 10.0 for each parameter

Returns

Analysis representing the regression problem. It can be executed using the run() method to output the calibrated model.

Return type

GenLinAnalysis

leapyear.analytics.generalized_linreg(xs, y, dataset, *, affine=True, l2reg=1.0, weight=None, offset=None, max_iters=25, family='gaussian', link='identity', link_power=0, variance_power=1)

Analysis: Generalized linear regression.

Implements a differentially private algorithm to represent outcome (target) variable as a linear combination of selected features. This computation is based on the iterative weighted least squares algorithm. Trains using the “glm” algorithm.

Available generalizations include:

  • offset of outputs based on a pre-existing model, which enables modeling of the residual

  • use of alternative link functions applied to the linear combination of features

  • application of regularization and weights during model optimization.

Note

Differentially private regressions aim to optimize models for predictive tasks, while protecting sensitive information from being learned from the trained model. They are not guaranteed to result in coefficients close to those that a non-differentially private algorithm would learn. In other words, while regressions trained with differential privacy may excel at predictive tasks, keep in mind that these were not designed for inference.

Please see the guides for more details on using regressions in LeapYear.

Parameters
  • xs (List[Union[Attribute, str]]) – A list of Attributes or attribute names to be used as features.

  • y (Union[Attribute, str]) – The Attribute or attribute name of the target.

  • dataset (DataSet) – The DataSet of the attributes.

  • affine (bool) – True if the algorithm should fit an intercept term.

  • l2reg (float) – The L2 regularization parameter. Must be non-negative.

  • weight (Union[Attribute, str, None]) – Optional column used to weight each sample. Implies generalized regression.

  • offset (Union[Attribute, str, None]) – Optional column for offset in offset regression. Implies generalized regression.

  • max_iters (Optional[int]) – Optional maximum number of iterations for fitting the regression. Note that regardless of this setting, the system will often stop before reaching max_iters, e.g. after a single iteration. In such cases, a higher value for max_iters may lead to less privacy allocated to each iteration and, ultimately, a higher randomization effect.

  • family (Optional[str]) – Optional distribution of the label. Implies generalized regression. Possible values here are ‘gaussian’ (the default), ‘poisson’, ‘gamma’ and ‘tweedie’.

  • link (Optional[str]) – Optional link function between mean of label distribution and prediction. Implies generalized regression. Possible values depend on family: ‘gaussian’ supports only ‘identity’ (default), ‘log’ and ‘inverse’; ‘poisson’ supports only ‘log’ (default), ‘identity’ and ‘sqrt’; ‘gamma’ supports only ‘inverse’ (default), ‘identity’ and ‘log’. There is no link function for the ‘tweedie’ family; use the variance_power and link_power parameters instead.

  • link_power (Optional[int]) – For the ‘tweedie’ distribution only, the exponent of the link function. Default value is 0, which is equivalent to ‘identity’ link.

  • variance_power (Optional[int]) – For the ‘tweedie’ distribution only, the exponent of the variance. Default value is 1, which is equivalent to ‘gaussian’ family.

Returns

Analysis of the regression problem. It can be executed using the run() method to output the calibrated model.

Return type

GenLinAnalysis
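
To make the role of the link parameter concrete, here is a minimal sketch of how a fitted generalized linear model turns a linear combination of features into a predicted mean. It is generic GLM arithmetic, not LeapYear code; the coefficients and the predict_mean helper are hypothetical.

```python
import math

# Inverse link functions: map the linear predictor eta to the predicted mean.
INVERSE_LINKS = {
    "identity": lambda eta: eta,
    "log": lambda eta: math.exp(eta),
    "inverse": lambda eta: 1.0 / eta,
    "sqrt": lambda eta: eta ** 2,
}


def predict_mean(weights, intercept, features, link="identity"):
    """Compute the mean prediction of a GLM for a single row of features."""
    eta = intercept + sum(w * x for w, x in zip(weights, features))
    return INVERSE_LINKS[link](eta)


# Hypothetical coefficients and one feature row; here eta = 1 + 1 - 1 = 1.
mu_identity = predict_mean([0.5, -0.25], 1.0, [2.0, 4.0], link="identity")
mu_log = predict_mean([0.5, -0.25], 1.0, [2.0, 4.0], link="log")
```

The same linear predictor yields different predictions under different links, which is why the supported links depend on the chosen family.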

leapyear.analytics.logreg(xs, y, dataset, affine=True, l1reg=0.0, l2reg=1.0)

Analysis: Logistic regression.

Implements a differentially private algorithm to represent the outcome (target) variable as a logit-transformation of a linear combination of selected features. Trains using the “basic” algorithm.

Available generalizations include:

  • regularization applied during model optimization

Note

To help ensure that the differentially private training process can effectively optimize regression coefficients, it’s important to re-scale features (both dependent/target and independent/explanatory) to a similar domain (e.g. [0,1]). This can be done using leapyear.feature.BoundsScaler and will help ensure that the domain being searched for the coefficient will include the optimal model. See LeapYear guides for this and other recommendations on training accurate regressions.

Note

Differentially private regressions aim to optimize models for predictive tasks, while protecting sensitive information from being learned from the trained model. They are not guaranteed to result in coefficients close to those that a non-differentially private algorithm would learn. In other words, while regressions trained with differential privacy may excel at predictive tasks, keep in mind that these were not designed for inference.

Please see the guides for more details on using regressions in LeapYear.

Parameters
  • xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.

  • y (Union[Attribute, str]) – The attribute name that is the outcome.

  • dataset (DataSet) – The DataSet of the attributes.

  • affine (bool) – If True, fit an intercept term.

  • l1reg (float) – The L1 regularization. Default value: 0.0.

  • l2reg (float) – The L2 regularization. Default value: 1.0.

Returns

Analysis training the logistic regression model. It can be executed using the run() method to output the calibrated model.

Return type

GenLinAnalysis

leapyear.analytics.generalized_logreg(xs, y, dataset, *, affine=True, l1reg=1.0, l2reg=1.0, weight=None, offset=None, max_iters=25, link='logit')

Analysis: Generalized logistic regression.

Implements a differentially private algorithm to represent the outcome (target) variable as a logit-transformation of a linear combination of selected features. This computation is based on the iteratively reweighted least squares algorithm. Trains using the “glm” algorithm.

Available generalizations include:

  • offset of outputs based on a pre-existing model, which enables modeling of the residual

  • use of alternative link functions applied to the linear combination of features

  • application of regularization and weights during model optimization.

Note

Differentially private regressions aim to optimize models for predictive tasks, while protecting sensitive information from being learned from the trained model. They are not guaranteed to result in coefficients close to those that a non-differentially private algorithm would learn. In other words, while regressions trained with differential privacy may excel at predictive tasks, keep in mind that these were not designed for inference.

Please see the guides for more details on using regressions in LeapYear.

Parameters
  • xs (List[Union[Attribute, str]]) – A list of Attributes or attribute names to be used as features.

  • y (Union[Attribute, str]) – The Attribute or attribute name of the target.

  • dataset (DataSet) – The DataSet of the attributes.

  • affine (bool) – True if the algorithm should fit an intercept term.

  • l2reg (float) – The L2 regularization parameter. Must be non-negative.

  • weight (Union[Attribute, str, None]) – Optional column used to weight each sample. Implies generalized regression.

  • offset (Union[Attribute, str, None]) – Optional column for offset in offset regression. Implies generalized regression.

  • max_iters (Optional[int]) – Optional maximum number of iterations for fitting the regression. Note that regardless of this setting, the system will often stop before reaching max_iters, e.g. after a single iteration. In such cases, a higher value for max_iters may lead to less privacy allocated to each iteration and, ultimately, a higher randomization effect.

  • link (Optional[str]) – Optional link function between mean of label distribution and prediction. Implies generalized regression. Possible values are ‘logit’ (default), ‘probit’ and ‘cloglog’.

Returns

Analysis of the regression problem. It can be executed using the run() method to output the calibrated model.

Return type

GenLinAnalysis
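
The three link options differ in how they map the linear combination of features to a probability. The sketch below uses the standard textbook definitions of the inverse links; it is illustrative only and does not reproduce LeapYear internals.

```python
import math


def inverse_logit(eta):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-eta))


def inverse_probit(eta):
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def inverse_cloglog(eta):
    """Inverse of the complementary log-log link."""
    return 1.0 - math.exp(-math.exp(eta))


# Probability assigned to a row whose linear predictor is exactly zero.
probabilities_at_zero = {
    "logit": inverse_logit(0.0),
    "probit": inverse_probit(0.0),
    "cloglog": inverse_cloglog(0.0),
}
```

The logit and probit links are symmetric around a probability of 0.5, while cloglog is not, which can matter when the positive class is rare.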

leapyear.analytics.gradient_boosted_tree_classifier(xs, y, dataset, max_depth=3, max_iters=5, max_bins=32)

Analysis: Gradient boosted tree classifier.

This analysis trains a randomized variant of a gradient boosted tree classifier to predict a BOOLEAN outcome (target).

The algorithm works by iteratively training individual decision trees to predict a “residual” of the model built so far, and then integrating each newly built decision tree into the ensemble model to better predict the probability of the positive label.

Weights are used at different stages:

  • during training of individual decision trees, to focus attention on the areas where the model consistently underperforms, and

  • when combining individual decision trees to predict the probability of the positive label.

A calibrated level of randomization is applied to individual leaves of the decision trees to help protect the privacy of the individual records used for model training.
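
The iterative scheme described above can be sketched in a few lines of plain Python. This is a simplified, non-private illustration under several assumptions: it uses depth-1 trees (stumps) on a single feature, omits sample weights and the randomization of leaves, and introduces a hypothetical learning_rate parameter. It is not the algorithm LeapYear runs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree on one feature by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def boost(xs, labels, max_iters=5, learning_rate=1.0):
    """Train an additive ensemble of stumps on the residuals of the model so far."""
    scores = [0.0] * len(xs)
    stumps = []
    for _ in range(max_iters):
        # Residual of the current ensemble: label minus predicted probability.
        residuals = [y - sigmoid(s) for y, s in zip(labels, scores)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        scores = [
            s + learning_rate * (left_value if x <= threshold else right_value)
            for x, s in zip(xs, scores)
        ]
    return stumps


def predict_proba(stumps, x, learning_rate=1.0):
    """Probability of the positive label from the summed stump outputs."""
    score = sum(learning_rate * (lv if x <= t else rv) for t, lv, rv in stumps)
    return sigmoid(score)


xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]
ensemble = boost(xs, labels, max_iters=5)
```

Each iteration adds one tree, so max_iters bounds the ensemble size, matching the description of the max_iters parameter.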

Parameters
  • xs (List[Union[Attribute, str]]) – A list of attributes or attribute names that are used as explanatory features for the analysis. Each attribute must be either BOOL, INT, REAL or FACTOR. Nullable types are not supported and must be converted to non-nullable - e.g. via coalesce.

  • y (Union[Attribute, str]) – The attribute or attribute name that is used as an outcome (target) of the classification model. Must be BOOLEAN type, as only binary classification models are supported. Nullable types are not supported and must be converted to non-nullable - e.g. via coalesce.

  • dataset (DataSet) – The DataSet containing both explanatory features and outcome attributes.

  • max_depth (int) – The maximum depth (or height) of any tree in the ensemble produced by the algorithm. Default: 3

  • max_iters (int) – The maximum number of iterations of the algorithm. This corresponds to the maximum number of individual decision trees in the ensemble. Default: 5

  • max_bins (int) –

    The maximum number of bins for features used in constructing trees. Default: 32

    Note

    Maximum number of bins should be set to no less than the number of distinct possible values of the FACTOR attributes used as explanatory features.

Returns

Analysis that will train the gradient boosted tree classifier. It can be executed using the run() method.

Return type

GradientBoostedTreeClassifierModelAnalysis

leapyear.analytics.random_forest(xs, y, dataset, n_trees=100, height=3)

Analysis: Random Forest Classifier.

Generate a random forest model to predict the probability associated with each target class.

Random forests combine many decision trees in order to reduce the risk of overfitting.

Each decision tree is trained on a random subset of observations and is limited to the prescribed height.

Individual node split decisions are made to maximize the split value (or gain), with the variation that a differentially private algorithm is used to count the number of observations belonging to each target class on both sides of the split.

Specifically, the split value (or gain) is defined as the reduction in the combined Gini impurity measure associated with introducing the split for a given parent node. Here:

  • The Gini impurity for any given node (parent or child) is calculated based on the distribution of observations within the node across the different outcome (target) classes.

  • To compute the combined impurity of a pair of nodes, the individual impurities of the two child nodes are averaged in proportion to their share of observations.

Categorical features are typically handled by evaluating various splits corresponding to random subsets of the available categories.
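
The split value defined above can be written out directly. The sketch below uses exact class counts; in LeapYear the counts are produced by a differentially private algorithm, so actual gains are randomized. The function names are illustrative.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class shares within a node."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def split_gain(parent, left, right):
    """Reduction in impurity from splitting parent into left and right children."""
    total = len(parent)
    combined = (len(left) / total) * gini_impurity(left)
    combined += (len(right) / total) * gini_impurity(right)
    return gini_impurity(parent) - combined


parent = ["a", "a", "b", "b"]
gain_perfect = split_gain(parent, ["a", "a"], ["b", "b"])  # children are pure
gain_useless = split_gain(parent, ["a", "b"], ["a", "b"])  # children mirror the parent
```

A split that separates the classes completely recovers the full parent impurity as gain, while a split that leaves the class mix unchanged has zero gain.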

Parameters
  • xs (List[Union[Attribute, str]]) – A list of attribute names that are used as features for explanatory analysis.

  • y (Union[Attribute, str]) – The attribute name that is the outcome (target).

  • dataset (DataSet) – The DataSet containing both explanatory features and outcome attributes.

  • n_trees (int) – The number of trees to use in the random forest. Default: 100

  • height (int) – The maximum height of the trees. Default: 3

Returns

Analysis training the random forest model. It can be executed using the run() method to output the analysis results which include the calibrated random forest model, feature importance statistics, etc.

Return type

ForestModelClassifierAnalysis

leapyear.analytics.regression_trees(xs, y, dataset, n_trees=100, height=3)

Analysis: Random Forest Regressor (regression trees).

Generate a regression trees model to predict the value of the target variable.

Regression trees are built similarly to random forests, but instead of predicting the probability that the target variable takes a certain categorical value (i.e., classification), they predict a real value of the target variable (i.e., regression).

The impurity metric in this case is the variance of the target variable for the datapoints that fall into the current node’s partition.
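
A minimal sketch of variance as an impurity measure follows. It assumes, by analogy with the random forest classifier, that children are combined in proportion to their share of observations; it uses exact values with no privacy randomization, and the function names are illustrative.

```python
def variance(values):
    """Population variance of the target values that fall into a node."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_reduction(parent, left, right):
    """Drop in size-weighted variance achieved by a candidate split."""
    total = len(parent)
    weighted = (len(left) / total) * variance(left)
    weighted += (len(right) / total) * variance(right)
    return variance(parent) - weighted


targets = [1.0, 1.0, 5.0, 5.0]
reduction = variance_reduction(targets, [1.0, 1.0], [5.0, 5.0])
```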

Parameters
  • xs (List[Union[Attribute, str]]) – A list of attribute names that are used as features for explanatory analysis.

  • y (Union[Attribute, str]) – The attribute name that is the outcome (target).

  • dataset (DataSet) – The DataSet containing both explanatory features and target attribute.

  • n_trees (int) – The number of trees to use in the random forest. Default: 100.

  • height (int) – The maximum height of the trees. Default: 3.

Returns

Analysis training the regression trees model. It can be executed using the run() method to output the analysis results which include the calibrated random forest model, feature importance statistics, etc.

Return type

ForestModelRegressionAnalysis

leapyear.analytics.eval_linreg(glm, xs, y, dataset, metric='mse')

Analysis: Evaluate a linear regression model.

Parameters
  • glm (GLM) – The model (generated using linreg) to evaluate

  • xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.

  • y (Union[Attribute, str]) – The attribute name that is the outcome.

  • dataset (DataSet) – The DataSet of the attributes.

  • metric (Union[str, Metric]) –

    Linear regression evaluation metric: ‘mse’/‘mean_squared_error’ or ‘mae’/‘mean_absolute_error’.

    Note

    During the calculation of mse and mae metrics, the individual values of absolute error are restricted to be no greater than the length of the interval of possible values of the target attribute, as seen in the dataset schema. For example, if the target attribute contains values in the interval [-50, 50], then the absolute error of any individual prediction used in computing the mean will be no greater than 100.

Returns

Analysis representing evaluation of a regression model. It can be executed using the run() method to output evaluation metric value.

Return type

ScalarAnalysis
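
The restriction on individual errors described in the note on the metric parameter can be illustrated as follows. This sketch assumes that the mse metric squares the capped absolute error; it ignores any randomization added for differential privacy, and the function names are hypothetical.

```python
def clipped_errors(y_true, y_pred, lower, upper):
    """Absolute errors capped at the width of the target's allowed interval."""
    max_error = upper - lower
    return [min(abs(t - p), max_error) for t, p in zip(y_true, y_pred)]


def clipped_mae(y_true, y_pred, lower, upper):
    errors = clipped_errors(y_true, y_pred, lower, upper)
    return sum(errors) / len(errors)


def clipped_mse(y_true, y_pred, lower, upper):
    errors = clipped_errors(y_true, y_pred, lower, upper)
    return sum(e ** 2 for e in errors) / len(errors)


# Target bounded to [-50, 50], so no single absolute error exceeds 100.
y_true = [10.0, -20.0]
y_pred = [12.0, 400.0]  # the second prediction is far outside the target interval
mae = clipped_mae(y_true, y_pred, -50.0, 50.0)
mse = clipped_mse(y_true, y_pred, -50.0, 50.0)
```

Without the cap, the second prediction would contribute an absolute error of 420; with it, the contribution is limited to 100.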

leapyear.analytics.eval_logreg(glm, xs, y, dataset, metric='accuracy')

Analysis: Evaluate a logistic regression model.

Parameters
  • glm (GLM) – The model (generated using logreg) to evaluate

  • xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.

  • y (Union[Attribute, str]) – The attribute name that is the outcome.

  • dataset (DataSet) – The DataSet of the attributes.

  • metric (Union[str, Metric]) – Logistic regression evaluation metric. Examples: ‘accuracy’, ‘logloss’, ‘auroc’, ‘aupr’.

Returns

Analysis representing evaluation of a logistic regression model. It can be executed using the run() method to output evaluation metric value.

Return type

ScalarAnalysis

leapyear.analytics.eval_multinomial_logreg(glm, xs, y, dataset, metric='accuracy')

Analysis: Evaluate a multinomial logistic regression model.

Note

This method has been disabled for now, pending important improvements.

Parameters
  • glm (GLM) – The model (generated using multinomial_regression) to evaluate

  • xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.

  • y (Union[Attribute, str]) – The attribute name that is the outcome.

  • dataset (DataSet) – The DataSet of the attributes.

  • metric (Union[str, Metric]) – Multinomial regression evaluation metric (‘logloss’, ‘accuracy’)

Returns

Analysis representing evaluation of a multinomial regression model. It can be executed using the run() method to output evaluation metric value.

Return type

ScalarAnalysis

Deprecated: This method has been disabled for now, pending important improvements.

leapyear.analytics.eval_gbt_classifier(gbt, xs, y, dataset, metric='accuracy')

Analysis: Evaluate a gradient boosted tree (GBT) classifier model.

Parameters
  • gbt (GradientBoostedTreeClassifier) – The model to evaluate.

  • xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.

  • y (Union[Attribute, str]) – The attribute name that is the outcome.

  • dataset (DataSet) – The DataSet of the attributes.

  • metric (Union[str, Metric]) – GBT evaluation metric. Currently only supports ‘accuracy’.

Returns

Analysis representing evaluation of a GBT classifier model. It can be executed using the run() method to output the value of the evaluation metric.

Return type

ScalarAnalysis

leapyear.analytics.eval_random_forest(rf, xs, y, dataset, metric='accuracy')

Analysis: Evaluate a random forest model.

Parameters
  • rf (RandomForestClassifier) – The model (generated using random_forest) to evaluate

  • xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.

  • y (Union[Attribute, str]) – The attribute name that is the outcome.

  • dataset (DataSet) – The DataSet of the attributes.

  • metric (Union[str, Metric]) – Forest evaluation metric. Examples: ‘logloss’, ‘accuracy’, ‘auroc’, ‘aupr’

Returns

Analysis representing evaluation of a random forest model. It can be executed using the run() method to output evaluation metric value.

Return type

ScalarAnalysis

leapyear.analytics.eval_regression_trees(rf, xs, y, dataset, metric='mse')

Analysis: Evaluate a regression trees model.

Parameters
  • rf (RandomForestClassifier) – The model (generated using regression_trees) to evaluate

  • xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.

  • y (Union[Attribute, str]) – The attribute name that is the outcome.

  • dataset (DataSet) – The DataSet of the attributes.

  • metric (Union[str, Metric]) –

    Model evaluation metric. Examples: ‘mse’/‘mean_squared_error’ or ‘mae’/‘mean_absolute_error’.

    Note

    During the calculation of mse and mae metrics, the individual values of absolute error are restricted to be no greater than the length of the interval of possible values of the target attribute, as seen in the dataset schema. For example, if the target attribute contains values in the interval [-50, 50], then the absolute error of any individual prediction used in computing the mean will be no greater than 100.

Returns

Analysis representing evaluation of a regression trees model. It can be executed using the run() method to output evaluation metric value.

Return type

ScalarAnalysis

leapyear.analytics.roc(model, xs, y, dataset, thresholds=5)

Compute the ConfusionCurves.

For each threshold value, compute the normalized confusion matrix using the model. The confusion matrix contains the true positive rate, the true negative rate, the false positive rate and the false negative rate.
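
As an illustration of what is computed per threshold, the sketch below derives the four rates from model scores and true labels. It assumes the conventional per-class normalization (rates are divided by the number of actual positives or actual negatives) and that a score at or above the threshold counts as a positive prediction; neither detail is confirmed by this reference.

```python
def confusion_rates(scores, labels, threshold):
    """Return (TPR, TNR, FPR, FNR) when scores >= threshold count as positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y and s >= threshold)
    fn = sum(1 for s, y in pairs if y and s < threshold)
    tn = sum(1 for s, y in pairs if not y and s < threshold)
    fp = sum(1 for s, y in pairs if not y and s >= threshold)
    positives = max(tp + fn, 1)  # avoid division by zero when a class is absent
    negatives = max(tn + fp, 1)
    return tp / positives, tn / negatives, fp / negatives, fn / positives


# Hypothetical model scores and true labels.
scores = [0.1, 0.4, 0.6, 0.9]
labels = [False, True, False, True]
curve = {t: confusion_rates(scores, labels, t) for t in (0.25, 0.5, 0.75)}
```

Raising the threshold trades true positives for true negatives, which is the trade-off the confusion curves expose.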

Parameters
  • model (Union[GLM, RandomForestClassifier]) – The model to evaluate the confusion curves on.

  • xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.

  • y (Union[Attribute, str]) – The attribute name that is the outcome.

  • dataset (DataSet) – The DataSet of the attributes.

  • thresholds (Union[int, Sequence[float]]) – If an int, generate approximately that many thresholds (rounded to the closest power of 2) using recursive medians. If a sequence of floats, use the sequence as the thresholds.

Returns

Analysis of the confusion curve, which can be executed using the run() method to output various evaluation metrics.

Return type

ConfusionModelAnalysis

leapyear.analytics.cross_val_score_linreg(xs, y, dataset, *, affine=True, l1reg=1.0, l2reg=1.0, cv=3, metric='mean_squared_error', parameter_bounds=None)

Analysis: Compute the linear regression cross validation score of the set of attributes.

Parameters
  • xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.

  • y (Union[Attribute, str]) – The attribute name that is the outcome.

  • dataset (DataSet) – The DataSet of the attributes.

  • affine (bool) – If True, fit an intercept term.

  • l1reg (float) – The L1 regularization. Default value: 1.0.

  • l2reg (float) – The L2 regularization. Default value: 1.0. Must be at least 0.0001 to limit the randomization effect.

  • cv (int) – Number of folds in k-fold cross validation.

  • metric (Union[str, Metric]) – The metric for evaluating the regression. Examples: ‘mae’, ‘mse’, ‘r2’.

  • parameter_bounds (Optional[List[Tuple[float, float]]]) – Restriction on the model parameters, including the intercept. Required for differential privacy. Default value: -10.0 to 10.0 for each parameter

Returns

Analysis of the cross-validation scores for the regression model. It can be executed using the run() method to generate cross-validation results.

Return type

CrossValidationAnalysis
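
For readers unfamiliar with k-fold cross validation, the following sketch shows how cv folds partition the rows into training and validation sets. How LeapYear assigns rows to folds is not described here; the round-robin assignment below is an assumption made for illustration.

```python
def k_fold_indices(n_rows, cv):
    """Yield (train_indices, validation_indices) for each of the cv folds."""
    if not 2 <= cv <= n_rows:
        raise ValueError("cv must be between 2 and the number of rows")
    # Assign rows to folds round-robin so fold sizes differ by at most one.
    folds = [list(range(start, n_rows, cv)) for start in range(cv)]
    for held_out in range(cv):
        validation = folds[held_out]
        train = [i for fold_id, fold in enumerate(folds) if fold_id != held_out for i in fold]
        yield train, validation


splits = list(k_fold_indices(7, cv=3))
```

Each row is used for validation exactly once and for training in the remaining cv - 1 folds; the reported score aggregates the metric over the folds.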

leapyear.analytics.cross_val_score_logreg(xs, y, dataset, cv=3, affine=True, l1reg=0.1, l2reg=0.1, metric='accuracy')

Analysis: Compute the logistic regression cross validation score of the set of attributes.

Parameters
  • xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.

  • y (Union[Attribute, str]) – The attribute name that is the outcome.

  • dataset (DataSet) – The DataSet of the attributes.

  • affine (bool) – If True, fit an intercept term.

  • l1reg (float) – The L1 regularization. Default value: 0.1.

  • l2reg (float) – The L2 regularization. Default value: 0.1. Must be at least 0.0001 to limit the randomization effect.

  • cv (int) – Number of folds in k-fold cross validation.

  • metric (Union[str, Metric]) – The metric for evaluating the logistic regression. Examples: ‘accuracy’, ‘logloss’, ‘auroc’, ‘aupr’.

Returns

Analysis of the cross-validation scores for the regression model. It can be executed using the run() method to generate cross-validation results.

Return type

CrossValidationAnalysis

leapyear.analytics.cross_val_score_multinomial(xs, y, dataset, cv=3, affine=True, l1reg=0.1, l2reg=0.1, metric='accuracy')

Analysis: Compute the multinomial regression cross validation score of the set of attributes.

Note

This method has been disabled for now, pending important improvements.

Parameters
  • xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.

  • y (Union[Attribute, str]) – The attribute name that is the outcome.

  • dataset (DataSet) – The DataSet of the attributes.

  • affine (bool) – If True, fit an intercept term.

  • l1reg (float) – The L1 regularization. Default value: 0.1.

  • l2reg (float) – The L2 regularization. Default value: 0.1. Must be at least 0.0001 to limit the randomization effect.

  • cv (int) – Number of folds in k-fold cross validation.

  • metric (Union[str, Metric]) – The metric for evaluating a given multinomial regression model. Examples: ‘accuracy’, ‘logloss’.

Returns

Analysis of the cross-validation scores for the regression model. It can be executed using the run() method to generate cross-validation results.

Return type

CrossValidationAnalysis

Deprecated: This method has been disabled for now, pending important improvements.

leapyear.analytics.cross_val_score_random_forest(xs, y, dataset, n_trees=100, height=3, cv=3, metric='mean_squared_error')

Analysis: Compute the random forest cross validation score of the set of attributes.

Parameters
  • xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.

  • y (Union[Attribute, str]) – The attribute name that is the outcome.

  • dataset (DataSet) – The DataSet of the attributes.

  • n_trees (int) – Number of trees.

  • height (int) – Maximum height of trees.

  • cv (int) – Number of folds in k-fold cross validation

  • metric (Union[str, Metric]) – The metric for evaluating the regression

Returns

Analysis of the cross-validation scores for the random forest model. It can be executed using the run() method to generate cross-validation results.

Return type

CrossValidationAnalysis

leapyear.analytics.cross_val_score_regression_trees(xs, y, dataset, n_trees=100, height=3, cv=3, metric='mean_squared_error')

Analysis: Compute the regression trees cross validation score of the set of attributes.

Parameters
  • xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.

  • y (Union[Attribute, str]) – The attribute name that is the outcome.

  • dataset (DataSet) – The DataSet of the attributes.

  • n_trees (int) – Number of trees.

  • height (int) – Maximum height of trees.

  • cv (int) – Number of folds in k-fold cross validation

  • metric (Union[str, Metric]) – The metric for evaluating the regression

Returns

Analysis of the cross-validation scores for the regression trees model. It can be executed using the run() method to generate cross-validation results.

Return type

CrossValidationAnalysis

leapyear.analytics.hyperopt_linreg(xs, y, dataset, *, cv, train_fraction, metric, n_iter=100, l1_bounds=(1e-10, 10000000000.0), l2_bounds=(1e-10, 10000000000.0), fit_intercept=None, parameter_bounds=None)

Analysis: Hyperparameter optimization for linear regression.

Calibrate a linear regression model by optimizing its cross-validation score with respect to model hyperparameters - L_1 and L_2 regularization parameters and presence of intercept.

See below for a pseudo-code of the algorithm:

Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = []
For 1..n_iter
    pick a set of hyperparameters (hp) to test based on cv_history.
    use hp to calibrate a model on each cross-validation set
    evaluate it on corresponding sample set-aside for cross-validation
    compute an average cv score and append it to cv_history.
Pick the hyper parameters with the best cv score.
Train a model using the complete ds_train_val data set.
Evaluate the model on the holdout data set.
Return resulting model and its performance on the holdout set.

Parameters
  • xs (List[Union[Attribute, str]]) – A list of attributes that are the features.

  • y (Union[Attribute, str]) – The target attribute.

  • dataset (DataSet) – The dataset containing the attributes.

  • cv (int) – The number of cross-validation steps to perform for each candidate set of hyperparameters.

  • train_fraction (float) – The fraction of the dataset to set aside for model training and cross-validation – to be split further according to k-fold cross-validation strategy.

  • metric (Union[str, Metric]) – Model performance metric to optimize. Examples: ‘mean_squared_error’, ‘mean_absolute_error’, ‘r2’

  • n_iter (int) – The number of optimization steps. Default: 100

  • l1_bounds (Tuple[float, float]) – Lower and upper bounds for l1 regularization. Default: (1E-10, 1E10)

  • l2_bounds (Tuple[float, float]) – Lower and upper bounds for l2 regularization. Default: (1E-10, 1E10)

  • fit_intercept (Optional[bool]) – If None, search will consider both options.

  • parameter_bounds (Optional[List[Tuple[float, float]]]) – Restriction on the model parameters, including the intercept. Required for differential privacy. Default value: -10.0 to 10.0 for each parameter

Returns

Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including

  1. model calibrated with recommended hyperparameters and

  2. its performance on the holdout dataset.

Return type

HyperOptAnalysis

See also

Paper with hyperparameter optimization algorithm:

http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf
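
The pseudo-code above can be turned into a small runnable sketch. The version below makes several simplifying assumptions that are not part of LeapYear: it tunes only the L2 penalty of a one-feature ridge regression without intercept, it proposes hyperparameters by log-uniform random search instead of using cv_history to guide the choice, and it applies no differential privacy. All function names are hypothetical.

```python
import math
import random


def fit_ridge(rows, l2):
    """Closed-form one-feature ridge regression without intercept."""
    return sum(x * y for x, y in rows) / (sum(x * x for x, _ in rows) + l2)


def mse(weight, rows):
    return sum((y - weight * x) ** 2 for x, y in rows) / len(rows)


def hyperopt_sketch(rows, cv, train_fraction, n_iter, l2_bounds, seed=0):
    rng = random.Random(seed)
    # Split the dataset into ds_train_val / ds_holdout based on train_fraction.
    n_train_val = int(len(rows) * train_fraction)
    train_val, holdout = rows[:n_train_val], rows[n_train_val:]
    # Use k-fold cross validation to build cv pairs (ds_train, ds_val).
    folds = [train_val[start::cv] for start in range(cv)]
    log_low, log_high = math.log10(l2_bounds[0]), math.log10(l2_bounds[1])
    cv_history = []
    for _ in range(n_iter):
        # Random search stands in for the history-guided proposal step.
        l2 = 10 ** rng.uniform(log_low, log_high)
        scores = []
        for held_out in range(cv):
            train = [r for i, fold in enumerate(folds) if i != held_out for r in fold]
            scores.append(mse(fit_ridge(train, l2), folds[held_out]))
        cv_history.append((sum(scores) / cv, l2))
    # Pick the hyperparameters with the best (lowest) average cv score.
    _, best_l2 = min(cv_history)
    # Train on the complete ds_train_val set and evaluate on the holdout set.
    model = fit_ridge(train_val, best_l2)
    return model, best_l2, mse(model, holdout)


# Hypothetical noiseless data following y = 2x.
data = [(float(x), 2.0 * x) for x in range(1, 21)]
model, best_l2, holdout_mse = hyperopt_sketch(
    data, cv=3, train_fraction=0.8, n_iter=20, l2_bounds=(1e-4, 1e2)
)
```

The holdout rows are never used while choosing hyperparameters, so the final score is an estimate of performance on unseen data.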

leapyear.analytics.hyperopt_logreg(xs, y, dataset, cv, train_fraction, metric, n_iter=100, l1_bounds=(1e-10, 10000000000.0), l2_bounds=(1e-10, 10000000000.0), fit_intercept=None)

Analysis: Hyperparameter optimization for logistic regression.

Calibrate a logistic regression model by optimizing its cross-validation score with respect to model hyperparameters - L_1 and L_2 regularization parameters and presence of intercept.

See below for a pseudo-code of the algorithm:

Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = []
For 1..n_iter
    pick a set of hyperparameters (hp) to test based on cv_history.
    use hp to calibrate a model on each cross-validation set
    evaluate it on corresponding sample set-aside for cross-validation
    compute an average cv score and append it to cv_history.
Pick the hyper parameters with the best cv score.
Train a model using the complete ds_train_val data set.
Evaluate the model on the holdout data set.
Return resulting model and its performance on the holdout set.

Parameters
  • xs (List[Union[Attribute, str]]) – A list of attributes that are the features.

  • y (Union[Attribute, str]) – The target attribute.

  • dataset (DataSet) – The dataset containing the attributes.

  • cv (int) – The number of cross-validation steps to perform for each candidate set of hyperparameters.

  • train_fraction (float) – The fraction of the dataset to set aside for model training and cross-validation – to be split further according to k-fold cross-validation strategy.

  • metric (Union[str, Metric]) – Model performance metric to optimize. Examples: ‘accuracy’, ‘logloss’, ‘auroc’, ‘aupr’.

  • n_iter (int) – The number of optimization steps. Default: 100

  • l1_bounds (Tuple[float, float]) – Lower and upper bounds for l1 regularization. Default: (1E-10, 1E10)

  • l2_bounds (Tuple[float, float]) – Lower and upper bounds for l2 regularization. Default: (1E-10, 1E10)

  • fit_intercept (Optional[bool]) – If None, search will consider both options.

Returns

Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including

  1. model calibrated with recommended hyperparameters and

  2. its performance on the holdout dataset.

Return type

HyperOptAnalysis

See also

Paper with hyperparameter optimization algorithm:

http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf

leapyear.analytics.hyperopt_multinomial(xs, y, dataset, cv, train_fraction, metric, n_iter=100, l1_bounds=(1e-10, 10000000000.0), l2_bounds=(1e-10, 10000000000.0), fit_intercept=None)

Analysis: Hyperparameter optimization for multinomial logistic regression.

Note

This method has been disabled for now, pending important improvements.

Calibrate a multinomial logistic regression model by optimizing its cross-validation score with respect to model hyperparameters - L_1 and L_2 regularization parameters and presence of intercept.

See below for a pseudo-code of the algorithm:

Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = []
For 1..n_iter
    pick a set of hyperparameters (hp) to test based on cv_history.
    use hp to calibrate a model on each cross-validation set
    evaluate it on corresponding sample set-aside for cross-validation
    compute an average cv score and append it to cv_history.
Pick the hyper parameters with the best cv score.
Train a model using the complete ds_train_val data set.
Evaluate the model on the holdout data set.
Return resulting model and its performance on the holdout set.

Parameters
  • xs (List[Union[Attribute, str]]) – A list of attributes that are the features.

  • y (Union[Attribute, str]) – The target attribute.

  • dataset (DataSet) – The dataset containing the attributes.

  • cv (int) – The number of cross-validation steps to perform for each candidate set of hyperparameters.

  • train_fraction (float) – The fraction of the dataset to set aside for model training and cross-validation – to be split further according to k-fold cross-validation strategy.

  • metric (Union[str, Metric]) – Model performance metric to optimize. Examples: ‘accuracy’, ‘logloss’

  • n_iter (int) – The number of optimization steps. Default: 100.

  • l1_bounds (Tuple[float, float]) – Lower and upper bounds for l1 regularization. Default: (1E-10, 1E10)

  • l2_bounds (Tuple[float, float]) – Lower and upper bounds for l2 regularization. Default: (1E-10, 1E10)

  • fit_intercept (Optional[bool]) – Whether to fit an intercept. If None, the search considers both options.

Returns

Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including

  1. model calibrated with recommended hyperparameters and

  2. its performance on the holdout dataset.

Return type

HyperOptAnalysis

See also

Paper with hyperparameter optimization algorithm:

http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf

Deprecated: This method has been disabled for now, pending important improvements.

leapyear.analytics.hyperopt_rf(xs, y, dataset, cv, train_fraction, metric, n_iter=100, max_trees=1000, max_depth=20)

Analysis: Hyperparameter optimization for a random forest model.

Calibrate a random forest model by optimizing its cross-validation score with respect to the model hyperparameters: the number of trees and the depth (or height) limit of individual trees.

The algorithm in pseudo-code:

Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross-validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = []
For 1..n_iter
    pick a set of hyperparameters (hp) to test based on cv_history.
    use hp to calibrate a model on each ds_train
    evaluate it on the corresponding ds_val
    compute the average cv score and append it to cv_history.
Pick the hyperparameters with the best cv score.
Train a model using the complete ds_train_val dataset.
Evaluate the model on the holdout dataset.
Return the resulting model and its performance on the holdout dataset.

Parameters
  • xs (List[Union[Attribute, str]]) – A list of attributes that are the features.

  • y (Union[Attribute, str]) – The target attribute.

  • dataset (DataSet) – The dataset containing the attributes.

  • cv (int) – The number of cross-validation folds (k) used to score each candidate set of hyperparameters.

  • train_fraction (float) – The fraction of the dataset to set aside for model training and cross-validation – to be split further according to k-fold cross-validation strategy.

  • metric (Union[str, Metric]) – The metric to optimize. Examples: ‘accuracy’, ‘logloss’, ‘auroc’, ‘aupr’

  • n_iter (int) – The number of optimization steps. Default: 100

  • max_trees (int) – Maximum number of trees. Default: 1000

  • max_depth (int) – Maximum tree depth. Default: 20

Returns

Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including

  1. model calibrated with recommended hyperparameters and

  2. its performance on the holdout dataset.

Return type

HyperOptAnalysis

See also

Paper with hyperparameter optimization algorithm:

http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf
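
To make the roles of train_fraction and cv concrete, the sketch below computes the row counts produced by the two splitting steps of the pseudo-code. It assumes exact proportional splits and equally sized folds, which the server is not documented to guarantee; split_sizes is a hypothetical helper and not part of leapyear.

```python
def split_sizes(n_rows, train_fraction, cv):
    """Row counts for the holdout split and for each k-fold pair.

    Hypothetical helper for illustration only; assumes exact
    proportional splits and equally sized folds.
    """
    n_train_val = round(n_rows * train_fraction)
    n_holdout = n_rows - n_train_val
    n_fold_val = n_train_val // cv            # rows validated on in each fold
    n_fold_train = n_train_val - n_fold_val   # rows fitted on in each fold
    return {
        "train_val": n_train_val,
        "holdout": n_holdout,
        "fold_train": n_fold_train,
        "fold_val": n_fold_val,
    }


sizes = split_sizes(n_rows=10000, train_fraction=0.8, cv=4)
print(sizes)
```

Under these assumptions, a larger cv leaves more rows for fitting in each fold but requires more model fits per candidate set of hyperparameters.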

leapyear.analytics.hyperopt_regression_trees(xs, y, dataset, cv, train_fraction, metric, n_iter=100, max_trees=1000, max_depth=20)

Analysis: Hyperparameter optimization for a regression trees model.

Calibrate a regression trees model by optimizing its cross-validation score with respect to the model hyperparameters: the number of trees and the depth (or height) limit of individual trees.

The algorithm in pseudo-code:

Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross-validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = []
For 1..n_iter
    pick a set of hyperparameters (hp) to test based on cv_history.
    use hp to calibrate a model on each ds_train
    evaluate it on the corresponding ds_val
    compute the average cv score and append it to cv_history.
Pick the hyperparameters with the best cv score.
Train a model using the complete ds_train_val dataset.
Evaluate the model on the holdout dataset.
Return the resulting model and its performance on the holdout dataset.

Parameters
  • xs (List[Union[Attribute, str]]) – A list of attributes that are the features.

  • y (Union[Attribute, str]) – The target attribute.

  • dataset (DataSet) – The dataset containing the attributes.

  • cv (int) – The number of cross-validation folds (k) used to score each candidate set of hyperparameters.

  • train_fraction (float) – The fraction of the dataset to set aside for model training and cross-validation – to be split further according to k-fold cross-validation strategy.

  • metric (Union[str, Metric]) – The metric to optimize. Examples: ‘mae’, ‘mse’, ‘r2’

  • n_iter (int) – The number of optimization steps. Default: 100

  • max_trees (int) – Maximum number of trees. Default: 1000

  • max_depth (int) – Maximum tree depth. Default: 20

Returns

Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including

  1. model calibrated with recommended hyperparameters and

  2. its performance on the holdout dataset.

Return type

HyperOptAnalysis

See also

Paper with hyperparameter optimization algorithm:

http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf
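
The procedure shared by the hyperopt analyses can be sketched locally in plain Python. This is a minimal illustration under stated assumptions, not the server's implementation: the model is a toy one-dimensional ridge regression, the only hyperparameter is an L2 penalty, and candidates are drawn by plain random search on a log scale rather than chosen from cv_history as in the referenced paper. The names fit_ridge_1d, mse and hyperopt_sketch are hypothetical.

```python
import math
import random


def fit_ridge_1d(rows, l2):
    """Fit y ~ w * x with an L2 penalty and return the slope w."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, _ in rows)
    return sxy / (sxx + l2)


def mse(w, rows):
    """Mean squared error of the slope w on (x, y) rows."""
    return sum((y - w * x) ** 2 for x, y in rows) / len(rows)


def hyperopt_sketch(rows, cv, train_fraction, n_iter, l2_bounds=(1e-3, 1e3), seed=0):
    rng = random.Random(seed)
    rows = list(rows)
    rng.shuffle(rows)

    # Split the dataset into ds_train_val / ds_holdout based on train_fraction.
    n_train_val = round(len(rows) * train_fraction)
    ds_train_val, ds_holdout = rows[:n_train_val], rows[n_train_val:]

    # k-fold split: fold i is the validation set of the i-th (ds_train, ds_val) pair.
    folds = [ds_train_val[i::cv] for i in range(cv)]

    log_lo, log_hi = math.log10(l2_bounds[0]), math.log10(l2_bounds[1])
    cv_history = []
    for _ in range(n_iter):
        # Assumption: plain random search on a log scale. The documented
        # algorithm instead picks hp based on cv_history.
        l2 = 10 ** rng.uniform(log_lo, log_hi)
        scores = []
        for i, ds_val in enumerate(folds):
            ds_train = [r for j, fold in enumerate(folds) if j != i for r in fold]
            scores.append(mse(fit_ridge_1d(ds_train, l2), ds_val))
        cv_history.append((sum(scores) / cv, l2))

    # Lower MSE is better, so the best entry is the minimum.
    best_score, best_l2 = min(cv_history)
    model = fit_ridge_1d(ds_train_val, best_l2)
    return model, best_l2, mse(model, ds_holdout)


# Synthetic data: y = 2x plus a little Gaussian noise.
data_rng = random.Random(1)
rows = [(i / 10, 2.0 * (i / 10) + data_rng.gauss(0, 0.1)) for i in range(1, 101)]
model, best_l2, holdout_mse = hyperopt_sketch(rows, cv=4, train_fraction=0.8, n_iter=20)
print(f"slope={model:.3f} l2={best_l2:.4g} holdout_mse={holdout_mse:.4f}")
```

The holdout rows are never used to choose hyperparameters, so the final score is an estimate of performance on unseen data.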

Context Managers

leapyear.analytics.ignore_computation_cache()

Temporary context where computations do not utilize the computation cache.

The computation cache is intended to prevent wasting privacy exposure on queries that were previously computed. Entering this context manager will disable the use of the cache and allow repeated computations to return different differentially private answers.

Example

An administrator wants to run a count multiple times to estimate the random distribution of responses around the precise value.

>>> import leapyear.analytics as la
>>> from leapyear.analytics import ignore_computation_cache
>>> with ignore_computation_cache():
>>>     results = [la.count_rows(table).run() for _ in range(10)]

See also

  • To override the behavior for a single computation, see the cache keyword argument in run() or check().

  • The default_analysis_caching keyword argument in Client is temporarily overridden within this context manager.

Note

Additional permissions may be required to disable the computation cache.

Return type

None
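
Once repeated results have been collected as in the example above, their spread can be summarized with the Python standard library. The sketch below uses made-up numbers in place of server responses: results and precise_value are hypothetical values, and no LeapYear API is called.

```python
import statistics

# Hypothetical stand-ins for ten noisy count_rows(...).run() results.
results = [10470, 10481, 10466, 10475, 10479, 10468, 10477, 10472, 10469, 10474]
precise_value = 10473  # hypothetical exact count, assumed known for comparison

mean_result = statistics.mean(results)
spread = statistics.stdev(results)      # sample standard deviation of the responses
bias = mean_result - precise_value      # how far the average sits from the exact count

print(f"mean={mean_result:.1f} stdev={spread:.2f} bias={bias:+.1f}")
```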

leapyear.analytics.precise_computations(precise=True)

Temporary context specifying if the computations are precise or not.

Computations requested within this context are executed in precise mode, where differential privacy is not applied.

Parameters

precise (bool) – True to enable precise computations within the context, False to disable them.

Example

An administrator wants to compare the responses of a number of computations with and without differential privacy applied. Precise mode may not be available for all computations.

>>> def my_computation():
>>>     symbols = ("APPL", "GOOG", "MSFT")
>>>     return [la.count_rows(table.where(col("SYM") == lit(val))).run() for val in symbols]
>>>
>>> res_dp = my_computation()
>>> with precise_computations():
>>>     res_no_dp = my_computation()

See also

  • To override the behavior for a single computation, see the precise keyword argument in run() or check().

Note

Additional permissions may be necessary to enable precise computations.

Return type

None
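
One simple way to compare the two result lists from the example above is the relative error of each differentially private answer against its precise counterpart. The sketch below is illustrative only: relative_errors is a hypothetical helper, and the numbers stand in for res_dp and res_no_dp.

```python
def relative_errors(dp_values, precise_values):
    """Relative error |dp - precise| / |precise| for each pair of results."""
    errors = []
    for dp, exact in zip(dp_values, precise_values):
        if exact == 0:
            # Relative error is undefined for a zero reference value.
            errors.append(0.0 if dp == 0 else float("inf"))
        else:
            errors.append(abs(dp - exact) / abs(exact))
    return errors


# Hypothetical values standing in for res_dp and res_no_dp.
res_dp = [1012, 987, 2040]
res_no_dp = [1000, 1000, 2000]
errors = relative_errors(res_dp, res_no_dp)
print([f"{e:.1%}" for e in errors])
```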

Save/Load Models

LeapYear utilities for saving and loading machine learning models.

leapyear.ml_import_export.save(model, path_or_fd)

Save a machine learning model as JSON to either a file or a file-like object.

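Parameters
  • model – The machine learning model to save.

  • path_or_fd – The path in the file system or the file-like object to which the model is written.
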
Example

>>> from leapyear.ml_import_export import save
>>> save(model, 'model.json')

Return type

None

leapyear.ml_import_export.load(path_or_fd, expected_model_type=None, **kwargs)

Load a machine learning model from a file or a file-like object.

Parameters
  • path_or_fd – The path in the file system or an in-memory stream from which to load the model.

  • expected_model_type – If None, the type of the loaded model is not checked. Otherwise, the loaded model is checked to be of the expected type.

  • rf_type – When loading RandomForest models with serialization number 0, setting this to “classification” or “regression” loads the model as a RandomForestClassifier or a RandomForestRegressor object, respectively. If not specified, loading such a model raises an error. The value is ignored for all other model types.

Examples

  1. Loading a previously saved model of unspecified type

>>> from leapyear.ml_import_export import load
>>> model = load('model.json')

  2. Loading a previously saved RandomForestClassifier model

>>> from leapyear.ml_import_export import load
>>> model = load('random_forest_classifier.json', RandomForestClassifier)

Return type

Union[ClusterModel, GLM, GradientBoostedTreeClassifier, RandomForestClassifier, RandomForestRegressor]
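
The "path or file-like object" convention used by save() and load() can be illustrated with the standard json module. The sketch below is not LeapYear's implementation: save_json and load_json are hypothetical helpers, the payload is a plain dictionary rather than a real model, and the type check only mimics the role that expected_model_type is described as playing.

```python
import io
import json
import os


def save_json(obj, path_or_fd):
    """Write obj as JSON to a path or to an open text stream."""
    if isinstance(path_or_fd, (str, os.PathLike)):
        with open(path_or_fd, "w") as fd:
            json.dump(obj, fd)
    else:
        json.dump(obj, path_or_fd)


def load_json(path_or_fd, expected_type=None):
    """Read JSON from a path or stream, optionally checking the result type."""
    if isinstance(path_or_fd, (str, os.PathLike)):
        with open(path_or_fd) as fd:
            obj = json.load(fd)
    else:
        obj = json.load(path_or_fd)
    if expected_type is not None and not isinstance(obj, expected_type):
        raise TypeError(f"expected {expected_type.__name__}, got {type(obj).__name__}")
    return obj


# Round trip through an in-memory stream instead of a file on disk.
buffer = io.StringIO()
save_json({"kind": "toy_model", "weights": [0.5, -1.25]}, buffer)
buffer.seek(0)  # rewind before reading back
restored = load_json(buffer, expected_type=dict)
print(restored)
```

An in-memory stream is convenient when a serialized model should be passed on without touching the file system.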