# Module leapyear.analytics

Statistics and machine learning algorithms.

LeapYear analyses are functions that are executed by the server to compute statistics or to perform machine learning tasks on DataSets. These functions return an Analysis type, which is executed on the server by calling the run() method.

For simple statistics, such as count() or mean(), the values can be extracted using the following pattern:

>>> from leapyear import Client, DataSet
>>> from leapyear.analytics import count_rows, mean
>>> dataset = DataSet.from_table('db.table')
>>> dataset_rows_analysis = count_rows(dataset)
>>> n_rows = dataset_rows_analysis.run()
>>> print(n_rows)
10473
>>> dataset_mean_x_analysis = mean('x0', dataset)
>>> mean_x = dataset_mean_x_analysis.run()
>>> print(mean_x)
5.234212346345


The computation of all univariate statistics follows the pattern for mean(). For more complicated machine learning tasks, multiple columns must be specified, depending on the task.

Unsupervised learning tasks (like clustering) will generally require the specification of which features in the DataSet to use. Supervised learning tasks (like regression) will additionally require the specification of a target variable.

For example, we can train a linear regression model as follows:

>>> from leapyear.analytics import generalized_linreg
>>> regression = generalized_linreg(['x0', 'x1'], 'y', dataset, affine=True, l2reg=1.0)
>>> model = regression.run()


Helper routines are available for performing cross-validation (see cross_val_score_linreg()). Note that, unlike other analyses, they are immediately executed (without calling run()):

>>> from leapyear.analytics import cross_val_score_linreg
>>> cross_val_score = cross_val_score_linreg(
...     ['x0', 'x1'], 'y', dataset, cv=3,
...     affine=True, l1reg=0.1, l2reg=1.0, scorer='mse'
... )


## Data Analysis

leapyear.analytics.count(attr, dataset=None, drop_nulls=False)

Analysis: Count the elements of an attribute.

This analysis can be executed using the run method to compute the approximate count of elements, including NULL values.

The user can request additional information about the computation with run(rich_result=True). In this case, an object of RandomizationInterval will be generated, likely including the precise value of the computation on the data sample.

Parameters
• attr (Union[Attribute, str]) – The attribute to compute the count of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

• dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

• drop_nulls (bool) – Whether to ignore NULL values. Default: False.

Returns

Analysis object that can be executed using the run method.

Return type

CountAnalysisWithRI

leapyear.analytics.count_rows(dataset)

Analysis: Count the number of rows in a dataset.

This analysis can be executed using the run method to compute the approximate number of rows in the dataset.

The user can request additional information about the computation with run(rich_result=True). In this case, an object of RandomizationInterval will be generated, likely including the precise value of the computation on the data sample.

Parameters

dataset (DataSet) – The input dataset.

Returns

Analysis object that can be executed using the run method.

Return type

CountAnalysisWithRI

leapyear.analytics.count_distinct(attr, dataset=None, drop_nulls=False)

Analysis: Count the unique elements of an attribute.

Parameters
Returns

Prepared analysis of the count.

Return type

Analysis

leapyear.analytics.count_distinct_rows(dataset)

Analysis: Count the number of distinct rows in a dataset.

Returns

Analysis for counting the number of distinct rows.

Return type

ScalarAnalysis

leapyear.analytics.mean(attr, dataset=None, drop_nulls=False)

Analysis: Compute the mean of an attribute.

This analysis can be executed using the run method to compute the approximate mean of the attribute.

The user can request additional information about the computation with run(rich_result=True). In this case, the RandomizationInterval object will be generated, likely including the precise value of the computation on the data sample.

Note: If the attribute is nullable, setting drop_nulls=True is necessary for the computation to go through.

Parameters
• attr (Union[Attribute, str]) – The attribute to compute the mean of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

• dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

• drop_nulls (bool) – Whether to allow running on a nullable column, ignoring NULL values.

Returns

Analysis object that can be executed using the run method.

Return type

ScalarAnalysisWithCI

leapyear.analytics.sum(attr, dataset=None, drop_nulls=False)

Analysis: Compute the sum of a numeric attribute.

This analysis can be executed using the run method to compute the approximate sum of the attribute.

The user can request additional information about the computation with run(rich_result=True). In this case, the RandomizationInterval object will be generated, likely including the precise value of the computation on the data sample.

Note: If the attribute is nullable, setting drop_nulls=True is necessary for the computation to go through.

Parameters
• attr (Union[Attribute, str]) – The attribute to compute the sum of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

• dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

• drop_nulls (bool) – Whether to allow running on a nullable column, ignoring NULL values.

Returns

Analysis object that can be executed using the run method.

Return type

ScalarAnalysisWithRI

leapyear.analytics.variance(attr, dataset=None, drop_nulls=False)

Analysis: Compute the variance of an attribute.

This analysis can be executed using the run method to compute the approximate variance of the attribute.

The user can request additional information about the computation with run(rich_result=True). In this case, the RandomizationInterval object will be generated, likely including the precise value of the computation on the data sample.

Note: If the attribute is nullable, setting drop_nulls=True is necessary for the computation to go through.

Parameters
• attr (Union[Attribute, str]) – The attribute to compute the variance of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

• dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

• drop_nulls (bool) – Whether to allow running on a nullable column, ignoring NULL values.

Returns

Analysis object that can be executed using the run method.

Return type

ScalarAnalysisWithCI

leapyear.analytics.min(attr, dataset=None, drop_nulls=False)

Analysis: Compute the minimum value of an attribute.

Parameters
• attr (Union[Attribute, str]) – The attribute to compute the min of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

• dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

• drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the min.

Return type

ScalarAnalysis

Note

The minimum reported is the 1/1000 quantile of the attribute.

When the attribute being analyzed has a very narrow range of possible values, the minimum returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than 0.01, the minimum returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by 10/width, compute the minimum, and rescale the returned value by width/10.

The result of this analysis may be very different from the true minimum of the data sample in the following two scenarios:

1. When the underlying attribute distribution has significant outliers (e.g. a very long tail) - this is because the minimum computed is the 1/1000 quantile of the attribute, and

2. When the public lower bound is very different from the true minimum of the data sample - this is because differential privacy is aiming to minimize the effect of individual records on the output.
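To see why a 1/1000 quantile can differ from the true minimum, the following non-private sketch compares the two on data with one extreme outlier. The `nearest_rank_quantile` helper is a hypothetical illustration, not the LeapYear implementation, and it adds no randomization.

```python
import math


def nearest_rank_quantile(values, q):
    """Return the q-quantile of values using the nearest-rank rule (no noise added)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1 inclusive")
    ordered = sorted(values)
    # Nearest-rank rule: the smallest value with at least a fraction q
    # of the data at or below it.
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# 10,000 well-behaved values plus one extreme outlier in the lower tail.
data = [float(v) for v in range(100, 10100)] + [-1.0e6]

true_min = min(data)                               # dominated by the outlier
quantile_min = nearest_rank_quantile(data, 0.001)  # unaffected by the outlier
```

Here `true_min` is the outlier itself, while `quantile_min` stays inside the bulk of the data, which is the behaviour described in scenario 1.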

leapyear.analytics.max(attr, dataset=None, drop_nulls=False)

Analysis: Compute the maximum value of an attribute.

Parameters
• attr (Union[Attribute, str]) – The attribute to compute the max of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

• dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

• drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the max.

Return type

Analysis

Note

The maximum reported is the 999/1000 quantile of the attribute.

When the attribute being analyzed has a very narrow range of possible values, the maximum returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than 0.01, the maximum returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by 10/width, compute the maximum, and rescale the returned value by width/10.

The result of this analysis may be very different from the true maximum of the data sample in the following two scenarios:

1. When the underlying attribute distribution has significant outliers (e.g. a very long tail) - this is because the maximum computed is the 999/1000 quantile of the attribute, and

2. When the public upper bound is very different from the true maximum of the data sample - this is because differential privacy is aiming to minimize the effect of individual records on the output.
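The rescaling workaround for narrow-range attributes can be sketched in plain Python. This is a non-private illustration that assumes the public bounds of the attribute are known. The helper name `rescale_then_compute` is hypothetical, and the built-in `max` stands in for the server-side analysis.

```python
def rescale_then_compute(values, lower, upper, compute):
    """Apply `compute` to values stretched to a range of width 10, then undo the stretch."""
    width = upper - lower
    if width <= 0:
        raise ValueError("upper must be greater than lower")
    factor = 10.0 / width
    scaled = [v * factor for v in values]  # scale the attribute by 10/width
    result = compute(scaled)               # stand-in for the max, min or quantile analysis
    return result * (width / 10.0)         # rescale the returned value by width/10


# Attribute whose possible values lie in [0.500, 0.505], a width below 0.01.
values = [0.5010, 0.5025, 0.5040]
recovered_max = rescale_then_compute(values, 0.500, 0.505, max)
```

The same wrapper applies to the minimum, median and quantile analyses, since each of them commutes with multiplication by a positive constant.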

leapyear.analytics.median(attr, dataset=None, drop_nulls=False)

Analysis: Compute the median value of an attribute.

Parameters
• attr (Union[Attribute, str]) – The attribute to compute the median of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

• dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

• drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the median.

Return type

Analysis

Note

When the attribute being analyzed has a very narrow range of possible values, the median returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than 0.01, the median returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by 10/width, compute the median, and rescale the returned value by width/10.

leapyear.analytics.quantile(q, attr, dataset=None, drop_nulls=False)

Analysis: Compute a certain quantile q of an attribute.

Parameters
• q (float) – Quantile to compute, which must be between 0 and 1 inclusive.

• attr (Union[Attribute, str]) – The attribute to compute the quantile of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

• dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

• drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the quantile.

Return type

Analysis

Note

When the attribute being analyzed has a very narrow range of possible values, the quantile returned may be inaccurate. As an extreme example, if the width of the interval of possible values of an attribute is less than 0.01, the quantile returned will be a fixed number that does not depend on the data distribution. For such cases, scale the attribute by 10/width, compute the quantile, and rescale the returned value by width/10.

leapyear.analytics.skewness(attr, dataset=None, drop_nulls=False)

Analysis: Compute the skewness of an attribute.

Parameters
• attr (Union[Attribute, str]) – The attribute to compute the skewness of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

• dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

• drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the skewness.

Return type

Analysis

leapyear.analytics.kurtosis(attr, dataset=None, drop_nulls=False)

Analysis: Compute the excess kurtosis of an attribute.

Parameters
• attr (Union[Attribute, str]) – The attribute to compute the kurtosis of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

• dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

• drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the kurtosis.

Return type

Analysis
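Excess kurtosis is the fourth standardized moment minus 3, so a normal distribution scores 0. The sketch below is a plain, non-private computation using population moments. Whether LeapYear uses a population or a bias-corrected sample estimator is not stated in this reference, so the estimator choice here is an assumption.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3."""
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("kurtosis is undefined for constant data")
    return m4 / m2 ** 2 - 3.0


# A flat, evenly spaced sample has lighter tails than a normal distribution,
# so its excess kurtosis is negative.
flat_sample_kurtosis = excess_kurtosis([1, 2, 3, 4, 5])
```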

leapyear.analytics.iqr(attr, dataset=None, drop_nulls=False)

Analysis: Compute the interquartile range of an attribute.

Parameters
• attr (Union[Attribute, str]) – The attribute to compute the interquartile range of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

• dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

• drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

Returns

Prepared analysis of the iqr.

Return type

Analysis

leapyear.analytics.histogram(attr, dataset=None, bins=10, interval=None)

Analysis: Compute the histogram of the attribute in the dataset.

Parameters
Returns

Prepared analysis of the histogram.

Return type

Analysis
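As a rough, non-private illustration of what a histogram analysis computes, the sketch below counts values in equal-width bins. It assumes that `bins` is the number of equal-width bins and that `interval` is a `(low, high)` pair. Neither assumption is confirmed by this reference, and the helper name is hypothetical.

```python
def equal_width_histogram(values, bins, interval):
    """Count values in `bins` equal-width bins spanning interval = (low, high)."""
    low, high = interval
    if bins <= 0 or high <= low:
        raise ValueError("need bins > 0 and high > low")
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if v < low or v > high:
            continue  # values outside the interval are ignored in this sketch
        # A value equal to the right edge is placed in the last bin.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts


counts = equal_width_histogram([0, 1, 2, 5, 9, 10], bins=5, interval=(0, 10))
```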

leapyear.analytics.histogram2d(x_attr, y_attr, dataset=None, x_bins=10, y_bins=10, x_range=None, y_range=None)

Analysis: Compute the 2D histogram of two attributes in the dataset.

Parameters
Returns

Prepared analysis of the histogram.

Return type

Analysis

leapyear.analytics.correlation_matrix(xs, dataset, *, center=True, scale=True, **kwargs)

Analysis: Compute the correlation matrix of the set of attributes.

NOTE: This analysis does not require run().

Parameters
• xs (List[str]) – A list of attribute names to compute correlation matrix for.

• dataset (DataSet) – The DataSet containing these attributes.

• center (bool) – Whether to center the columns before computing correlation matrix. If False, proceed assuming the columns are already centered.

• scale (bool) – Whether to divide covariance matrix by number of rows. If False, do not divide.

• max_timeout_sec – Specifies the maximum amount of time (in seconds) the user is willing to wait for a response. If set to None, this function will poll the server indefinitely. If it is run with scale or center set to True, the timeout will be multiplied. Defaults to 300.0 (5 minutes).

Returns

The correlation matrix.

Return type

np.ndarray

leapyear.analytics.covariance_matrix(xs, dataset, *, center=True, scale=True, **kwargs)

Analysis: Compute the covariance matrix of the set of attributes.

NOTE: This analysis does not require run().

Parameters
• xs (List[str]) – A list of attribute names that are the features.

• dataset (DataSet) – The DataSet of the attributes.

• center (bool) – Whether to center the columns before computing the covariance matrix. If False, assume the columns are already centered.

• scale (bool) – Whether to divide the matrix by number of rows. If False, do not divide.

• max_timeout_sec – Specifies the maximum amount of time (in seconds) the user is willing to wait for a response. If set to None, this function will poll the server indefinitely. If it is run with scale or center set to True, the timeout will be multiplied. Defaults to 300.0 (5 minutes).

Returns

The covariance matrix.

Return type

np.ndarray
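The effect of the center and scale flags can be illustrated with a plain, non-private computation. The helper below is a hypothetical sketch following the parameter descriptions above (center subtracts column means, scale divides by the number of rows). It is not the differentially private routine run by the server.

```python
def plain_covariance_matrix(rows, center=True, scale=True):
    """Non-private covariance of the columns of `rows` (a list of equal-length lists)."""
    n = len(rows)
    d = len(rows[0])
    if center:
        # Subtract each column mean so that every column has mean zero.
        means = [sum(r[j] for r in rows) / n for j in range(d)]
        rows = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sum of products of every pair of columns.
    cov = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    if scale:
        # Divide by the number of rows.
        cov = [[c / n for c in row] for row in cov]
    return cov


rows = [[1.0, 2.0], [3.0, 6.0]]
centered_and_scaled = plain_covariance_matrix(rows)
raw_cross_products = plain_covariance_matrix(rows, center=False, scale=False)
```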

leapyear.analytics.describe(dataset, attributes=None)

Describe the columns of the dataset for use in data exploration.

The describe function provides a way for an analyst to perform initial rough data exploration on a dataset. To get more accurate statistics, the individual functions mean(), count(), et cetera, are recommended. This function does not use the analysis cache of the other statistics functions.

Numeric columns are described by their count, mean, standard deviation, minimum, maximum and the quartiles. Categorical columns (factors and booleans) are described by their count, distinct count and frequency of the most frequent element.

Parameters
Returns

Prepared analysis for describing the dataset. Execute the analysis using the run() method.

Return type

DescribeAnalysis

leapyear.analytics.groupby_agg_view(dataset, attrs, agg_attr=None, agg_type=<GroupByAggType.COUNT: 1>, *, max_groupby_agg_keys=5000, size_threshold=None)

Compute aggregate statistic within each group and output aggregate results.

Only groups with estimated size larger than minimum_dataset_size will be returned. This parameter can be set in run.

Parameters
Returns

Analysis object that can be executed using the run method to return aggregation results. The results can be accessed as a pandas dataframe using .to_dataframe().

Return type

GroupbyAggAnalysis

Note: privacy exposure estimate for this analysis is not supported.

Example

For each age group and gender, compute the mean income.

>>> groupby_agg_view(ds, ["AGE", "GENDER"], 'INCOME', 'mean').run(minimum_dataset_size=1000)
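The size filter can be pictured with a small, non-private sketch: group records by the key attributes, aggregate within each group, and drop groups that are not larger than the threshold. The helper below is hypothetical and uses exact group sizes, whereas the server works with estimated sizes.

```python
from collections import defaultdict


def groupby_mean(records, key_fields, value_field, minimum_group_size):
    """Mean of value_field per group, keeping only groups larger than the threshold."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in key_fields)
        groups[key].append(record[value_field])
    return {
        key: sum(vals) / len(vals)
        for key, vals in groups.items()
        if len(vals) > minimum_group_size  # small groups are not returned
    }


records = [
    {"AGE": "30-39", "GENDER": "F", "INCOME": 50.0},
    {"AGE": "30-39", "GENDER": "F", "INCOME": 70.0},
    {"AGE": "30-39", "GENDER": "F", "INCOME": 60.0},
    {"AGE": "40-49", "GENDER": "M", "INCOME": 80.0},
]
result = groupby_mean(records, ["AGE", "GENDER"], "INCOME", minimum_group_size=2)
```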


### Data Cleaning

leapyear.analytics.count_cast(attr, dataset=None)

Analysis: Count the number of nulls that would result from converting to various types.

Note

This method has been disabled for now, pending important improvements.

Parameters
• attr (Union[Attribute, str]) – The attribute to compute the count of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

• dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

Returns

Prepared analysis of the count.

Return type

Analysis

Deprecated: Will be removed in LY 3.0.

leapyear.analytics.guess_bounds(attr, dataset=None, drop_nulls=False, min_power=-5, max_power=100, base=2)

Analysis: Guess the bounds of an attribute.

Note

This method has been disabled for now, pending important improvements.

Parameters
• attr (Union[Attribute, str]) – The attribute to compute the bounds of. Either a standalone attribute or the name of an attribute from a dataset provided by dataset.

• dataset (Optional[DataSet]) – The dataset to use when attr is a string. When attr is an Attribute this field is ignored.

• drop_nulls (bool) – Whether to allow running on a nullable column, ignoring nulls.

• min_power (int) – The smallest power of the base to consider.

• max_power (int) – The largest power of the base to consider.

• base (float) – The base to use in the exponential search.

Returns

Prepared analysis of the bounds.

Return type

Analysis

Deprecated: Will be removed in LY 3.0.

## Machine Learning

### Unsupervised learning

leapyear.analytics.kmeans(xs, dataset, n_iters=10, n_clusters=3)

Analysis: K-means clustering.

Identifies centers of clusters for a set of data points, by

1. Randomly initializing a chosen number of cluster centers (centroids) in the feature space

2. Associating each data point with the nearest centroid

3. Iteratively adjusting centroids to locations based on differentially private computation of the mean for each feature

Parameters
Returns

Analysis object that can be executed using the run() method. Once executed, it outputs clustering analysis results, such as centroids.

Return type

ClusteringAnalysis
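The three steps listed above correspond to the classic Lloyd iteration. The sketch below is a plain, non-private version in which the centroid update uses an exact mean, whereas LeapYear uses a differentially private mean. The function name, the `seed` argument and the `initial_centroids` argument are hypothetical and exist only to make the illustration reproducible.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def plain_kmeans(points, n_clusters=3, n_iters=10, seed=0, initial_centroids=None):
    """Non-private k-means; returns the list of centroids."""
    dims = len(points[0])
    if initial_centroids is None:
        # Step 1: random centroids inside the bounding box of the data.
        rng = random.Random(seed)
        lows = [min(p[d] for p in points) for d in range(dims)]
        highs = [max(p[d] for p in points) for d in range(dims)]
        centroids = [
            [rng.uniform(lows[d], highs[d]) for d in range(dims)]
            for _ in range(n_clusters)
        ]
    else:
        centroids = [list(c) for c in initial_centroids]
    for _ in range(n_iters):
        # Step 2: associate each point with its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]),
            )
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        for k, members in enumerate(clusters):
            if members:
                centroids[k] = [
                    sum(p[d] for p in members) / len(members) for d in range(dims)
                ]
    return centroids


points = [[0, 0], [0, 2], [10, 10], [10, 12]]
centroids = plain_kmeans(points, n_clusters=2, initial_centroids=[[1, 1], [9, 9]])
```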

leapyear.analytics.eval_kmeans(centroids, xs, dataset)

Analysis: Evaluate the K-means model.

Evaluate the clustering model by computing the Normalized Intra Cluster Variance (NICV).

Parameters
Returns

Analysis representing the evaluation of a clustering model. It can be executed using the run() method to output the value of the evaluation metric.

Return type

ScalarAnalysis
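NICV is commonly defined as the average squared distance between each point and its nearest centroid, so lower values indicate tighter clusters. The sketch below computes that quantity exactly and without randomization. The exact normalization used by LeapYear is not specified in this reference, so treat the definition as an assumption.

```python
def nicv(points, centroids):
    """Average squared distance from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        # Squared Euclidean distance to the closest centroid.
        total += min(
            sum((x - c) ** 2 for x, c in zip(p, centroid))
            for centroid in centroids
        )
    return total / len(points)


points = [[0, 0], [0, 2], [10, 10], [10, 12]]
good_fit = nicv(points, [[0, 1], [10, 11]])  # centroids at the cluster means
poor_fit = nicv(points, [[5, 6]])            # a single centroid for all points
```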

leapyear.analytics.pca(xs, dataset, **kwargs)

Principal Component Analysis.

Compute the Principal Component Analysis (PCA) of the set of attributes using a differentially private algorithm.

NOTE: This analysis does not require run().

Parameters
• xs (List[str]) – A list of attribute names representing features to be considered for this analysis.

• dataset (DataSet) – DataSet that includes these attributes.

• max_timeout_sec – Specifies the maximum amount of time (in seconds) the user is willing to wait for a response. If set to None, this function will poll the server indefinitely. If it is run with scale or center set to True, the timeout will be multiplied. Defaults to 300.0 (5 minutes).

Return type

Tuple[ndarray, ndarray]

Returns

• explained_variances – Variance explained by each of the principal components - in other words, variance of each principal component coordinate when considered as feature on the input dataset.

• pca_matrix – Transformation matrix that can be used to translate original features to principal component coordinates. If all principal components are included, this becomes a square matrix corresponding to an orthogonal transformation (e.g. a reflection).

This matrix can be used to generate principal component features using leapyear.dataset.DataSet.transform() operation, as in:

tfds = ds.transform(x_vars, pca_matrix, 'pca')

NOTE: Signs may not match PCA transformation matrix computed by scikit-learn.
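The sign ambiguity is harmless: a principal direction and its negation explain the same variance, so two correct implementations can disagree on signs. The plain-Python sketch below checks this on a small, hand-made dataset whose first principal direction is (1, 1)/sqrt(2). The helper names are hypothetical and nothing here calls LeapYear.

```python
def project(rows, component):
    """Coordinate of each row along a unit direction."""
    return [sum(x * w for x, w in zip(row, component)) for row in rows]


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


rows = [
    [2.0, 0.0], [0.0, 2.0], [-2.0, 0.0],
    [0.0, -2.0], [3.0, 3.0], [-3.0, -3.0],
]
component = [2 ** -0.5, 2 ** -0.5]  # first principal direction of `rows`
flipped = [-w for w in component]   # same direction with the opposite sign

var_original = variance(project(rows, component))
var_flipped = variance(project(rows, flipped))
```

Both projections have the same variance, and the projected coordinates differ only by a factor of -1.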

### Supervised learning

leapyear.analytics.linreg(xs, y, dataset, affine=True, l1reg=1.0, l2reg=1.0, weight=None, offset=None, max_iters=25, family=None, link=None, link_power=0, variance_power=1, parameter_bounds=None)

Analysis: Linear regression.

Implements a differentially private algorithm to represent outcome (target) variable as a linear combination of selected features. Trains using either the “basic” or the “glm” algorithm depending on parameters. Available generalizations include

• offset of linear combination based on pre-built model - this would enable modeling of residual

• a link function applied to the linear combination of features

• non-Gaussian distribution of outcome around the mean

• regularization and weights applied during model optimization

Note

This method has been deprecated. Use basic_linreg or generalized_linreg instead.

Parameters
• xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.

• y (Union[Attribute, str]) – The attribute name that is the outcome.

• dataset (DataSet) – The DataSet of the attributes.

• affine (bool) – If True, fit an intercept term.

• l1reg (float) – The L1 regularization. Default value: 1.0.

• l2reg (float) – The L2 regularization. Default value: 1.0. Must be at least 0.0001 to limit the randomization effect for non-generalized models optimized via objective perturbation.

• weight (Union[Attribute, str, None]) – Optional column to weight each sample. Implies generalized regression.

• offset (Union[Attribute, str, None]) – Optional column for offset in offset regression. Implies generalized regression.

• max_iters (Optional[int]) – Optional maximum number of iterations for fitting the regression. Only relevant in the case of generalized regression.

• m – Optional maximum ratio of bounds at an iteration to the original bounds of y. This serves as an additional termination criterion. Default value is 10.

• family (Optional[str]) – Optional distribution of the label. Implies generalized regression. Possible values here are ‘gaussian’, ‘poisson’, ‘gamma’ and ‘tweedie’. Default for generalized regression is ‘gaussian’.

• link (Optional[str]) – Optional link function between mean of label distribution and prediction. Implies generalized regression. Possible values depend on family: ‘gaussian’ supports only ‘identity’ (default), ‘log’ and ‘inverse’; ‘poisson’ supports only ‘log’ (default), ‘identity’ and ‘sqrt’; ‘gamma’ supports only ‘inverse’ (default), ‘identity’ and ‘log’. There is no link function for the ‘tweedie’ family, use variance_power and link_power parameters instead.

• link_power (Optional[int]) – For the ‘tweedie’ distribution only, the exponent of the link function. Default value is 0, which is equivalent to ‘identity’ link.

• variance_power (Optional[int]) – For the ‘tweedie’ distribution only, the exponent of the variance. Default value is 1, which is equivalent to ‘gaussian’ family.

• parameter_bounds (Optional[List[Tuple[float, float]]]) – Restriction on the model parameters, including the intercept. Required for differential privacy. Default value: -10.0 to 10.0 for each parameter

Returns

Analysis representing the regression problem. It can be executed using the run() method to output the calibrated model.

Return type

GenLinAnalysis

Deprecated: Use basic_linreg or generalized_linreg instead.

leapyear.analytics.generalized_linreg(xs, y, dataset, *, affine=True, l2reg=1.0, weight=None, offset=None, max_iters=25, family='gaussian', link='identity', link_power=0, variance_power=1)

Analysis: Generalized linear regression.

Implements a differentially private algorithm to represent outcome (target) variable as a linear combination of selected features. This computation is based on the iterative weighted least squares algorithm. Trains using the “glm” algorithm.

Available generalizations include:

• offset of outputs based on pre-existing model - this enables modeling of residual

• use of alternative link functions applied to the linear combination of features

• application of regularization and weights during model optimization.

Note

Differentially private regressions aim to optimize models for predictive tasks, while protecting sensitive information from being learned from the trained model. They are not guaranteed to result in coefficients close to those that a non-differentially private algorithm would learn. In other words, while regressions trained with differential privacy may excel at predictive tasks, keep in mind that these were not designed for inference.

Please see the guides for more details on using regressions in LeapYear.

Parameters
• xs (List[Union[Attribute, str]]) – A list of Attributes or attribute names to be used as features.

• y (Union[Attribute, str]) – The Attribute or attribute name of the target.

• dataset (DataSet) – The DataSet of the attributes.

• affine (bool) – True if the algorithm should fit an intercept term.

• l2reg (float) – The L2 regularization parameter.

• weight (Union[Attribute, str, None]) – Optional column used to weight each sample. Implies generalized regression.

• offset (Union[Attribute, str, None]) – Optional column for offset in offset regression. Implies generalized regression.

• max_iters (Optional[int]) – Optional maximum number of iterations for fitting the regression. Regardless of this setting, the system often stops before reaching max_iters, for example after a single iteration. In such cases, a higher value of max_iters may lead to less privacy allocated to each iteration and, ultimately, a higher randomization effect.

• family (Optional[str]) – Optional distribution of the label. Implies generalized regression. Possible values here are ‘gaussian’ (the default), ‘poisson’, ‘gamma’ and ‘tweedie’.

• link (Optional[str]) – Optional link function between mean of label distribution and prediction. Implies generalized regression. Possible values depend on family: ‘gaussian’ supports only ‘identity’ (default), ‘log’ and ‘inverse’; ‘poisson’ supports only ‘log’ (default), ‘identity’ and ‘sqrt’; ‘gamma’ supports only ‘inverse’ (default), ‘identity’ and ‘log’. There is no link function for the ‘tweedie’ family, use variance_power and link_power parameters instead.

• link_power (Optional[int]) – For the ‘tweedie’ distribution only, the exponent of the link function. Default value is 0, which is equivalent to ‘identity’ link.

• variance_power (Optional[int]) – For the ‘tweedie’ distribution only, the exponent of the variance. Default value is 1, which is equivalent to ‘gaussian’ family.

Returns

Analysis of the regression problem, which can be executed using the run() method to output the calibrated model.

Return type

GenLinAnalysis
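A link function relates the mean of the label distribution to the linear combination of features, so a prediction applies the inverse link to that linear combination. The sketch below shows the standard inverse links for the named options. How a trained LeapYear model exposes its coefficients is not described in this reference, so `predict_mean` and its arguments are hypothetical.

```python
import math

# Inverse link functions: map the linear predictor eta to the predicted mean.
INVERSE_LINKS = {
    "identity": lambda eta: eta,      # mean = eta
    "log": math.exp,                  # log(mean) = eta
    "inverse": lambda eta: 1.0 / eta, # 1 / mean = eta
    "sqrt": lambda eta: eta ** 2,     # sqrt(mean) = eta
}


def predict_mean(coefficients, intercept, features, link="identity"):
    """Predicted mean for one row, given hypothetical fitted coefficients."""
    eta = intercept + sum(w * x for w, x in zip(coefficients, features))
    return INVERSE_LINKS[link](eta)


# The linear predictor is 1.0 + 0.5 * 2.0 + 1.0 * 1.0 = 3.0 in every call below.
identity_prediction = predict_mean([0.5, 1.0], 1.0, [2.0, 1.0])
log_prediction = predict_mean([0.5, 1.0], 1.0, [2.0, 1.0], link="log")
```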

leapyear.analytics.logreg(xs, y, dataset, affine=True, l1reg=0.0, l2reg=1.0)

Analysis: Logistic regression.

Implements a differentially private algorithm to represent outcome (target) variable as a logit-transformation of a linear combination of selected features. Trains using the “basic” algorithm.

Available generalizations include

• regularization applied during model optimization

Note

To help ensure that the differentially private training process can effectively optimize regression coefficients, it’s important to re-scale features (both dependent/target and independent/explanatory) to a similar domain (e.g. [0,1]). This can be done using leapyear.feature.BoundsScaler and will help ensure that the domain being searched for the coefficient will include the optimal model. See LeapYear guides for this and other recommendations on training accurate regressions.
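The arithmetic behind re-scaling to [0,1] is a min-max transformation based on known bounds. The sketch below is a plain-Python illustration only. It does not use leapyear.feature.BoundsScaler, whose interface is not shown in this reference, and clipping out-of-range values is an assumption of this sketch.

```python
def scale_to_unit_interval(values, lower, upper):
    """Map values from [lower, upper] to [0, 1], clipping anything outside the bounds."""
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    span = upper - lower
    return [min(1.0, max(0.0, (v - lower) / span)) for v in values]


# Ages with assumed public bounds of 20 and 70; 95 lies above the upper bound.
scaled_ages = scale_to_unit_interval([20, 45, 70, 95], lower=20, upper=70)
```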

Note

Differentially private regressions aim to optimize models for predictive tasks, while protecting sensitive information from being learned from the trained model. They are not guaranteed to result in coefficients close to those that a non-differentially private algorithm would learn. In other words, while regressions trained with differential privacy may excel at predictive tasks, keep in mind that these were not designed for inference.

Please see the guides for more details on using regressions in LeapYear.

Parameters
Returns

Analysis training the logistic regression model. It can be executed using the run() method to output the calibrated model.

Return type

GenLinAnalysis

leapyear.analytics.multinomial_regression(xs, y, dataset, affine=True, l1reg=1.0, l2reg=1.0, weight=None)

Analysis: Multinomial logistic regression.

Note

This method has been disabled for now, pending important improvements.

Implements a differentially private algorithm to represent outcome (target) variable distribution across various classes as a softmax-transformation of a linear combination of selected features. In contrast to logreg(), this method allows the target variable to have more than two values.

Includes the possibility of specifying weights to be applied during model optimization.

Parameters
Returns

Analysis training the multinomial logistic regression model. It can be executed using the run() method to output the calibrated model.

Return type

GenLinAnalysis

Deprecated: This method has been disabled for now, pending important improvements.

leapyear.analytics.generalized_logreg(xs, y, dataset, *, affine=True, l1reg=1.0, l2reg=1.0, weight=None, offset=None, max_iters=25, link='logit')

Analysis: Generalized logistic regression.

Implements a differentially private algorithm to represent the outcome (target) variable as a logit-transformation of a linear combination of selected features. This computation is based on the iterative weighted least squares algorithm. Trains using the “glm” algorithm.

Available generalizations include:

• offset of outputs based on pre-existing model - this enables modeling of residual

• use of alternative link functions applied to the linear combination of features

• application of regularization and weights during model optimization.

Note

Differentially private regressions aim to optimize models for predictive tasks, while protecting sensitive information from being learned from the trained model. They are not guaranteed to result in coefficients close to those that a non-differentially private algorithm would learn. In other words, while regressions trained with differential privacy may excel at predictive tasks, keep in mind that these were not designed for inference.

Please see the guides for more details on using regressions in LeapYear.

Parameters
• xs (List[Union[Attribute, str]]) – A list of Attributes or attribute names to be used as features.

• y (Union[Attribute, str]) – The Attribute or attribute name of the target.

• dataset (DataSet) – The DataSet of the attributes.

• affine (bool) – True if the algorithm should fit an intercept term.

• l2reg (float) – The L2 regularization parameter.

• weight (Union[Attribute, str, None]) – Optional column used to weight each sample. Implies generalized regression.

• offset (Union[Attribute, str, None]) – Optional column for offset in offset regression. Implies generalized regression.

• max_iters (Optional[int]) – Optional maximum number of iterations for fitting the regression. Note that regardless of this setting, the system often stops before reaching max_iters, e.g. after a single iteration. In such cases, a higher value of max_iters may lead to less privacy being allocated to each iteration and, ultimately, to a higher randomization effect.

• link (Optional[str]) – Optional link function between mean of label distribution and prediction. Implies generalized regression. Possible values are ‘logit’ (default), ‘probit’ and ‘cloglog’.

Returns

Analysis of the regression problem. It can be executed using the run() method to output the calibrated model.

Return type

GenLinAnalysis
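To make the link options concrete, the sketch below evaluates the textbook inverse link functions for 'logit', 'probit' and 'cloglog', each mapping a linear combination of features to a probability. It is plain Python for illustration only, not LeapYear code; the function name inverse_link is hypothetical.

```python
import math


def inverse_link(eta: float, link: str = "logit") -> float:
    """Map a linear predictor eta to a probability using a standard inverse link."""
    if link == "logit":
        return 1.0 / (1.0 + math.exp(-eta))
    if link == "probit":
        # Standard normal CDF written in terms of the error function.
        return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))
    if link == "cloglog":
        return 1.0 - math.exp(-math.exp(eta))
    raise ValueError(f"unknown link: {link}")


# Compare the three links at the same value of the linear predictor.
for name in ("logit", "probit", "cloglog"):
    print(name, round(inverse_link(0.5, name), 4))
```

The logit and probit links are symmetric around zero, while cloglog is not, which is one reason to try an alternative link when one class is rare.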

leapyear.analytics.gradient_boosted_tree_classifier(xs, y, dataset, max_depth=3, max_iters=5, max_bins=32)

This analysis trains a randomized variant of a gradient boosted tree classifier to predict a BOOLEAN outcome (target).

The algorithm works by iteratively training individual decision trees to predict a “residual” of the model built so far, and then integrating each newly built decision tree into the ensemble model to better predict the probability of the positive label.

Weights are used at different stages:

• during training of individual decision trees, to focus attention on the areas where the model consistently underperforms, and

• when combining individual decision trees to predict probability of the positive label.

A calibrated level of randomization is applied to individual leaves of the decision trees to help protect the privacy of the individual records used for model training.
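The iterative residual-fitting idea described above can be sketched in plain Python. This is a simplified, non-private illustration and not LeapYear's implementation: it uses depth-1 trees (stumps) on a single numeric feature, omits the sample weights and the leaf randomization, and all function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def train_gbt(xs, ys, max_iters=5, learning_rate=1.0):
    """Fit one stump per iteration to the residual of the ensemble so far."""
    scores = [0.0] * len(xs)
    stumps = []
    for _ in range(max_iters):
        # Residual = label minus current predicted probability.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        scores = [
            s + learning_rate * (left_value if x <= threshold else right_value)
            for x, s in zip(xs, scores)
        ]
    return stumps


def predict_proba(stumps, x, learning_rate=1.0):
    """Probability of the positive label from the summed stump outputs."""
    score = sum(learning_rate * (l if x <= t else r) for t, l, r in stumps)
    return sigmoid(score)


xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
stumps = train_gbt(xs, ys, max_iters=5)
print([round(predict_proba(stumps, x), 3) for x in xs])
```

Each additional iteration pushes the predicted probabilities further toward the labels, which is why max_iters bounds the number of trees in the ensemble.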

Parameters
• xs (List[Union[Attribute, str]]) – A list of attributes or attribute names that are used as explanatory features for the analysis. Each attribute must be either BOOL, INT, REAL or FACTOR. Nullable types are not supported and must be converted to non-nullable - e.g. via coalesce.

• y (Union[Attribute, str]) – The attribute or attribute name that is used as an outcome (target) of the classification model. Must be BOOLEAN type, as only binary classification models are supported. Nullable types are not supported and must be converted to non-nullable - e.g. via coalesce.

• dataset (DataSet) – The DataSet containing both explanatory features and outcome attributes.

• max_depth (int) – The maximum depth (or height) of any tree in the ensemble produced by the algorithm. Default: 3

• max_iters (int) – The maximum number of iterations of the algorithm. This corresponds to the maximum number of individual decision trees in the ensemble. Default: 5

• max_bins (int) –

The maximum number of bins for features used in constructing trees. Default: 32

Note

Maximum number of bins should be set to no less than the number of distinct possible values of the FACTOR attributes used as explanatory features.

Returns

Analysis that will train the gradient boosted tree classifier. It can be executed using the run() method.

Return type

leapyear.analytics.random_forest(xs, y, dataset, n_trees=100, height=3)

Analysis: Random Forest Classifier.

Generate a random forest model to predict probability associated with each target class.

Random forests combine many decision trees in order to reduce the risk of overfitting.

Each decision tree is developed on a random subset of observations - and is limited to prescribed height.

Individual node split decisions are made to maximize split value (or gain) - with a variation that a differentially private algorithm is used to count the number of observations belonging to each target class on both sides of the split.

Specifically, split value (or gain) is defined as reduction in combined Gini impurity measure, associated with introducing the split for a given parent node. Here

• Gini impurity for any given node (parent or child) is calculated based on distribution of observations within the node across different outcome (target) classes

• To compute combined impurity of the pair of nodes, individual node impurities for the two children nodes are averaged proportionately to their share of observations

Categorical features are typically handled by evaluating various splits corresponding to random subsets of the available categories.
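The split value defined above can be written out as a short calculation. The sketch below is an illustration in plain Python that uses exact class counts; in the analysis itself the counts come from a differentially private algorithm, so the values would be noisy. The function names are hypothetical.

```python
def gini(counts):
    """Gini impurity of a node, given the number of observations per class."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_gain(left_counts, right_counts):
    """Reduction in Gini impurity from splitting a parent into two children."""
    parent_counts = [l + r for l, r in zip(left_counts, right_counts)]
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    # Children impurities are averaged in proportion to their share of observations.
    combined = (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)
    return gini(parent_counts) - combined


# A split that separates two balanced classes perfectly.
print(split_gain([5, 0], [0, 5]))
# A split that leaves the class mix unchanged in both children.
print(split_gain([2, 2], [3, 3]))
```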

Parameters
Returns

Analysis training the random forest model. It can be executed using the run() method to output the analysis results which include the calibrated random forest model, feature importance statistics, etc.

Return type

ForestModelAnalysis

leapyear.analytics.regression_trees(xs, y, dataset, n_trees=100, height=3)

Analysis: Random Forest Regressor (regression trees).

Generate a regression trees model to predict value of target variable.

Regression trees are built similarly to random forests, but instead of predicting the probability that the target variable takes a certain categorical value (i.e., classification), they predict a real value of the target variable (i.e., regression).

The impurity metric in this case is the variance of the target variable for the datapoints that fall into the current node’s partition.
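As an illustration of variance used as an impurity metric, the sketch below computes the variance reduction of a candidate split. It assumes that child impurities are combined in proportion to their share of observations, mirroring the random forest description; that weighting, the exact (non-private) arithmetic and the function names are assumptions of this sketch.

```python
def variance(values):
    """Population variance of the target values that fall into a node."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_reduction(left, right):
    """Drop in variance impurity when a parent node is split into left and right."""
    parent = left + right
    n = len(parent)
    combined = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(parent) - combined


# Splitting [1, 1, 3, 3] into [1, 1] and [3, 3] removes all variance.
print(variance_reduction([1.0, 1.0], [3.0, 3.0]))
```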

Parameters
Returns

Analysis training the regression trees model. It can be executed using the run() method to output the analysis results which include the calibrated random forest model, feature importance statistics, etc.

Return type

ForestModelAnalysis

leapyear.analytics.eval_linreg(glm, xs, y, dataset, metric='mse')

Analysis: Evaluate a linear regression model.

Parameters
• glm (GLM) – The model (generated using linreg) to evaluate

• xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.

• y (Union[Attribute, str]) – The attribute name that is the outcome.

• dataset (DataSet) – The DataSet of the attributes.

• metric (Union[str, Metric]) –

Linear regression evaluation metric: 'mse'/'mean_squared_error' or 'mae'/'mean_absolute_error'.

Note

During the calculation of mse and mae metrics, the individual values of absolute error are restricted to be no greater than the length of the interval of possible values of the target attribute, as seen in the dataset schema. For example, if the target attribute contains values in the interval [-50, 50], then the absolute error of any individual prediction used in computing the mean will be no greater than 100.

Returns

Analysis representing evaluation of a regression model. It can be executed using the run() method to output evaluation metric value.

Return type

ScalarAnalysis
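The error clipping described in the note on the metric parameter can be illustrated as follows. This is a non-private sketch in plain Python; it assumes that mse squares the clipped absolute error, and the function names are hypothetical.

```python
def clipped_errors(y_true, y_pred, lower, upper):
    """Absolute errors, each capped at the width of the target interval."""
    max_error = upper - lower
    return [min(abs(t - p), max_error) for t, p in zip(y_true, y_pred)]


def clipped_mae(y_true, y_pred, lower, upper):
    errors = clipped_errors(y_true, y_pred, lower, upper)
    return sum(errors) / len(errors)


def clipped_mse(y_true, y_pred, lower, upper):
    errors = clipped_errors(y_true, y_pred, lower, upper)
    return sum(e ** 2 for e in errors) / len(errors)


# Target values lie in [-50, 50], so no single error counts for more than 100.
y_true = [50.0, 0.0]
y_pred = [-200.0, 10.0]  # the first prediction is off by 250
print(clipped_mae(y_true, y_pred, -50.0, 50.0))
print(clipped_mse(y_true, y_pred, -50.0, 50.0))
```

Without clipping, the first prediction alone would contribute an absolute error of 250; with clipping it contributes 100.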

leapyear.analytics.eval_logreg(glm, xs, y, dataset, metric='accuracy')

Analysis: Evaluate a logistic regression model.

Parameters
Returns

Analysis representing evaluation of a logistic regression model. It can be executed using the run() method to output evaluation metric value.

Return type

ScalarAnalysis

leapyear.analytics.eval_multinomial_logreg(glm, xs, y, dataset, metric='accuracy')

Analysis: Evaluate a multinomial logistic regression model.

Note

This method has been disabled for now, pending important improvements.

Parameters
Returns

Analysis representing evaluation of a multinomial regression model. It can be executed using the run() method to output evaluation metric value.

Return type

ScalarAnalysis

Deprecated: This method has been disabled for now, pending important improvements.

leapyear.analytics.eval_gbt_classifier(gbt, xs, y, dataset, metric='accuracy')

Analysis: Evaluate a gradient boosted tree (GBT) classifier model.

Parameters
Returns

Analysis representing evaluation of a GBT classifier model. It can be executed using the run() method to output the value of the evaluation metric.

Return type

ScalarAnalysis

leapyear.analytics.eval_random_forest(rf, xs, y, dataset, metric='accuracy')

Analysis: Evaluate a random forest model.

Parameters
Returns

Analysis representing evaluation of a random forest model. It can be executed using the run() method to output evaluation metric value.

Return type

ScalarAnalysis

leapyear.analytics.eval_regression_trees(rf, xs, y, dataset, metric='mse')

Analysis: Evaluate a regression trees model.

Parameters
• rf (RandomForest) – The model (generated using regression_trees) to evaluate

• xs (List[Union[Attribute, str]]) – A list of attribute names that are the features.

• y (Union[Attribute, str]) – The attribute name that is the outcome.

• dataset (DataSet) – The DataSet of the attributes.

• metric (Union[str, Metric]) –

Model evaluation metric. Examples: 'mse'/'mean_squared_error' or 'mae'/'mean_absolute_error'.

Note

During the calculation of mse and mae metrics, the individual values of absolute error are restricted to be no greater than the length of the interval of possible values of the target attribute, as seen in the dataset schema. For example, if the target attribute contains values in the interval [-50, 50], then the absolute error of any individual prediction used in computing the mean will be no greater than 100.

Returns

Analysis representing evaluation of a regression trees model. It can be executed using the run() method to output evaluation metric value.

Return type

ScalarAnalysis

leapyear.analytics.roc(model, xs, y, dataset, thresholds=5)

Compute the ConfusionCurves.

For each threshold value, compute the normalized confusion matrix using the model. The confusion matrix contains the true positive rate, the true negative rate, the false positive rate and the false negative rate.
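A minimal sketch of the per-threshold computation is shown below. It is a non-private illustration in plain Python and assumes two things that the text does not specify: that thresholds are evenly spaced strictly between 0 and 1, and that "normalized" means each rate is divided by the number of actual positives or negatives. The function names are hypothetical.

```python
def confusion_rates(scores, labels, threshold):
    """Rates of the confusion matrix when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and not y)
    positives, negatives = tp + fn, fp + tn
    return {
        "tpr": tp / positives,
        "fnr": fn / positives,
        "fpr": fp / negatives,
        "tnr": tn / negatives,
    }


def confusion_curve(scores, labels, thresholds=5):
    """One (threshold, rates) pair per evenly spaced threshold in (0, 1)."""
    cut_points = [(i + 1) / (thresholds + 1) for i in range(thresholds)]
    return [(t, confusion_rates(scores, labels, t)) for t in cut_points]


scores = [0.1, 0.4, 0.6, 0.9]          # predicted probability of the positive label
labels = [False, False, True, True]    # actual outcomes
for threshold, rates in confusion_curve(scores, labels, thresholds=5):
    print(round(threshold, 3), rates)
```

Plotting the true positive rate against the false positive rate across the thresholds gives the familiar ROC curve.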

Parameters
Returns

Analysis of the confusion curve, which can be executed using the run() method to output various evaluation metrics.

Return type

ConfusionModelAnalysis

leapyear.analytics.cross_val_score_linreg(xs, y, dataset, *, affine=True, l1reg=1.0, l2reg=1.0, cv=3, metric='mean_squared_error', parameter_bounds=None)

Analysis: Compute the linear regression cross validation score of the set of attributes.

Parameters
Returns

Analysis of the cross-validation scores for the regression model. It can be executed using the run() method to generate cross-validation results.

Return type

CrossValidationAnalysis

leapyear.analytics.cross_val_score_logreg(xs, y, dataset, cv=3, affine=True, l1reg=0.1, l2reg=0.1, metric='accuracy')

Analysis: Compute the logistic regression cross validation score of the set of attributes.

Parameters
Returns

Analysis of the cross-validation scores for the regression model. It can be executed using the run() method to generate cross-validation results.

Return type

CrossValidationAnalysis

leapyear.analytics.cross_val_score_multinomial(xs, y, dataset, cv=3, affine=True, l1reg=0.1, l2reg=0.1, metric='accuracy')

Analysis: Compute the multinomial regression cross validation score of the set of attributes.

Note

This method has been disabled for now, pending important improvements.

Parameters
Returns

Analysis of the cross-validation scores for the regression model. It can be executed using the run() method to generate cross-validation results.

Return type

CrossValidationAnalysis

Deprecated: This method has been disabled for now, pending important improvements.

leapyear.analytics.cross_val_score_random_forest(xs, y, dataset, n_trees=100, height=3, cv=3, metric='mean_squared_error')

Analysis: Compute the random forest cross validation score of the set of attributes.

Parameters
Returns

Analysis of the cross-validation scores for the random forest model. It can be executed using the run() method to generate cross-validation results.

Return type

CrossValidationAnalysis

leapyear.analytics.cross_val_score_regression_trees(xs, y, dataset, n_trees=100, height=3, cv=3, metric='mean_squared_error')

Analysis: Compute the regression trees cross validation score of the set of attributes.

Parameters
Returns

Analysis of the cross-validation scores for the regression trees model. It can be executed using the run() method to generate cross-validation results.

Return type

CrossValidationAnalysis

leapyear.analytics.hyperopt_linreg(xs, y, dataset, *, cv, train_fraction, metric, n_iter=100, l1_bounds=(1e-10, 10000000000.0), l2_bounds=(1e-10, 10000000000.0), fit_intercept=None, parameter_bounds=None)

Analysis: Hyperparameter optimization for linear regression.

Calibrate a linear regression model by optimizing its cross-validation score with respect to model hyperparameters - L_1 and L_2 regularization parameters and presence of intercept.

See below for a pseudo-code of the algorithm:

Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = []
For 1..n_iter
    pick a set of hyperparameters (hp) to test based on cv_history.
    use hp to calibrate a model on each cross-validation set
    evaluate it on corresponding sample set-aside for cross-validation
    compute an average cv score and append it to cv_history.
Pick the hyper parameters with the best cv score.
Train a model using the complete ds_train_val data set.
Evaluate the model on the holdout data set.
Return resulting model and its performance on the holdout set.
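The pseudo-code above can be sketched as runnable Python. The sketch makes several simplifying assumptions and is not LeapYear's implementation: it is not differentially private, it fits a one-feature ridge regression without an intercept in closed form, it tunes only the L_2 parameter, and it substitutes plain random search for the history-based selection step (so cv_history is recorded but not used to pick the next candidate). All function names are hypothetical.

```python
import math
import random


def fit_ridge(points, l2):
    """Closed-form 1-D ridge regression without intercept."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + l2)


def mse(w, points):
    return sum((y - w * x) ** 2 for x, y in points) / len(points)


def k_fold(points, cv):
    """Yield cv pairs (ds_train, ds_val)."""
    for i in range(cv):
        ds_val = points[i::cv]
        ds_train = [p for j, p in enumerate(points) if j % cv != i]
        yield ds_train, ds_val


def hyperopt_sketch(points, cv, train_fraction, n_iter, l2_bounds, rng):
    points = points[:]
    rng.shuffle(points)
    n_train_val = int(len(points) * train_fraction)
    ds_train_val, ds_holdout = points[:n_train_val], points[n_train_val:]
    log_lo, log_hi = math.log10(l2_bounds[0]), math.log10(l2_bounds[1])
    cv_history = []
    for _ in range(n_iter):
        # Random search: sample l2 log-uniformly within the bounds.
        l2 = 10 ** rng.uniform(log_lo, log_hi)
        scores = [mse(fit_ridge(tr, l2), val) for tr, val in k_fold(ds_train_val, cv)]
        cv_history.append((sum(scores) / len(scores), l2))
    _, best_l2 = min(cv_history)  # lowest average cv error wins
    model = fit_ridge(ds_train_val, best_l2)
    return model, best_l2, mse(model, ds_holdout)


rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))  # true slope is 2

w, best_l2, holdout_mse = hyperopt_sketch(
    data, cv=3, train_fraction=0.8, n_iter=30, l2_bounds=(1e-10, 1e10), rng=rng
)
print(w, best_l2, holdout_mse)
```

The holdout set is touched only once, after the hyperparameters are fixed, so the reported score is not biased by the search.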

Parameters
Returns

Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including

1. model calibrated with recommended hyperparameters and

2. its performance on the holdout dataset.

Return type

HyperOptAnalysis

Paper with hyperparameter optimization algorithm:

http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf

leapyear.analytics.hyperopt_logreg(xs, y, dataset, cv, train_fraction, metric, n_iter=100, l1_bounds=(1e-10, 10000000000.0), l2_bounds=(1e-10, 10000000000.0), fit_intercept=None)

Analysis: Hyperparameter optimization for logistic regression.

Calibrate a logistic regression model by optimizing its cross-validation score with respect to model hyperparameters - L_1 and L_2 regularization parameters and presence of intercept.

See below for a pseudo-code of the algorithm:

Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = []
For 1..n_iter
    pick a set of hyperparameters (hp) to test based on cv_history.
    use hp to calibrate a model on each cross-validation set
    evaluate it on corresponding sample set-aside for cross-validation
    compute an average cv score and append it to cv_history.
Pick the hyper parameters with the best cv score.
Train a model using the complete ds_train_val data set.
Evaluate the model on the holdout data set.
Return resulting model and its performance on the holdout set.

Parameters
Returns

Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including

1. model calibrated with recommended hyperparameters and

2. its performance on the holdout dataset.

Return type

HyperOptAnalysis

Paper with hyperparameter optimization algorithm:

http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf

leapyear.analytics.hyperopt_multinomial(xs, y, dataset, cv, train_fraction, metric, n_iter=100, l1_bounds=(1e-10, 10000000000.0), l2_bounds=(1e-10, 10000000000.0), fit_intercept=None)

Analysis: Hyperparameter optimization for multinomial logistic regression.

Note

This method has been disabled for now, pending important improvements.

Calibrate a multinomial logistic regression model by optimizing its cross-validation score with respect to model hyperparameters - L_1 and L_2 regularization parameters and presence of intercept.

See below for a pseudo-code of the algorithm:

Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = []
For 1..n_iter
    pick a set of hyperparameters (hp) to test based on cv_history.
    use hp to calibrate a model on each cross-validation set
    evaluate it on corresponding sample set-aside for cross-validation
    compute an average cv score and append it to cv_history.
Pick the hyper parameters with the best cv score.
Train a model using the complete ds_train_val data set.
Evaluate the model on the holdout data set.
Return resulting model and its performance on the holdout set.

Parameters
Returns

Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including

1. model calibrated with recommended hyperparameters and

2. its performance on the holdout dataset.

Return type

HyperOptAnalysis

Paper with hyperparameter optimization algorithm:

http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf

Deprecated: This method has been disabled for now, pending important improvements.

leapyear.analytics.hyperopt_rf(xs, y, dataset, cv, train_fraction, metric, n_iter=100, max_trees=1000, max_depth=20)

Analysis: Hyperparameter optimization for a random forest model.

Calibrate a random forest model by optimizing its cross-validation score with respect to model hyperparameters - number of trees and individual tree depth (or height) limit.

See below for a pseudo-code of the algorithm:

Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = []
For 1..n_iter
    pick a set of hyperparameters (hp) to test based on cv_history.
    use hp to calibrate a model on each cross-validation set
    evaluate it on corresponding sample set-aside for cross-validation
    compute an average cv score and append it to cv_history.
Pick the hyper parameters with the best cv score.
Train a model using the complete ds_train_val data set.
Evaluate the model on the holdout data set.
Return resulting model and its performance on the holdout set.

Parameters
Returns

Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including

1. model calibrated with recommended hyperparameters and

2. its performance on the holdout dataset.

Return type

HyperOptAnalysis

Paper with hyperparameter optimization algorithm:

http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf

leapyear.analytics.hyperopt_regression_trees(xs, y, dataset, cv, train_fraction, metric, n_iter=100, max_trees=1000, max_depth=20)

Analysis: Hyperparameter optimization for a regression trees model.

Calibrate a regression trees model by optimizing its cross-validation score with respect to model hyperparameters - number of trees and individual tree depth (or height) limit.

See below for a pseudo-code of the algorithm:

Split the dataset into ds_train_val/ds_holdout based on train_fraction.
Use k-fold cross validation to split ds_train_val into cv pairs (ds_train, ds_val).
Initialize cv_history = []
For 1..n_iter
    pick a set of hyperparameters (hp) to test based on cv_history.
    use hp to calibrate a model on each cross-validation set
    evaluate it on corresponding sample set-aside for cross-validation
    compute an average cv score and append it to cv_history.
Pick the hyper parameters with the best cv score.
Train a model using the complete ds_train_val data set.
Evaluate the model on the holdout data set.
Return resulting model and its performance on the holdout set.

Parameters
Returns

Analysis object representing the model calibration process with hyperparameter optimization. It can be executed using the run() method to output the analysis results, including

1. model calibrated with recommended hyperparameters and

2. its performance on the holdout dataset.

Return type

HyperOptAnalysis

Paper with hyperparameter optimization algorithm:

http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf

## Context Managers¶

leapyear.analytics.ignore_computation_cache()

Temporary context where computations do not utilize the computation cache.

The computation cache is intended to prevent wasting privacy exposure on queries that were previously computed. Entering this context manager will disable the use of the cache and allow repeated computations to return different differentially private answers.

Example

An administrator wants to run a count multiple times to estimate the random distribution of responses around the precise value.

>>> import leapyear.analytics as la
>>> from leapyear.analytics import ignore_computation_cache
>>> with ignore_computation_cache():
...     results = [la.count_rows(table).run() for _ in range(10)]


Note

Additional permissions may be required to disable the computation cache.

Return type

None
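The effect of the computation cache on repeated queries can be illustrated with a toy simulation. Everything here is hypothetical and unrelated to LeapYear's internals: the ToyServer class, the Laplace-distributed noise and the cache keyed by a query name are assumptions chosen only to show why cached answers repeat while uncached answers vary.

```python
import random
from contextlib import contextmanager


class ToyServer:
    """Toy stand-in for a server that returns noisy counts and caches them."""

    def __init__(self, true_count, scale=2.0, seed=0):
        self._true_count = true_count
        self._scale = scale
        self._rng = random.Random(seed)
        self._cache = {}
        self.use_cache = True

    def noisy_count(self, query_key):
        if self.use_cache and query_key in self._cache:
            return self._cache[query_key]
        # Difference of two exponentials is Laplace-distributed noise.
        rate = 1.0 / self._scale
        noise = self._rng.expovariate(rate) - self._rng.expovariate(rate)
        answer = self._true_count + noise
        if self.use_cache:
            self._cache[query_key] = answer
        return answer

    @contextmanager
    def ignore_cache(self):
        """Temporarily bypass the cache, restoring the old setting on exit."""
        previous = self.use_cache
        self.use_cache = False
        try:
            yield
        finally:
            self.use_cache = previous


server = ToyServer(true_count=1000)
cached = [server.noisy_count("rows") for _ in range(5)]
with server.ignore_cache():
    fresh = [server.noisy_count("rows") for _ in range(5)]

print(cached)  # the same noisy answer five times
print(fresh)   # five independent noisy answers around the true count
```

In the toy model each fresh answer draws new noise, which corresponds to spending additional privacy exposure in a real system; the cache exists to avoid exactly that for repeated queries.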

leapyear.analytics.precise_computations(precise=True)

Temporary context specifying if the computations are precise or not.

Computations requested within this context would be executed in precise mode, where differential privacy is not applied.

Parameters

precise (bool) – True to enable precise computations within the context, False to disable them.

Example

An administrator wants to compare the responses of a number of computations with and without differential privacy applied. Precise mode may not be available for all computations.

>>> def my_computation():
...     symbols = ("APPL", "GOOG", "MSFT")
...     return [la.count_rows(table.where(col("SYM") == lit(val))).run() for val in symbols]
...
>>> res_dp = my_computation()
>>> with precise_computations():
...     res_no_dp = my_computation()


Note

Additional permissions may be necessary to enable precise computations.

Return type

None

Utilities for saving and loading LeapYear machine learning models.

leapyear.ml_import_export.save(model, path_or_fd)

Save a machine learning model as JSON to either a file or a file-like object.

Parameters
• model (Union[~_Model, RichResult[~_Model, ~_ModelMetadata]]) – Any machine learning model produced by executing an analysis with the run() method.

• path_or_fd – The file system path where the file should be saved, or a file-like object such as an in-memory stream.

Example

>>> from leapyear.ml_import_export import save
>>> save(model, 'model.json')

Return type

None
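The path_or_fd convention, where a single argument accepts either a file system path or an open file-like object, can be sketched with the standard library alone. The helpers save_json and load_json and the toy_model dictionary below are hypothetical; they illustrate the calling pattern only and say nothing about the JSON layout LeapYear actually writes.

```python
import io
import json
import os
import tempfile


def save_json(obj, path_or_fd):
    """Write obj as JSON to a path, or to an already open text stream."""
    if isinstance(path_or_fd, (str, os.PathLike)):
        with open(path_or_fd, "w", encoding="utf-8") as fd:
            json.dump(obj, fd)
    else:
        json.dump(obj, path_or_fd)


def load_json(path_or_fd):
    """Read JSON from a path, or from an already open text stream."""
    if isinstance(path_or_fd, (str, os.PathLike)):
        with open(path_or_fd, "r", encoding="utf-8") as fd:
            return json.load(fd)
    return json.load(path_or_fd)


# Hypothetical stand-in for a trained model.
toy_model = {"type": "GLM", "coefficients": [0.5, -1.2], "intercept": 0.1}

# Round trip through an in-memory stream.
buffer = io.StringIO()
save_json(toy_model, buffer)
buffer.seek(0)
from_buffer = load_json(buffer)

# Round trip through a file on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.json")
    save_json(toy_model, path)
    from_file = load_json(path)

print(from_buffer == toy_model, from_file == toy_model)
```

Accepting a stream as well as a path makes it possible to store models somewhere other than the local file system without a temporary file.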

leapyear.ml_import_export.load(path_or_fd, expected_model_type=None)

Load a machine learning model from either a file or a file-like object.

Parameters
• path_or_fd – The file system path or an in-memory stream from which to load the model.

• expected_model_type – If None, the type of the loaded model is not checked. Otherwise, the loaded model is checked to be of the expected type.

Examples

>>> from leapyear.ml_import_export import load

>>> from leapyear.model import RandomForest