Module leapyear.dataset

DataSet and Attribute.

DataSets combine a data source (Table) with data transformations on which subsequent computations can be performed. A data set has a schema which describes the types of the attributes (columns) in the data source. Attributes can be manipulated as if they were built-in Python types (int, float, bool, …). DataSet also provides the following lazy methods for transforming data:

  • project(): select one or more attributes (columns) from the data source.

  • where(): select rows that satisfy a filter condition.

  • union(): combine two data sets with matching schemas.

  • split(): create subsets whose size is a fraction of the total data size.

  • splits(): yield all fractional partitions of the data set.

  • stratified_split(): create subsets of the data with fixed attribute prevalence.

  • group_by(): create aggregated views of the data set.

  • join(): combine rows of two datasets where certain data elements match.

  • transform(): apply a linear transformation to the specified data elements.

Further details on each transformation can be found in its documentation below. Transformations are lazy in the sense that they are not evaluated until a computation is executed; however, each transformation requires the DataSet schema to be re-evaluated, which relies on a live connection to the LeapYear server.
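As an illustration of this lazy pattern (using hypothetical names, not the LeapYear API), transformations can be recorded cheaply and only evaluated when results are requested:

```python
# Minimal sketch of lazy evaluation: each transformation records a step;
# nothing is computed until run() is called. Illustrative only -- the
# LeapYear DataSet works analogously but evaluates on the server.
class LazyPipeline:
    def __init__(self, rows, steps=None):
        self.rows = rows
        self.steps = steps or []

    def where(self, predicate):
        # Recording the step is cheap; no data is touched yet.
        return LazyPipeline(self.rows, self.steps + [("where", predicate)])

    def run(self):
        # Only now is the recorded plan evaluated.
        out = self.rows
        for kind, fn in self.steps:
            if kind == "where":
                out = [r for r in out if fn(r)]
        return out

pipeline = LazyPipeline([1, -2, 3]).where(lambda x: x > 0)
# No filtering has happened yet; run() evaluates the plan.
result = pipeline.run()  # [1, 3]
```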

Computations and machine learning analytics that process the data set are found in leapyear.analytics.

The examples below rely on a connection to a LeapYear server, which can be established as follows:

>>> from leapyear import Client
>>> client = Client(url='http://ly-server:4408', username='admin', password='password')

Examples

  • Load a dataset and examine its schema:

>>> pds = DataSet.from_table('db.table')
>>> pds.schema
OrderedDict([
    ('attr1', Type(tag=BOOL, contents=())),
    ('attr2', Type(tag=INT, contents=(0, 20))),
    ('attr3', Type(tag=REAL, contents=(-1, 1))),
])
>>> pds.attributes
['attr1', 'attr2', 'attr3']
  • Create a new attribute and find its type:

>>> pds['attr2_gt_10'] = pds['attr2'] > 10
>>> pds.attributes
['attr1', 'attr2', 'attr3', 'attr2_gt_10']
>>> pds.schema['attr2_gt_10']
Type(tag=BOOL, contents=())
  • Select instances using a predicate:

>>> pds_positive_attr3 = pds.where(pds['attr3'] > 0)
>>> pds_positive_attr3.schema['attr3']
Type(tag=REAL, contents=(0, 1))
  • Calculate the mean of a single attribute:

>>> import leapyear.analytics as analytics
>>> mean_analysis = analytics.mean(pds_positive_attr3['attr3'])
>>> mean_analysis
computation: MEAN(attr3 > 0)
attributes:
    attr3: db.table.attr3 (0 <= x <= 1)
>>> mean_analysis.run()  # run the computation on the LeapYear server.
0.02

DataSet class

class leapyear.dataset.DataSet(relation)

DataSet object.

property relation

Get the DataSet relation.

Return type

Relation

property schema

Get the DataSet schema, including data types, values allowed.

Return type

Schema

classmethod from_view(view)

Create a dataset from a view.

Parameters

view (Union[str, View, ViewIdentifier]) – An identifier for the view on the LeapYear server.

Returns

The dataset with a LeapYear view as its source.

Return type

DataSet

classmethod from_table(table, slices=None)

Create a dataset from a table.

Parameters
  • table (Union[str, Table, TableIdentifier]) – An identifier for the table on the LeapYear server.

  • slices (Optional[List[Tuple[int, int]]]) – A list of ranges of table slices to use.

Returns

The dataset with a LeapYear table as its source.

Return type

DataSet

property attributes

Return the DataSet attributes.

Return type

Iterable[Attribute]

get_attribute(key)

Select one attribute from the data set.

Parameters

key (Union[str, Attribute]) – The item to project.

Returns

The projected attribute.

Return type

Attribute

drop_attributes(keys)

Drop Attributes.

Parameters

keys (Iterable[str]) – A list of attribute names to drop.

Returns

New DataSet without the specified attributes.

Return type

DataSet

drop_attribute(key)

Drop an attribute.

Parameters

key (str) – The attribute name to drop.

Returns

New DataSet without the specified attribute.

Return type

DataSet

with_attributes(name_attrs)

Return a new DataSet with additional attributes.

Parameters

name_attrs (Mapping[str, Any]) – A dictionary associating new attribute names to expressions. Attribute expressions can include python literals and references to existing attributes.

Returns

The DataSet with new attributes appended.

Return type

DataSet

with_attribute(name, attr)

Return a new DataSet with an additional attribute.

Parameters
  • name (str) – Name of the new attribute

  • attr (Any) – Attribute expression, can include python literals and references to existing attributes.

Returns

New DataSet with new attribute appended.

Return type

DataSet

with_attributes_renamed(rename)

Rename attributes using a mapping.

Parameters

rename (Mapping[str, str]) – Dictionary mapping old names to new names.

Returns

New DataSet with renamed attributes.

Return type

DataSet

Examples

1. Create a new dataset ds2 by renaming columns ‘zip_code’ and ‘name’ to ‘ZipCode’ and ‘Name’ respectively. Rename attributes in another dataset ds3 in the same way. The resulting datasets can then be merged using union:

>>> ds2 = ds1.with_attributes_renamed({'zip_code':'ZipCode','name':'Name'})
>>> ds4 = ds3.with_attributes_renamed({'zipcode':'ZipCode','name_str':'Name'})
>>> ds4 = ds4.union(ds2)

with_attribute_renamed(old_name, new_name)

Rename an attribute.

Parameters
  • old_name (str) – Old attribute name.

  • new_name (str) – New attribute name.

Returns

New DataSet with this attribute renamed.

Return type

DataSet

map_attributes(name_lambdas)

Map attributes and create a new DataSet.

Parameters

name_lambdas (Mapping[str, Callable[[Attribute], Attribute]]) – Mapping from attribute names to functions.

Returns

New DataSet with mapped attributes.

Return type

DataSet

map_attribute(name, func)

Map an attribute and create a new DataSet.

Parameters
  • name (str) – The attribute name to map.

  • func (Callable[[Attribute], Attribute]) – Function to convert the attribute.

Returns

New DataSet with mapped attributes.

Return type

DataSet

project(new_keys)

Select multiple attributes from the data set.

Projection (π): Creates a new DataSet with only selected attributes from this DataSet.

Parameters

new_keys (Iterable[str]) – The list of attributes.

Returns

The new DataSet containing only the selected attributes.

Return type

DataSet

select(*attrs)

Create a new DataSet by selecting column names or Attributes.

Parameters

attrs (Union[str, Attribute]) – Attribute name(s) or Attribute object(s), separated by commas.

Returns

The new DataSet containing the selected attributes.

Return type

DataSet

select_as(mapping)

Create a new DataSet by mapping from column names or Attributes to new names.

Parameters

mapping (Mapping[str, Union[str, Attribute]]) – A dictionary associating new attribute names to expressions. Attribute expressions can include python literals and references to existing attributes.

Returns

The new DataSet containing the selected attributes.

Return type

DataSet

where(clause)

Select/filter rows using a filter expression.

Selection (σ): LeapYear’s where clause creates a new DataSet based on the filter condition. Its schema may reflect a smaller domain of possible values.

Parameters

clause (Attribute) – A filter Attribute: a single attribute of type BOOL.

Returns

A filtered DataSet.

Return type

DataSet

union(other, distinct=False)

Union or concatenate datasets.

Union (∪): Concatenates data sets with matching schema.

Parameters
  • other (DataSet) –

    The dataset to union with. Note: schema of other must match that of self, including the order of the attributes. The order of attributes can be aligned like so:

    >>> ds2 = ds2[list(ds1.schema.keys())]
    >>> ds_union = ds1.union(ds2)
    

  • distinct (bool) – Remove duplicate rows.

Returns

A combined dataset.

Return type

DataSet

join(other, on, right_on=None, join_type='inner', unique_left_keys=False, unique_right_keys=False, left_suffix='', right_suffix='', **kwargs)

Combine two DataSets by joining using a key, as in SQL JOIN statement.

If right_k (or left_k) is specified, the right (or left) data set will include no more than k rows for each matching row in the left (or right) data set.

Parameters
  • other (DataSet) – The other DataSet to join with.

  • on (Union[str, List[str]]) – The key(s) to join on. If using suffixes, the suffix must NOT be appended by the caller.

  • right_on (Union[str, List[str], None]) – The key(s) on the right table, if different than those in on. None if both tables have the same key names. If using suffixes, the suffix must NOT be appended by the caller.

  • join_type (str) –

    LeapYear supports the joins listed in the left column of the table below; any join_type value from the right column selects that join. If no value is specified, an inner join is used. A left_outer_public join can be run only when the right table is public.

        Join               join_type
        -----------------  ---------------------------------------------
        Inner              "inner" (default)
        Outer              "outer", "full", "full_outer", or "fullouter"
        Left               "left", "left_outer", or "leftouter"
        Right              "right", "right_outer", or "rightouter"
        Left Antijoin      "left_anti" or "leftanti"
        Left Semijoin      "left_semi" or "leftsemi"
        Left Outer Public  "left_outer_public"

  • unique_left_keys (bool) – If the left DataSet is known to have unique keys, setting this to True will run an optimized join algorithm. Warning: setting this to True when the keys are not unique will cause data loss!

  • unique_right_keys (bool) – If the right DataSet is known to have unique keys, setting this to True will run an optimized join algorithm. Warning: setting this to True when the keys are not unique will cause some rows to be missing from the output DataSet!

  • left_suffix (str) – Optional suffix to append to the column names of the left DataSet.

  • right_suffix (str) – Optional suffix to append to the column names of the right DataSet.

  • cache (**kwargs) – Optional StorageLevel for persisting intermediate datasets to improve performance. The meaning of these values is documented in: https://spark.apache.org/docs/2.4.5/programming-guide.html#rdd-persistence

Returns

The joined DataSet.

Return type

DataSet

Examples

  1. Joining two datasets on a common key:

>>> ds1 = example_ds1.project(['key', 'col1'])
>>> ds2 = example_ds2.project(['key', 'col2'])
>>> ds = ds1.join(ds2, 'key')
>>> list(ds.schema.keys())
['key', 'col1', 'col2']
  2. Joining two datasets on a single key but with a different name on the right:

>>> ds1 = example_ds1.project(['key1', 'col1'])
>>> ds2 = example_ds2.project(['key2', 'col2'])
>>> ds = ds1.join(ds2, 'key1', right_on='key2')
>>> list(ds.schema.keys())
['key1', 'col1', 'key2', 'col2']
  3. Joining when a column is duplicated is an error:

>>> ds1 = example_ds1.project(['key', 'col1', 'col3'])
>>> ds2 = example_ds2.project(['key', 'col2', 'col3'])
>>> ds = ds1.join(ds2, 'key')
SchemaError: RepeatedAliases(['col3'])

  4. Joining when the key is missing from one relation: this fails because a right key has to be specified.

>>> ds1 = example_ds1.project(['key1', 'col1'])
>>> ds2 = example_ds2.project(['key2', 'col2'])
>>> ds = ds1.join(ds2, 'key1')
MissingVariable: ScopeError(MissingVariable(Variable('key1')))
  5. Joining with multiple keys:

>>> ds1 = example_ds1.project(['key1_1', 'key2_1', 'col1'])
>>> ds2 = example_ds2.project(['key1_2', 'key2_2', 'col2'])
>>> ds = ds1.join(ds2, ['key1_1', 'key2_1'], right_on=['key1_2', 'key2_2'])
>>> list(ds.schema.keys())
['key1_1', 'key2_1', 'col1', 'key1_2', 'key2_2', 'col2']
  6. Joining with different numbers of keys results in an error:

>>> ds1 = example_ds1.project(['key1_1', 'key2_1', 'col1'])
>>> ds2 = example_ds2.project(['key1_2', 'key2_2', 'col2'])
>>> ds = ds1.join(ds2, ['key1_1', 'key2_1'], right_on='key1_2')
JoinKeyLengthMismatch: SchemaError: JoinKeyLengthMismatch(...)
  7. Joining while specifying that the keys are unique:

>>> ds1 = example_ds1.project(['key1', 'col1'])
>>> ds2 = example_ds2.project(['key1', 'col2'])
>>> ds = ds1.join(ds2, 'key1', unique_left_keys=True, unique_right_keys=True)
>>> list(ds.schema.keys())
['key1', 'col1', 'col2']
  8. Different join types:

>>> ds1 = example_ds1.project(['key1', 'col1'])
>>> ds2 = example_ds2.project(['key1', 'col2'])
>>> ds = ds1.join(ds2, 'key1', join_type='outer')
>>> list(ds.schema.keys())
['key1', 'col1', 'col2']
  9. Left semi join:

>>> ds1 = example_ds1.project(['key1', 'col1'])
>>> ds2 = example_ds2.project(['key1', 'col2'])
>>> ds = ds1.join(ds2, 'key1', join_type='left_semi')
>>> list(ds.schema.keys())
['key1', 'col1']
  10. Joining when the nullability of keys is different:

>>> ds1 = example_ds1.select([col('key1').decode({0: 0}).alias('nkey1'), 'col1'])
>>> ds2 = example_ds2.project(['key2', 'col2'])
>>> ds = ds1.join(ds2, 'nkey1', right_on='key2')
>>> list(ds.schema.keys())
['nkey1', 'col1', 'key2', 'col2']
  11. Joining when keys have different but coercible types:

>>> realKey1 = col('intKey1').as_real().alias('realKey1')
>>> ds1 = example_ds1.select([realKey1, 'col1'])
>>> ds2 = example_ds2.project(['intKey2', 'col2'])
>>> ds = ds1.join(ds2, 'realKey1', right_on='intKey2')
>>> list(ds.schema.keys())
['realKey1', 'col1', 'intKey2', 'col2']
  12. Joining when keys are factors (upcast to a common type):

>>> keyFactor1 = col('key1').decode({k: 'A' for k in range(10)}).as_factor().alias('keyFactor1')
>>> keyFactor2 = col('key2').decode({k: 'B' for k in range(10)}).as_factor().alias('keyFactor2')
>>> ds1 = example_ds1.select([keyFactor1, 'col1'])
>>> ds2 = example_ds2.select([keyFactor2, 'col2'])
>>> ds = ds1.join(ds2, 'keyFactor1', right_on='keyFactor2')
>>> list(ds.schema.keys())
['keyFactor1', 'col1', 'keyFactor2', 'col2']
  13. Joining when the keys have mismatched types is an error (e.g. factor and bool):

>>> keyFactor1 = col('key1').as_factor().alias('keyFactor1')
>>> ds1 = example_ds1.select([keyFactor1, 'col1'])
>>> ds2 = example_ds2.project(['key2', 'col2'])
>>> ds = ds1.join(ds2, 'keyFactor1', right_on='key2')
JoinColumnTypeMismatch: SchemaError: JoinColumnTypeMismatch(...)
  14. Joining with suffixes to disambiguate column names:

>>> ds1 = example_ds1.project(['key', 'col1'])
>>> ds2 = example_ds2.project(['key', 'col2'])
>>> ds = ds1.join(ds2, 'key', left_suffix='_l', right_suffix='_r')
>>> list(ds.schema.keys())
['key_l', 'col1_l', 'key_r', 'col2_r']
  15. Joining with only left suffixes to disambiguate column names:

>>> ds1 = example_ds1.project(['key', 'col1'])
>>> ds2 = example_ds2.project(['key', 'col2'])
>>> ds = ds1.join(ds2, 'key', left_suffix='_l')
>>> list(ds.schema.keys())
['key_l', 'col1_l', 'key', 'col2']
  16. Joining with only right suffixes to disambiguate column names:

>>> ds1 = example_ds1.project(['key', 'col1'])
>>> ds2 = example_ds2.project(['key', 'col2'])
>>> ds = ds1.join(ds2, 'key', right_suffix='_r')
>>> list(ds.schema.keys())
['key', 'col1', 'key_r', 'col2_r']
  17. Joining with a specific cache level for intermediate caches:

>>> ds = ds1.join(ds2, on="key", cache=StorageLevel.DISK_ONLY, n_partitions=1)
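The join_type aliases listed in the table above can be captured in a small lookup; normalize_join_type below is an illustrative helper, not part of the LeapYear API:

```python
# Lookup of the documented join_type aliases to a canonical join name.
_JOIN_ALIASES = {
    "inner": "inner",
    "outer": "outer", "full": "outer", "full_outer": "outer", "fullouter": "outer",
    "left": "left", "left_outer": "left", "leftouter": "left",
    "right": "right", "right_outer": "right", "rightouter": "right",
    "left_anti": "left_anti", "leftanti": "left_anti",
    "left_semi": "left_semi", "leftsemi": "left_semi",
    "left_outer_public": "left_outer_public",
}

def normalize_join_type(join_type="inner"):
    # Unknown strings are rejected rather than silently defaulting.
    try:
        return _JOIN_ALIASES[join_type]
    except KeyError:
        raise ValueError(f"unsupported join_type: {join_type!r}")

normalize_join_type("fullouter")  # "outer"
```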

unpersist_join_cache(other, on, right_on=None, join_type='inner', unique_left_keys=False, unique_right_keys=False, left_suffix='', right_suffix='', **kwargs)

Unpersist intermediate caches used by .join().

Parameters
  • NOTE – The arguments must match those of the original .join() call whose intermediate caches are being unpersisted.

  • other (DataSet) – The other DataSet to join with.

  • on (Union[str, List[str]]) – The key(s) to join on. If using suffixes, the suffix must NOT be appended by the caller.

  • right_on (Union[str, List[str], None]) – The key(s) on the right table, if different than those in on. None if both tables have the same key names. If using suffixes, the suffix must NOT be appended by the caller.

  • join_type (str) –

    LeapYear supports the joins listed in the left column of the table below; any join_type value from the right column selects that join. If no value is specified, an inner join is used. A left_outer_public join can be run only when the right table is public.

        Join               join_type
        -----------------  ---------------------------------------------
        Inner              "inner" (default)
        Outer              "outer", "full", "full_outer", or "fullouter"
        Left               "left", "left_outer", or "leftouter"
        Right              "right", "right_outer", or "rightouter"
        Left Antijoin      "left_anti" or "leftanti"
        Left Semijoin      "left_semi" or "leftsemi"
        Left Outer Public  "left_outer_public"

  • unique_left_keys (bool) – If the left DataSet is known to have unique keys, setting this to True will run an optimized join algorithm. Warning: setting this to True when the keys are not unique will cause data loss!

  • unique_right_keys (bool) – If the right DataSet is known to have unique keys, setting this to True will run an optimized join algorithm. Warning: setting this to True when the keys are not unique will cause some rows to be missing from the output DataSet!

  • left_suffix (str) – Optional suffix to append to the column names of the left DataSet.

  • right_suffix (str) – Optional suffix to append to the column names of the right DataSet.

Return type

DataSet

group_by(*grouping)

Aggregate data by one or more categorical columns.

Parameters

grouping (Union[str, Attribute]) – The attribute or attributes to group by, separated by commas.

Returns

A new GroupedData object with groupings as specified. It can be used with the agg function to create a DataSet with derived aggregate attributes; see the examples below.

Return type

GroupedData

Examples

1. Group by multiple columns (‘col1’ and ‘col2’) in Dataset ds and aggregate ‘col3’ and ‘col4’:

>>> ds_group = ds.group_by('col1', 'col2').agg((['col3'], 'count'), (['col4'], 'count'))
2. Group by a single column ‘col1’ in the Dataset and compute aggregates of ‘col3’ and ‘col4’:

>>> ds_group = ds.group_by('col1').agg((['col3'], 'max'), (['col4'], 'mean'))
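In plain-Python terms (an analogy only, not the LeapYear implementation), a group-by aggregation computes one value per group:

```python
# Sketch of what a group-by aggregation computes; 'count', 'max' and
# 'mean' mirror the aggregation names used in the examples above.
from collections import defaultdict

def group_agg(rows, key, value, how):
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    agg = {"count": len, "max": max, "mean": lambda v: sum(v) / len(v)}[how]
    return {k: agg(v) for k, v in groups.items()}

rows = [
    {"col1": "a", "col3": 1},
    {"col1": "a", "col3": 3},
    {"col1": "b", "col3": 5},
]
group_agg(rows, "col1", "col3", "mean")  # {'a': 2.0, 'b': 5.0}
```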

split(index, proportions, complement=False)

Split and select one partition of the dataset.

Selection (σ): Splits are specified by proportions.

Parameters
  • index (int) – The split number

  • proportions (List[int]) – The proportions to split the dataset by, represented as a list of integers. For example, [1,1,2] will split into 3 datasets with 1/4, 1/4 and 1/2 of the data in each, respectively.

  • complement (bool) – If True, returns a DataSet which is the complement of the split (e.g. all rows not in the split). Default: False.

Returns

The i-th partition of the dataset.

Return type

DataSet
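The mapping from a proportions list to fractional sizes can be sketched as follows (split_fractions is an illustrative helper, not the LeapYear API):

```python
# Proportions [1, 1, 2] sum to 4, so the splits hold 1/4, 1/4 and 1/2
# of the rows respectively, matching the description above.
def split_fractions(proportions):
    total = sum(proportions)
    return [p / total for p in proportions]

split_fractions([1, 1, 2])  # [0.25, 0.25, 0.5]
```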

splits(proportions)

Split the dataset and return an iterator over the resulting partitions.

Selection (σ): Splits are specified by proportions.

Parameters

proportions (List[int]) – The proportions to split the dataset by, represented as a list of integers. For example, [1,1,2] will split into 3 datasets with 1/4, 1/4 and 1/2 of the data in each, respectively.

Returns

Iterator over the DataSet objects representing partitions.

Return type

Iterator[DataSet]

Examples

  1. Create an 80/20 split of a DataSet ds1:

>>> traintest, holdout = ds1.splits((8,2))

stratified_split(index, proportions, stratified, complement=False)

Split by stratifying a categorical attribute.

Selection (σ): Each split will contain approximately the same proportion of values from each category.

As an example, for Boolean stratification, each split will contain the same proportion of True/False values.

Parameters
  • index (int) – The split number

  • proportions (List[int]) – The proportions to split the dataset by, represented as a list of integers. For example, [1,1,2] will split into 3 datasets with 1/4, 1/4 and 1/2 of the data in each, respectively.

  • stratified (str) – The column to stratify against. Must be Boolean or Factor. Must not be nullable.

  • complement (bool) – If True, returns a DataSet which is the complement of the split (e.g. all rows not in the split). Default: False.

Returns

The i-th partition of the dataset.

Return type

DataSet
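The stratification idea can be sketched in plain Python: divide each category's rows by the same proportions, so every partition preserves the category prevalence. This is an analogy only; the actual server-side implementation may differ (e.g. randomized row assignment).

```python
# Split row indices by proportions within each class, so each split keeps
# roughly the same class prevalence as the whole dataset.
from collections import defaultdict

def stratified_split_indices(labels, proportions):
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    total = sum(proportions)
    splits = [[] for _ in proportions]
    for indices in by_class.values():
        start = 0
        for s, p in enumerate(proportions):
            end = start + len(indices) * p // total
            splits[s].extend(indices[start:end])
            start = end
        # any remainder rows go to the last split
        splits[-1].extend(indices[start:])
    return splits

labels = [True] * 8 + [False] * 8
splits = stratified_split_indices(labels, [1, 1])
# each split has 8 rows: 4 True and 4 False
```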

stratified_splits(proportions, stratified)

Split by stratifying a categorical attribute.

Selection (σ). For boolean stratification, each split will maintain the same proportion of True/False values.

Parameters
  • proportions (List[int]) – The proportions to split the dataset by, represented as a list of integers. For example, [1,1,2] will split into 3 datasets with 1/4, 1/4 and 1/2 of the data in each, respectively.

  • stratified (str) – The column to stratify against. Must be Boolean or Factor. Must not be nullable.

Returns

Iterator over the DataSet objects representing partitions.

Return type

Iterable[DataSet]

kfold(n_folds=3)

Split the dataset into train/test pairs using k-fold strategy.

Each fold can then be used as a validation set once while k-1 remaining folds form the training set. If the dataset has size N, each (train, test) pair will be sized N*(k-1)/k and N/k respectively.

Parameters

n_folds (int) – Number of folds. Must be at least 2.

Returns

Iterator over the k pairs of (train, test) partitions.

Return type

Iterable[Tuple[DataSet, DataSet]]
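The fold arithmetic can be sketched as follows (kfold_indices is an illustrative helper, not the LeapYear API):

```python
# Fold i serves as the test set while the remaining k-1 folds form the
# training set, giving test sets of ~N/k rows and training sets of
# ~N*(k-1)/k rows, as described above.
def kfold_indices(n, n_folds=3):
    fold_size = n // n_folds
    folds = [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(n_folds)]
    # leftover rows join the last fold
    folds[-1].extend(range(n_folds * fold_size, n))
    for i in range(n_folds):
        test = folds[i]
        train = [j for f in range(n_folds) if f != i for j in folds[f]]
        yield train, test

pairs = list(kfold_indices(12, n_folds=3))
# 3 pairs, each with 8 training rows and 4 test rows
```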

stratified_kfold(stratified, n_folds=3)

Split the dataset into train/test pairs using k-fold stratified splits.

Each fold can then be used as a validation set once while k-1 remaining folds form the training set. If the dataset has size N, each (train, test) pair will be sized N*(k-1)/k and N/k respectively.

Parameters
  • stratified (str) – The column to stratify against. Must be Boolean or Factor. Must not be nullable.

  • n_folds (int) – Number of folds. Must be at least 2.

Returns

The iterator over the k pairs of (train, test) partitions.

Return type

Iterable[Tuple[DataSet, DataSet]]

rows(limit=None)

Retrieve rows.

If the user has permission to do so, the function returns a generator of OrderedDict objects mapping attribute names to the values in each row. The generator requires a connection to the server over its entire lifetime.

Parameters

limit (Optional[int]) – Maximum number of rows to output.

Returns

The iterator over the rows of the input dataset, each row represented as an OrderedDict object mapping attribute names to their values in that row.

Return type

Iterator[Mapping[str, Value]]

rows_pandas(limit=None)

Retrieve rows as a pandas DataFrame.

If the user has permission to do so, the function returns a pandas DataFrame of the attribute names and values from each row. Retrieving the rows requires a connection to the server.

Parameters

limit (Optional[int]) – Maximum number of rows to output.

Returns

The rows of the input dataset.

Return type

DataFrame

head(n=10)

Retrieve up to n rows, if the user has permission to do so; see rows().

Parameters

n (int) – Maximum number of rows to output.

Returns

The iterator over the rows of the input dataset, each row represented as an OrderedDict object mapping attribute names to their values in that row.

Return type

Iterator[Mapping[str, Value]]

head_pandas(n=10)

Retrieve up to n rows as a pandas DataFrame, if the user has permission to do so; see rows_pandas().

Parameters

n (int) – Maximum number of rows to output.

Returns

The rows of the input dataset.

Return type

DataFrame

example_rows()

Retrieve 10 rows of example data from the DataSet.

The returned data is based only on the public metadata. The generated data does not depend on the true data at all and should not be used for data inference.

The function requires an active connection to the LeapYear server.

TEXT attributes are not supported; consider dropping them using drop_attributes() before running example_rows().

Returns

The iterator over the rows of the generated dataset; each row is represented as an OrderedDict object mapping attribute names to their generated values.

Return type

Iterator[Mapping[str, Any]]

Example

>>> pds = DataSet.from_table('db.table')
>>> pds.example_rows()

To turn the data into a pandas DataFrame, use

>>> import pandas
>>> df = pandas.DataFrame.from_dict(pds.example_rows())

Alternatively, use example_rows_pandas().

example_rows_pandas()

Retrieve 10 rows of example data from the DataSet. See example_rows().

Returns

The example rows.

Return type

DataFrame

transform(attrs, transformation, name)

Apply linear transformation to a list of attributes based on a matrix.

Parameters
  • attrs (List[Attribute]) – Expressions to use as input to the matrix multiplication, in order.

  • transformation (List[List[float]]) – Matrix defining the linear transformation via matrix multiplication, e.g. the output of leapyear.analytics.pca().

  • name (str) – Common prefix for the name of the attributes to be created.

Returns

DataSet with transformed attributes appended.

Return type

DataSet
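The underlying computation is an ordinary row-by-matrix product, sketched here in plain Python (apply_transformation is an illustrative helper, not the LeapYear API):

```python
# Each output attribute is a linear combination of the input attributes:
# one column of the transformation matrix per output attribute.
def apply_transformation(rows, matrix):
    n_out = len(matrix[0])
    return [
        tuple(sum(row[i] * matrix[i][j] for i in range(len(row))) for j in range(n_out))
        for row in rows
    ]

# Swap the two input attributes via a permutation matrix
apply_transformation([(1.0, 2.0)], [[0.0, 1.0], [1.0, 0.0]])  # [(2.0, 1.0)]
```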

sample(with_replacement, fraction, seed=None)

Return a sampled subset of the rows.

Parameters
  • with_replacement (bool) – Allow sampling rows with replacement, creating duplicated rows.

  • fraction (float) – Fraction of rows to generate.

  • seed (Optional[int]) – Optional seed of the random number generator. If not supplied, a random seed will be used.

Returns

DataSet containing sampled subset of the rows.

Return type

DataSet

distinct()

Return a DataSet that contains only the unique rows from this Dataset.

Return type

DataSet

drop_duplicates(subset=None)

Return a DataSet with duplicates in the provided columns dropped.

If subset is None or includes all columns, this is equivalent to distinct().

Parameters

subset (Optional[List[str]]) – Subset of columns, or None if all columns should be used.

Return type

DataSet
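The subset-based de-duplication rule can be sketched in plain Python (illustrative only; which duplicate survives is an implementation detail):

```python
# Keep the first row seen for each combination of values in the subset
# columns; with subset=None, all columns are used, matching distinct().
def drop_dup_rows(rows, subset=None):
    seen, out = set(), []
    for row in rows:
        key = tuple(row[c] for c in (subset or sorted(row)))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [{"a": 1, "b": 1}, {"a": 1, "b": 2}]
drop_dup_rows(rows, subset=["a"])  # keeps only the first row
```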

except_(ds)

Return a DataSet of rows in this DataSet but not in another DataSet.

Parameters

ds (DataSet) – DataSet to compare with.

Returns

DataSet containing rows in this DataSet but not in another DataSet.

Return type

DataSet

difference(ds)

Return a DataSet of rows in this DataSet but not in another DataSet.

Parameters

ds (DataSet) – DataSet to compare with.

Returns

DataSet containing rows in this DataSet but not in another DataSet.

Return type

DataSet

symmetric_difference(ds)

Return the symmetric difference of the two DataSets.

Parameters

ds (DataSet) – DataSet to compare with.

Returns

DataSet containing rows that are not in the intersection of the two DataSets.

Return type

DataSet

intersect(ds)

Return intersection of the two DataSets.

Parameters

ds (DataSet) – DataSet to compare with.

Returns

DataSet containing rows that are in the intersection of the two DataSets.

Return type

DataSet

order_by(*attrs)

Order DataSet by the given expressions.

Parameters

attrs (Union[str, Attribute]) – Attribute expressions to order by, separated by commas.

Returns

DataSet sorted by the given expressions.

Return type

DataSet

limit(n)

Limit the number of rows.

Parameters

n (int) – Number of rows to limit to.

Returns

DataSet filtered to the first n rows.

Return type

DataSet

cache(storageLevel=StorageLevel.MEMORY_AND_DISK)

Cache a DataSet on the server side.

Parameters

storageLevel (StorageLevel) – StorageLevel for persisting datasets. The meaning of these values is documented in: https://spark.apache.org/docs/2.4.5/programming-guide.html#rdd-persistence

Returns

A DataSet which instructs the system to cache the data lazily.

Return type

DataSet

unpersist()

Immediately begins unpersisting the DataSet on the server-side.

Returns

Return type

None

Example

Build a cache and unpersist it

>>> la.count_rows(ds.cache()).run()
>>> la.mean(ds["foo"]).run()          # this will hit the cache
>>> ds.unpersist()
>>> la.sum(ds["foo"]).run()           # this will no longer hit the cache

repartition(numPartitions, *attrs)

Repartition a Dataset by hashing the columns.

Returns

DataSet partitioned according to the specified parameters.

Return type

DataSet

sortWithinPartitions(*attrs)

Sort DataSet rows by the provided columns within partitions, not globally.

Returns

DataSet sorted within partitions.

Return type

DataSet

replace(col, replacement)

Replace values that match keys in the replacement map.

Parameters
  • col (str) – The name of the column to work with.

  • replacement (Mapping[Union[float, int, str, date], Union[Attribute, Expression, float, int, str, date]]) – A mapping associating each value to be replaced with the expression that should replace it.

Returns

New Dataset where specified values in a given column are replaced according to the replacement map.

Return type

DataSet

fill(col, value)

Fill missing values in a given column.

Parameters
  • col – The name of the column to work with.

  • value (Union[Attribute, Expression, float, int, str, date]) – Expression to fill NULL values of the column with.

Returns

A new DataSet where NULL values in a given column are replaced with the specified expression.

Return type

DataSet
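The NULL-filling rule, with None standing in for NULL (a plain-Python analogy only, not the LeapYear implementation):

```python
# Replace None values in one column with a fill value; other values and
# other columns are left untouched.
def fill_column(rows, col, value):
    return [{**row, col: value if row[col] is None else row[col]} for row in rows]

fill_column([{"x": None}, {"x": 3}], "x", 0)  # [{'x': 0}, {'x': 3}]
```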

drop(*cols, how='any')

Drop rows where specified columns contain NULL values.

Parameters
  • cols (Union[str, Attribute]) – Attributes to consider when dropping rows.

  • how (str) – If 'any', rows with NULL values in any of the specified columns will be dropped. If 'all', rows with NULL values in all of the specified columns will be dropped.

Returns

A new DataSet with rows dropped where the specified columns contain NULL values (in any or all of them, depending on the how parameter).

Return type

DataSet
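The how='any' versus how='all' semantics, with None standing in for NULL (a plain-Python analogy only, not the LeapYear implementation):

```python
# how='any' drops a row if any listed column is NULL;
# how='all' drops a row only if all listed columns are NULL.
def drop_nulls(rows, cols, how="any"):
    test = any if how == "any" else all
    return [row for row in rows if not test(row[c] is None for c in cols)]

rows = [{"a": None, "b": 1}, {"a": None, "b": None}, {"a": 2, "b": 3}]
drop_nulls(rows, ["a", "b"], how="any")  # keeps only the fully non-NULL row
drop_nulls(rows, ["a", "b"], how="all")  # drops only the all-NULL row
```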

join_pandas(from_col, dataframe, key_col, value_col)

Join a column of pandas data with the data set.

See join_data() for extended details.

Parameters
  • from_col (Union[Attribute, str]) – An expression or attribute name in the data set to join the data to. The type of this column should match the keys in the mapping.

  • dataframe (DataFrame) – The pandas DataFrame that contains values to map to in this data set.

  • key_col (str) – The name of the pandas column containing the keys to match against from_col. If duplicate keys occur, only one is used in the join.

  • value_col (str) – The name of the pandas column to obtain the values of the mapping.

Returns

The original data set with a new column containing analyst-supplied values.

Return type

DataSet

join_data(from_col, new_col, mapping)

Join key-value data to the data set.

Analyst-supplied data can be added to an existing sensitive data set without any loss of privacy. This is a replacement for the decode expression when there are many keys (>100). Together with join_pandas, this function also replaces the functionality found in left_join_csv and left_join_df.

Values matching the keys of the dictionary mapping are replaced by the associated values. Values that do not match any of the keys are replaced by NULL. Keys and values may be python literals supported by the LeapYear client, or other Attributes.

Note

Even if the keys cover all possible values, so that no NULLs should occur, the client will not automatically convert the result of join_data to a non-nullable type. Use coalesce to remove NULL from the domain of possible values.

Parameters
  • from_col (Union[Attribute, str]) – An expression or attribute name in the data set to join the data to. The type of this column should match the keys in the mapping.

  • new_col (str) – The name of the new attribute that is added to the returned data set with the same type as the mapping values.

  • mapping (Mapping[Union[float, int, str, date], Union[float, int, str, date]]) – Python dictionary of key-value pairs to add to the data set. The keys should be unique. The mapping cannot be empty and has a limit of 100,000 keys.

Returns

The original data set with a new column containing analyst-supplied values.

Return type

DataSet

Example

Suppose that we have a table with a Sex column containing the values male and female:

======  ===
Sex     Age
======  ===
male    22
female  38
female  26
female  35
male    35
======  ===

We’d like to encode the abbreviation (the first letter of each value) as a new column, using a mapping of the following form:

======  ============
Sex     first_letter
======  ============
male    m
female  f
======  ============

To do so, we call this function, assuming the LeapYear DataSet is called ds, and we wish to call the new column sex_first_letter:

>>> new_ds = ds.join_data("Sex", "sex_first_letter", {"female": "f", "male": "m"})

This will produce a DataSet which looks like:

======  ===  ================
Sex     Age  sex_first_letter
======  ===  ================
male    22   m
female  38   f
female  26   f
female  35   f
male    35   m
======  ===  ================

predict(model, attrs=None, name='predict')

Return a DataSet with a prediction column for the model.

Parameters
Returns

The original dataset with the prediction column(s) appended.

Return type

DataSet

predict_proba(model, attrs=None, name='proba')

Return a DataSet with prediction probability columns for the model.

Parameters
Returns

The original dataset with the prediction probability column(s) appended.

Return type

DataSet
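
Example

A sketch of appending model predictions, assuming clf is a model previously trained with leapyear.analytics (the names clf and ds are placeholders; the exact model object depends on the analytic used):

>>> ds_pred = ds.predict(clf)                      # appends a 'predict' column
>>> ds_prob = ds.predict_proba(clf, name='proba')  # appends probability column(s)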

Attribute class

class leapyear.dataset.Attribute(expr, relation, ordering=OrderSpec(osDirection=SortDirection.Asc, osNullOrdering=None), name=None)

An attribute of a Dataset.

This exists for transformations to be performed on single attributes of the dataset. For example, the attribute height might be measured in meters, but centimeters might be more appropriate, so an intermediate Attribute can be extracted from the DataSet and manipulated.

All attribute manipulations are lazy; they are not evaluated until they are needed to perform an analysis on the LeapYear server.

classmethod clear_cache()

Clear the memo cache of attribute types.

Return type

None

property type

Get the data type of this attribute.

Return type

Type

property expression

Get the Attribute’s expression.

Return type

Expression

property relation

Get the relation this Attribute originates from.

Return type

Optional[Relation]

property ordering

Get the ordering of this attribute.

Return type

OrderSpec

property name

Get the name or alias of this attribute.

Return type

str

alias(name)

Associate an alias with an attribute.

Parameters

name (str) – The name to associate with an attribute.

Return type

Attribute

is_not()

Return the boolean inverse of the attribute.

Return type

Attribute

sign()

Return the sign of each element (1, 0 or -1).

Return type

Attribute

floor()

Apply the floor transformation to the attribute.

Each attribute value is transformed to the largest integer less than or equal to the attribute value.

Return type

Attribute

ceil()

Apply the ceiling transformation to the attribute.

Each attribute value is transformed to the smallest integer greater than or equal to the attribute value.

Return type

Attribute

exp()

Apply exponent transformation to an attribute.

Return type

Attribute

expm1()

Apply the exponent transformation to an attribute and subtract one (exp(x) - 1).

Return type

Attribute

sqrt()

Apply square root transformation to an attribute.

Return type

Attribute

log()

Apply natural logarithm transformation to an attribute.

Return type

Attribute

log1p()

Add one to the attribute and apply the natural logarithm transformation (log(1 + x)).

Return type

Attribute

sigmoid()

Apply sigmoid transformation to an attribute.

Return type

Attribute

replace(mapping)

Replace specified values with the new values or attribute expressions.

Values matching the keys of the mapping are replaced by the associated values. Values that do not match any of the keys are kept.

Keys and values may be python literals supported by the LeapYear client, or other Attributes.

Parameters

mapping (Mapping[Union[float, int, str, date], AttributeLike]) – A mapping from values of this attribute’s type (T) to another set of values and a different type (U). U may be a python literal type (int, bool, datetime, …) or another Attribute.

Returns

The converted attribute.

Return type

Attribute
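
Example

A sketch remapping a handful of values in place, assuming ds1 contains a text column 'state' (a hypothetical name); unlike decode(), values not listed in the mapping are kept:

>>> ds2 = ds1.with_attributes({'state_std': ds1['state'].replace({'Calif.': 'CA', 'N.Y.': 'NY'})})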

property microsecond

Return the microseconds part of a temporal type.

Return type

Attribute

property second

Return the seconds part of a temporal type.

Return type

Attribute

property minute

Return the minutes part of a temporal type.

Return type

Attribute

property hour

Return the hours part of a temporal type.

Return type

Attribute

property day

Return the days part of a temporal type.

Return type

Attribute

property month

Return the months part of a temporal type.

Return type

Attribute

property year

Return the years part of a temporal type.

Return type

Attribute
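
Example

A sketch extracting date parts from a temporal attribute, assuming ds1 has a DATE column 'signup_date' (a hypothetical name):

>>> ds2 = ds1.with_attributes({
...     'signup_year': ds1['signup_date'].year,
...     'signup_month': ds1['signup_date'].month,
... })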

greatest(attrs)

Take the elementwise maximum.

Return NULL if and only if all attribute values are NULL.

Return type

Attribute

least(attrs)

Take the elementwise minimum.

Return NULL if and only if all entries are NULL.

Return type

Attribute
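
Example

A sketch taking the elementwise maximum and minimum across two real-valued columns 'q1' and 'q2' (hypothetical names, and assuming attrs accepts a list of attributes):

>>> ds2 = ds1.with_attributes({
...     'best': ds1['q1'].greatest([ds1['q2']]),
...     'worst': ds1['q1'].least([ds1['q2']]),
... })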

coalesce(fallthrough, attrs=None)

Convert all NULL values of an attribute to a value.

This function will extend the type of the attribute if necessary and drop the nullable tag from the attribute data type.

Parameters
  • fallthrough (Any) – The value of the same type as this attribute that all NULL values get converted into.

  • attrs (Optional[List[Attribute]]) – The list of attributes to try before the fallthrough case.

Returns

The non-nullable attribute with extended type range (if necessary).

Return type

Attribute

Examples

1. Use coalesce to replace missing values with ‘1’ and create a new attribute ‘col1_trn’:

>>> ds2 = ds1.with_attributes({'col1_trn':ds1['col1'].coalesce('1')})
isnull()

Boolean check whether an attribute value is NULL.

Return a boolean attribute which resolves to True whenever the underlying attribute value is NULL.

Return type

Attribute

notnull()

Boolean inverse of isnull().

Return type

Attribute
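
Example

A sketch keeping only rows where 'col1' is present, and flagging missing values in a new column:

>>> ds2 = ds1.where(ds1['col1'].notnull())
>>> ds3 = ds1.with_attributes({'col1_missing': ds1['col1'].isnull()})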

decode(mapping)

Map specified values to the new values or attribute expressions.

Values matching the keys of the mapping are replaced by the associated values.

Values that do not match any of the keys are replaced by NULL.

Keys and values may be python literals supported by the LeapYear client, or other Attributes.

Even if the combination of keys ensures there should be no NULL values, the client will not automatically convert the result of decode to a non-nullable type. The user must use coalesce to remove NULL from the domain of possible values.

Parameters

mapping (Mapping[Any, Any]) – A mapping from values of this attribute’s type (T) to another set of values and a different type (U). U may be a python literal type (int, bool, datetime, …) or another Attribute.

Returns

The converted attribute.

Return type

Attribute

Examples

1. Create a new column ‘col1_trn’ that is based on values in ‘col1’. If the value matches ‘pattern’, we assign the same string; otherwise we assign ‘Other’, using the decode function:

>>> import leapyear.functions as f
>>> some_func = lambda x: (x != 'pattern').decode({True:'Other', False:'pattern'})
>>> ds2 = ds1.with_attributes({'col1_trn': some_func(f.col('col1'))})

2. Create a new column ‘col1_trn’ that is based on values in ‘col1’. If the value == 0 then ‘col1_trn’ takes the value 0. Otherwise, it takes the value of ‘col2’:

>>> import leapyear.functions as f
>>> some_func = lambda x: (x==0).decode({True:0,False:f.col('col2')})
>>> ds2 = ds1.with_attributes({'col1_trn': some_func(f.col('col1'))})

3. Create a new column ‘col1_trn’ that is based on values in ‘col1’ and ‘col2’. Based on an expression involving ‘col1’ and ‘col2’, ‘col1_trn’ takes the value of ‘col1’ or ‘col2’:

>>> import leapyear.functions as f
>>> def some_func(x, y):
...     return ((x == 0) & (y != 0)).decode({True: f.col('col1'), False: f.col('col2')})
>>> ds2 = ds1.with_attributes({'col1_trn': some_func(f.col('col1'), f.col('col2'))})

is_in(options)

Boolean check whether an attribute value is in a list.

Return a boolean attribute which resolves to True whenever the underlying attribute matches one of the list entries.

Return type

Attribute
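
Example

A sketch filtering rows to a set of values, assuming ds1 contains a text column 'country' (a hypothetical name):

>>> ds2 = ds1.where(ds1['country'].is_in(['US', 'CA', 'MX']))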

text_to_bool()

Convert the attribute to a boolean.

Strings that match (case-insensitively) “1”, “y”, “yes”, “t”, or “true” are converted to True; strings that match (case-insensitively) “0”, “n”, “no”, “f”, or “false” are converted to False. Any other string is treated as NULL.

Return type

Attribute
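
Example

A sketch converting a text column of yes/no responses to boolean, assuming ds1 has a text column 'opted_in' (a hypothetical name):

>>> ds2 = ds1.with_attributes({'opted_in_bool': ds1['opted_in'].text_to_bool()})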

text_to_real(lb, ub)

Convert a text attribute to a real-valued attribute.

Parameters
  • lb (float) – The lower bound for the domain of possible values of the new attribute. All values below this will be converted to NULL.

  • ub (float) – The upper bound for the domain of possible values of the new attribute. All values above this will be converted to NULL.

Return type

Attribute
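
Example

A sketch parsing a text column of prices into a real-valued attribute bounded between 0 and 10000 (hypothetical column name; out-of-range and unparsable values become NULL):

>>> ds2 = ds1.with_attributes({'price_real': ds1['price_text'].text_to_real(0.0, 10000.0)})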

text_to_factor(distinct_vals_list)

Convert a text attribute to a factor attribute.

Parameters

distinct_vals_list – The list of distinct values the input text column takes. Rows with values that are not in this list will be converted to NULL.

Return type

Attribute
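
Example

A sketch converting a text column with a known set of levels into a factor, assuming ds1 has a text column 'state' (a hypothetical name; values outside the list become NULL):

>>> ds2 = ds1.with_attributes({'state_fac': ds1['state'].text_to_factor(['CA', 'NY', 'TX'])})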

text_to_int(lb, ub)

Convert a text attribute to an integer-valued attribute.

NOTE: integers close to MAX_INT64 or MIN_INT64 are not represented precisely.

Parameters
  • lb (int) – The lower bound for the domain of possible values of the new attribute. All values below this will be converted to NULL.

  • ub (int) – The upper bound for the domain of possible values of the new attribute. All values above this will be converted to NULL.

Return type

Attribute
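
Example

A sketch parsing a text column of ages into integers on [0, 120] (hypothetical column name; values outside the bounds become NULL):

>>> ds2 = ds1.with_attributes({'age_int': ds1['age_text'].text_to_int(0, 120)})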

as_real()

Convert the attribute to a real.

Return type

Attribute

as_factor()

Represent this attribute as a factor attribute.

Converts the attribute of type INT or BOOL to FACTOR.

Note: For more complex conversions, consider decode().

Examples

  1. Convert an attribute ‘col1’ in ds1 to factor:

>>> ds2 = ds1.with_attributes({'col1_fac':ds1['col1'].as_factor()})
Return type

Attribute

asc()

Sort ascending.

Causes the attribute to be sorted in ascending order when the sorting order is applied to the dataset.

Examples

  1. Order a dataset by multiple cols and drop duplicates:

>>> ds2 = ds1.order_by(ds1['col1'].asc(),ds1['col2'].asc(),ds1['col3'].asc())
>>> ds2 = ds2.drop_duplicates(['col2'])
Return type

Attribute

asc_nulls_first()

Sort ascending, missing values first.

Causes the attribute to be sorted in ascending order when the sorting order is applied to the dataset, with NULL values first.

Return type

Attribute

asc_nulls_last()

Sort ascending, missing values last.

Causes the attribute to be sorted in ascending order when the sorting order is applied to the dataset, with NULL values last.

Return type

Attribute

desc()

Sort descending.

Causes the attribute to be sorted in descending order when the sorting order is applied to the dataset.

Return type

Attribute

desc_nulls_first()

Sort descending, missing values first.

Causes the attribute to be sorted in descending order when the sorting order is applied to the dataset, with NULL values first.

Return type

Attribute

desc_nulls_last()

Sort descending, missing values last.

Causes the attribute to be sorted in descending order when the sorting order is applied to the dataset, with NULL values last.

Return type

Attribute

Aliases

leapyear.dataset.attribute.AttributeLike(x)

Attribute-like objects are those that can be readily converted to an attribute.

These include existing attributes, dates, strings, integers, floats and expressions based on these objects.

Grouping and Windowing classes

class leapyear.dataset.GroupedData(grouping, rel)

The result of a DataSet.group_by().

Run agg() to get a DataSet.

agg(*attr_aggs, max_count=None, **kwargs)

Specify the aggregations to perform on the GroupedData.

Parameters
  • attr_aggs (Tuple[List[Union[Attribute, str]], Union[str, Aggregation]]) –

    A list of pairs. The second element of the pair is the aggregation to perform; the first is a list of columns on which to perform it. This is a list and not just a single attribute to support nullary and binary aggregations.

    Available aggregations:

    • min

    • max

    • and

    • or

    • count_distinct

    • approx_count_distinct

    • count

    • mean

    • stddev

    • stddev_samp

    • stddev_pop

    • variance

    • var_samp

    • var_pop

    • skewness

    • kurtosis

    • sum

    • sum_distinct

    • covar_samp

    • covar_pop

    • corr

  • max_count (Optional[int]) –

If a count aggregate is requested, this parameter is used as the upper bound in the schema of the derived aggregate count attribute(s) of the grouped DataSet. If not supplied, the upper bound will be inferred from the data before the resulting DataSet object is returned.

    Note

When this parameter is set too high, differentially private computations on the derived aggregate attribute will include more randomization. When it is set too low, all counts higher than max_count will be replaced by max_count.

  • kwargs (Any) – All keyword arguments are passed to the function run when the parameter max_count has to be inferred from the data. This includes max_timeout_sec, which defaults to 300 seconds.

Returns

A DataSet object, containing aggregation results.

Return type

DataSet

Example

To compute correlation of height and weight as well as the count, run:

>>> gd.agg((['height','weight'], 'corr'), ([], 'count'))
class leapyear.dataset.Window

Utility functions for defining WindowSpec.

Examples

>>> # ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
>>> window = Window.order_by("date").rows_between(
...     Window.unbounded_preceding, Window.current_row,
... )
>>> # PARTITION BY country ORDER BY date ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
>>> window = Window.order_by("date").partition_by("country").rows_between(-3, 3)

classmethod partition_by(*cols)

Create a WindowSpec with the partitioning defined.

Parameters

cols (AttributeLike) – each attribute that should be used to partition windows. When referencing an attribute by name, use leapyear.functions.col.

Return type

WindowSpec

classmethod order_by(*cols)

Create a WindowSpec with the ordering defined.

Parameters

cols (AttributeLike) – each attribute that should be used to order within a window. When referencing an attribute by name, use leapyear.functions.col.

Return type

WindowSpec

classmethod rows_between(start, end)

Create a WindowSpec with the frame boundaries defined.

Create a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive). Both start and end are positions relative to the current row: “0” means the current row, “-1” means the row before the current row, and “5” means the fifth row after the current row. We recommend using Window.unbounded_preceding, Window.unbounded_following, and Window.current_row to specify special boundary values, rather than using integer values directly.

Note: windows of 1000 rows or more are not currently supported for expressions other than lead and lag.

Parameters
  • start (int) – boundary start, inclusive. The frame is unbounded if this is Window.unbounded_preceding, or any value less than or equal to -9223372036854775808.

  • end (int) – boundary end, inclusive. The frame is unbounded if this is Window.unbounded_following, or any value greater than or equal to 9223372036854775807.

Return type

WindowSpec
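
Example

A sketch building a WindowSpec that combines the three classmethods above, assuming hypothetical column names 'country' and 'date':

>>> import leapyear.functions as f
>>> w = (
...     Window.partition_by(f.col('country'))
...     .order_by(f.col('date'))
...     .rows_between(Window.unbounded_preceding, Window.current_row)
... )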