Getting Started¶
Connecting to LeapYear and Exploring¶
The first step in using LeapYear’s data security platform for analysis is getting connected. To get started, we’ll import the Client object from the leapyear Python library and connect to the LeapYear server using our user credentials.
Credentials used for this tutorial:
>>> import os
>>> url = 'http://localhost:{}'.format(os.environ.get('LY_PORT', 4408))
>>> username = 'tutorial_user'
>>> password = 'abcdefghiXYZ1!'
Import the Client object:
>>> from leapyear import Client
Create a connection:
>>> client = Client(url, username, password)
>>> client.connected
True
>>> client.close()
>>> client.connected
False
Alternatively, Client can be used as a context manager, so the connection is closed automatically at the end of a with block:
>>> with Client(url, username, password) as client:
...     # carry out computations with connection to LeapYear
...     client.connected
True
>>> client.connected
False
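The with-block behaviour above follows Python's standard context manager protocol. The sketch below illustrates that protocol only: ToyConnection is a hypothetical stand-in and says nothing about how LeapYear's Client is implemented. It shows why a with block is the safer pattern: the connection is closed even if the code inside the block raises.

```python
class ToyConnection:
    """Hypothetical stand-in for a client; NOT the LeapYear Client class."""

    def __init__(self):
        self.connected = True  # pretend the connection opens on construction

    def close(self):
        self.connected = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions raised inside the block


with ToyConnection() as conn:
    inside = conn.connected  # still open inside the block

after = conn.connected  # closed once the block has exited

# The connection is also closed when the block raises.
try:
    with ToyConnection() as failing:
        raise RuntimeError("analysis failed")
except RuntimeError:
    closed_after_error = not failing.connected
```

With an explicit close() call, an exception raised before that call would leave the connection open unless the code is wrapped in try/finally.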
Databases, Tables and Columns¶
Once we’ve obtained a connection to LeapYear, we can look through the databases and tables that are available for data analysis:
>>> client = Client(url, username, password)
Examine databases available to the user:
>>> client.databases.keys()
dict_keys(['tutorial'])
>>> tutorial_db = client.databases['tutorial']
>>> tutorial_db
<Database tutorial>
Examine tables within the database tutorial:
>>> sorted(tutorial_db.tables.keys())
['classification',
'regression1',
'regression2',
'twoclass']
>>> example1 = tutorial_db.tables['regression1']
>>> example1
<Table regression1>
Examine the columns on table tutorial_db.regression1:
>>> example1.columns
{'x0': <Column x0: Type(baseType=BaseType.REAL(-4, 4), nullable=False)>,
'x1': <Column x1: Type(baseType=BaseType.REAL(-4, 4), nullable=False)>,
'x2': <Column x2: Type(baseType=BaseType.REAL(-4, 4), nullable=False)>,
'y': <Column y: Type(baseType=BaseType.REAL(-400, 400), nullable=False)>}
Column Types¶
The Column class holds the metadata for a table attribute. This includes the type, which is one of BOOL, INT, REAL, FACTOR, DATE, TEXT or DATETIME, in addition to NULL_* varieties of each. The NULL_* varieties support missing data.
Unlike Python types, all types in the LeapYear system (except BOOL, TEXT and their NULL_* variants) have publicly available bounds on the data source. These are typically the lower and upper limits of the data in the column. The exceptions are the FACTOR and NULL_FACTOR types, which are categorical; their elements are represented by strings and do not support ordering, hence they have no lower or upper bounds. Instead, all possible values that a categorical type can take are explicitly listed in the type. An example of a FACTOR type with categories “A”, “B” and “C” is shown below.
>>> from leapyear.admin import Type
>>> Type.FACTOR({'A', 'B', 'C'})
Type(baseType=BaseType.FACTOR({...}), nullable=False)
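To make the distinction between bounded and categorical types concrete, here is a minimal sketch in plain Python. RealBounds and FactorLevels are hypothetical illustration classes, not part of the leapyear library; they only model the idea that a REAL type carries a public lower and upper bound, while a FACTOR type carries an explicit, unordered set of allowed strings.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class RealBounds:
    """Hypothetical model of a bounded REAL type (not a LeapYear class)."""
    lower: float
    upper: float

    def contains(self, value: float) -> bool:
        return self.lower <= value <= self.upper


@dataclass(frozen=True)
class FactorLevels:
    """Hypothetical model of a FACTOR type: an explicit, unordered set of strings."""
    levels: FrozenSet[str]

    def contains(self, value: str) -> bool:
        return value in self.levels


x0_type = RealBounds(-4, 4)  # mirrors REAL(-4, 4) from the column listing above
grade_type = FactorLevels(frozenset({"A", "B", "C"}))

in_range = x0_type.contains(2.5)           # inside the public bounds
out_of_range = x0_type.contains(7.0)       # outside the public bounds
known_level = grade_type.contains("B")     # one of the listed categories
unknown_level = grade_type.contains("D")   # not a listed category
```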
The DataSet Class¶
Once we’ve established a connection to the LeapYear server using the Client class, we can import the DataSet class to access and analyze tables.
>>> from leapyear import DataSet
We can access a table either through the client interface, as above:
>>> ds_example1 = DataSet.from_table(example1)
or by directly referencing the table by name:
>>> ds_example1 = DataSet.from_table('tutorial.regression1')
The DataSet class is the primary way of interacting with data in the LeapYear system. A DataSet is associated with a collection of Attributes, which can be used to compute statistics. The DataSet class allows the user to manipulate and analyze the attributes of a data source using a variety of relational operations such as column selection, row selection based on conditions, unions, joins, etc.
An instance of the Attribute class represents either an individual named column in the DataSet or a transformation of one or several such columns via supported operations.
Attributes, like columns, have the types BOOL, INT, REAL, DATE, TIME, DATETIME and FACTOR, plus additional NULL_* varieties, which can contain missing data.
Attributes can be manipulated using most built-in Python operations, such as +, *, and abs.
>>> ds_example1.schema
Schema([('x0', Type(baseType=BaseType.REAL(-4, 4), nullable=False)),
('x1', Type(baseType=BaseType.REAL(-4, 4), nullable=False)),
('x2', Type(baseType=BaseType.REAL(-4, 4), nullable=False)),
('y', Type(baseType=BaseType.REAL(-400, 400), nullable=False))])
>>> attr_x0 = ds_example1['x0']
>>> attr_x0
<leapyear.dataset.attribute.Attribute object at 0x...>
>>> attr_x0 + 4
<leapyear.dataset.attribute.Attribute object at 0x...>
In the following example, we’ll take a few attributes from the table tutorial.regression1, adding one to the x1 attribute and multiplying x2 by three. The bounds in the schema are updated to reflect the changes: x1 moves from (-4, 4) to (-3, 5) and x2 from (-4, 4) to (-12, 12).
>>> ds1 = ds_example1.map_attributes(
...     {'x1': lambda att: att + 1.0, 'x2': lambda att: att * 3.0}
... )
>>> ds1.schema
Schema([('x0', Type(baseType=BaseType.REAL(-4, 4), nullable=False)),
('x1', Type(baseType=BaseType.REAL(-3, 5), nullable=False)),
('x2', Type(baseType=BaseType.REAL(-12, 12), nullable=False)),
('y', Type(baseType=BaseType.REAL(-400, 400), nullable=False))])
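The way the bounds change in the schema above is consistent with simple interval arithmetic. The following sketch reproduces the two results in plain Python. The helper functions are hypothetical and only illustrate the arithmetic; they are not a description of how the LeapYear server computes bounds.

```python
def shift_bounds(bounds, offset):
    """Bounds of x + offset, given the bounds of x."""
    lower, upper = bounds
    return (lower + offset, upper + offset)


def scale_bounds(bounds, factor):
    """Bounds of x * factor; a negative factor swaps the endpoints."""
    lower, upper = bounds
    scaled = (lower * factor, upper * factor)
    return (min(scaled), max(scaled))


x1_bounds = shift_bounds((-4, 4), 1.0)   # x1 + 1.0
x2_bounds = scale_bounds((-4, 4), 3.0)   # x2 * 3.0
flipped = scale_bounds((-4, 4), -0.5)    # endpoints swap for a negative factor
```

Here x1_bounds and x2_bounds match the REAL(-3, 5) and REAL(-12, 12) entries shown in the schema.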
We can use DataSet to filter the data and examine subsets of it, e.g. by applying predicates:
>>> ds2 = ds_example1.where(ds_example1['x1'] > 1)
>>> ds2.schema
Schema([('x0', Type(baseType=BaseType.REAL(-4, 4), nullable=False)),
('x1', Type(baseType=BaseType.REAL(1, 4), nullable=False)),
('x2', Type(baseType=BaseType.REAL(-4, 4), nullable=False)),
('y', Type(baseType=BaseType.REAL(-400, 400), nullable=False))])
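In the schema above, the filter x1 > 1 raises the lower bound of x1 from -4 to 1, while the other columns keep their bounds because the predicate says nothing about them. A minimal sketch of that narrowing rule, using a hypothetical helper (raising an error for an unsatisfiable predicate is a choice made for this illustration, not documented LeapYear behaviour):

```python
def narrow_greater_than(bounds, threshold):
    """Bounds of x after keeping only rows where x > threshold."""
    lower, upper = bounds
    if threshold >= upper:
        # Illustration-only choice: flag predicates no value can satisfy.
        raise ValueError("no value within the bounds can satisfy the predicate")
    return (max(lower, threshold), upper)


x1_after = narrow_greater_than((-4, 4), 1)     # lower bound rises to 1
unchanged = narrow_greater_than((-4, 4), -10)  # threshold below range: no change
```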
Data Analysis¶
Statistics¶
The LeapYear system is designed to allow access to various statistical functions and to develop machine learning models based on the data in a DataSet. An analytics function is not executed until the run() method is called on it. This allows inspection of the overall workflow and early reporting of errors. All analysis functions are located in the leapyear.analytics module.
>>> import leapyear.analytics as analytics
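The build-then-run pattern described above can be sketched in plain Python. DeferredAnalysis and the local mean helper are hypothetical and unrelated to the classes in leapyear.analytics; the sketch only shows the general idea that constructing an analysis can validate its inputs immediately, while the computation itself waits for run().

```python
class DeferredAnalysis:
    """Hypothetical deferred computation; not a leapyear.analytics class."""

    def __init__(self, name, func, values):
        if not values:
            # Problems that can be detected up front are reported at build time.
            raise ValueError("{} needs at least one value".format(name))
        self.name = name
        self._func = func
        self._values = list(values)
        self.executed = False

    def run(self):
        """Perform the computation only when explicitly requested."""
        self.executed = True
        return self._func(self._values)


def mean(values):
    """Build (but do not run) a mean computation over local values."""
    return DeferredAnalysis("mean", lambda v: sum(v) / len(v), values)


analysis = mean([1.0, 2.0, 3.0, 6.0])
built_only = analysis.executed   # nothing has been computed yet
result = analysis.run()          # the computation happens here
ran = analysis.executed
```

Calling mean([]) in this sketch fails immediately at construction, before any run() call, which is the kind of early error reporting the deferred design makes possible.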
Many common statistics functions are available, including the mean, variance and quantile functions used in the example below, which obtains simple statistics from the dataset:
>>> mean_analysis = analytics.mean('x0', ds_example1)
>>> mean_analysis.run()
0.039159280186637294
>>> variance_analysis = analytics.variance('x0', ds_example1)
>>> variance_analysis.run()
1.0477940098374177
>>> quantile_analysis = analytics.quantile(0.25, 'x0', ds_example1)
>>> quantile_analysis.run()
-0.6575000000000001
By combining statistics with the ability to transform and filter data, we can compute statistics on subsets of the data:
>>> analytics.mean('x0', ds_example1).run()
0.039159280186637294
>>> ds2 = ds_example1.where(ds_example1['x1'] > 1)
>>> analytics.mean('x0', ds2).run()
0.14454229785771325
Machine Learning¶
The leapyear.analytics module also supports various machine learning (ML) models and tools, including:
regression-based models (linear, logistic, generalized),
tree-based models (random forests for classification and regression tasks),
unsupervised models (e.g. K-means, PCA),
the ability to optimize model hyperparameters via search with cross-validation, and
the ability to evaluate model performance based on a variety of common validation metrics.
In this section we will share some examples of the machine learning tools provided by the LeapYear system.
The Effect of L2 Regularization on Model Coefficients¶
The following example code shows a common theoretical result from ML: as the L2 regularization parameter alpha increases, we see the coefficients of the model gradually approach zero. This is depicted in the graph generated below:
>>> import numpy as np
>>> n_alphas = 20
>>> alphas = np.logspace(-2, 2, n_alphas)
>>>
>>> # example3 has 0 and 1 in the y column. Here, we convert 1 to True and 0 to False
>>> ds_example3 = DataSet\
... .from_table('tutorial.classification')\
... .map_attribute('y', lambda att: att.decode({1: True}).coalesce(False))
>>>
>>> models = []
>>> for alpha in alphas:
...     model = analytics.generalized_logreg(
...         ['x0','x1','x2','x3','x4','x5','x6','x7','x8','x9'],
...         'y',
...         ds_example3,
...         affine=False,
...         l1reg=0.001,
...         l2reg=alpha
...     ).run()
...     models.append(model)
>>>
>>> coefs = np.array([m.coefficients + [m.intercept] for m in models]).reshape((n_alphas,11))
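The label conversion at the top of the code above chains two attribute operations. Going by the accompanying comment, we assume that decode maps the listed values and leaves every other value missing, and that coalesce then replaces missing values with a default. That reading is inferred from the example and has not been checked against the LeapYear API reference. Under that assumption, the same logic in plain Python looks like the sketch below; both helper functions are hypothetical.

```python
def decode(value, mapping):
    """Map a value through `mapping`; unmatched values become None (missing)."""
    return mapping.get(value)


def coalesce(value, default):
    """Replace a missing value (None) with `default`."""
    return default if value is None else value


raw_labels = [1, 0, 1, 1, 0]
# 1 is decoded to True; 0 is unmatched, becomes missing, and is filled with False.
bool_labels = [coalesce(decode(v, {1: True}), False) for v in raw_labels]
```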
Plotting the coefficients with respect to alpha values:
>>> import matplotlib.pyplot as plt
>>> plt.figure()
>>> plt.plot(alphas, coefs)
>>> plt.xscale('log')
>>> plt.xlabel('alpha')
>>> plt.ylabel('weights')
>>> plt.title('coefficients as a function of the regularization')
>>> plt.axis('tight')
>>> plt.show()
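The shrinkage effect can also be seen without a server in a much simpler setting than the regularized logistic regression above. The sketch below uses a one-feature linear model with no intercept, where the L2-penalized least-squares slope has the closed form w = Σxy / (Σx² + alpha). The data are hypothetical and the model is deliberately not the one used in the tutorial; it only illustrates why larger alpha values pull coefficients towards zero.

```python
def ridge_slope(xs, ys, alpha):
    """Closed-form minimiser of sum((y - w*x)**2) + alpha * w**2 over w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


# Hypothetical toy data, roughly y = 2x.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
slopes = [ridge_slope(xs, ys, a) for a in alphas]

# Each increase in alpha enlarges the denominator, so the slope shrinks.
shrinking = all(a > b for a, b in zip(slopes, slopes[1:]))
```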

Training a Simple Logistic Regression Model¶
This example shows how to train a logistic regression classifier and evaluate its performance using the receiver operating characteristic (ROC) curve.
>>> ds_train = ds_example3.split(0, [80, 20])
>>> ds_test = ds_example3.split(1, [80, 20])
>>> glm = analytics.generalized_logreg(['x1'], 'y', ds_train, affine=True, l1reg=0, l2reg=0.01).run()
>>> cc = analytics.roc(glm, ['x1'], 'y', ds_test, thresholds=32).run()
Plot the ROC and display the area under the ROC:
>>> plt.figure()
>>> plt.plot(cc.fpr, cc.tpr, label='ROC curve (area = %0.2f)' % cc.auc_roc)
>>> plt.plot([0, 1], [0, 1], 'k--')
>>> plt.xlabel('False Positive Rate')
>>> plt.ylabel('True Positive Rate')
>>> plt.title('Receiver operating characteristic example')
>>> plt.legend(loc="lower right")
>>> plt.show()
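For readers unfamiliar with how the false positive rate, true positive rate and area under the curve relate, the following sketch computes them by hand for a small set of hypothetical scores and labels. It is a plain-Python illustration of the standard definitions and does not describe how analytics.roc computes its result.

```python
def roc_points(scores, labels, thresholds):
    """Return (fpr, tpr) lists, one point per threshold, highest threshold first."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    fpr, tpr = [], []
    for t in sorted(thresholds, reverse=True):
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y)
        fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the curve by the trapezoidal rule; fpr must be non-decreasing."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area


# Hypothetical classifier scores and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, False, True, False, True, False, False]

# A threshold above every score and one at 0 pin the curve to (0, 0) and (1, 1).
thresholds = [1.1, 0.85, 0.75, 0.65, 0.5, 0.35, 0.25, 0.15, 0.0]
fpr, tpr = roc_points(scores, labels, thresholds)
auc = auc_trapezoid(fpr, tpr)
```

Lowering the threshold can only add predicted positives, so both rates are non-decreasing along the curve, which is what makes the trapezoidal sum valid.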

Training a Random Forest¶
In this example we train a random forest classifier on a binary classification problem built from two overlapping Gaussian distributions centered at (0,0) and (3,3). Points around (0,0) are labeled as the negative class, while points around (3,3) are labeled as the positive class.
>>> ds_example4 = DataSet.from_table('tutorial.twoclass')
>>> rf = analytics.random_forest(['x1', 'x2'], 'y', ds_example4, 100, 1).run()
>>> plot_colors = "br"
>>> plot_step = 0.1
>>>
>>> x_min, x_max = 1.5-8, 1.5+8
>>> y_min, y_max = 1.5-8, 1.5+8
>>> xx, yy = np.meshgrid(
...     np.arange(x_min, x_max, plot_step),
...     np.arange(y_min, y_max, plot_step)
... )
>>> Z = rf.predict(np.c_[xx.ravel(), yy.ravel()])
>>> Z = Z.reshape(xx.shape)
Plot the decision boundary:
>>> fig, ax = plt.subplots()
>>> plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
>>> # Draw circles centered at the gaussian distributions
>>> ax.add_artist(plt.Circle((0,0), 1.5, color='k', fill=False))
>>> ax.add_artist(plt.Circle((3,3), 1.5, color='k', fill=False))
>>> ax.text(3, 3, '+')
>>> ax.text(0, 0, '-')
>>> plt.xlabel('x')
>>> plt.ylabel('y')
>>> plt.title('Decision Boundary')
>>> plt.show()

This concludes the user tutorial section, so the connection should be closed.
>>> client.close()
>>> client.connected
False
Management and Administration¶
Administration tasks use the Client class from the leapyear module and admin classes from the leapyear.admin module. The admin classes used in this section are User, Database and Table.
These classes provide APIs for various administrator tasks on the LeapYear system. All of the examples in this section require the appropriate permissions.
Managing the LeapYear Server¶
Management requires sufficient privileges. The examples below assume that the root user is an administrator of the LeapYear deployment and that ROOT_PASSWORD holds that user’s password.
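>>> # Import path assumed from the leapyear.admin description above
>>> from leapyear.admin import Database, Table, User
>>> client = Client(url, 'root', ROOT_PASSWORD)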
>>> client.connected
True
User Management¶
User objects are the primary API for managing users. Below is an example in which a user is created, their password is updated, and finally their account is disabled.
>>> # Create the user
>>> user = User('new_user', password)
>>> client.create(user)
>>> 'new_user' in client.users
True
>>>
>>> # Update the user's password
>>> new_password = '{}100'.format(password)
>>> user.update(password=new_password)
<User new_user>
>>>
>>> # Disable the user
>>> user.enabled
True
>>>
>>> user.enabled = False
>>> user.enabled
False
Database Management¶
Database objects are used to view and manipulate databases on the server.
>>> # create database
>>> client.create(Database('sales'))
>>>
>>> # retrieve a reference to the database
>>> sales_database = client.databases['sales']
>>>
>>> # drop database
>>> client.drop(sales_database)
Table Management¶
Table objects are used to view and manipulate tables in a database on the server. Below is an example of how to define a data source (table) object on the LeapYear server.
>>> credentials = 'hdfs:///path/to/data.parquet'
>>>
>>> # create a table
>>> accounts = Database('accounts')
>>> table = Table('users', credentials=credentials, database=accounts)
>>>
>>> client.create(accounts)
>>> client.create(table)
>>>
>>> # retrieve a reference to the table
>>> users_table = accounts.tables['users']
>>>
>>> # drop a table
>>> client.drop(users_table)