Getting Started

Connecting to LeapYear and Exploring

The first step to using LeapYear’s data security platform for analysis is getting connected. To get started, we’ll import the Client object from the leapyear Python library and connect to the LeapYear server using our user credentials.

Credentials used for this tutorial:

>>> import os
>>> url = 'http://localhost:{}'.format(os.environ.get('LY_PORT', 4408))
>>> username = 'tutorial_user'
>>> password = 'abcdefghiXYZ1!'

Import the Client object:

>>> from leapyear import Client

Create a connection:

>>> client = Client(url, username, password)
>>> client.connected
True
>>> client.close()
>>> client.connected
False

Alternatively, Client can be used as a context manager, so the connection is automatically closed at the end of a with block:

>>> with Client(url, username, password) as client:
...     # carry out computations with connection to LeapYear
...     client.connected
True
>>> client.connected
False
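
The close-on-exit behaviour can be illustrated with a minimal sketch of the context manager protocol. FakeConnection is a hypothetical stand-in written for this illustration only; it is not the LeapYear Client and makes no claim about how Client is implemented.

```python
class FakeConnection:
    """Hypothetical stand-in for a client object; not the LeapYear Client."""

    def __init__(self):
        # Pretend the connection is opened on construction.
        self.connected = True

    def close(self):
        self.connected = False

    def __enter__(self):
        # The object bound by "as" in the with statement.
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called on normal exit and when an exception escapes the block.
        self.close()
        return False  # do not swallow exceptions


with FakeConnection() as conn:
    inside = conn.connected
after = conn.connected

# The connection is also closed when the block raises.
try:
    with FakeConnection() as failing:
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass
closed_after_error = not failing.connected
```

The with form is usually preferable in scripts because the connection is released even if an analysis step fails part way through.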

Databases, Tables and Columns

Once we’ve obtained a connection to LeapYear, we can look through the databases and tables that are available for data analysis:

>>> client = Client(url, username, password)

Examine databases available to the user:

>>> client.databases.keys()
dict_keys(['tutorial'])
>>> tutorial_db = client.databases['tutorial']
>>> tutorial_db
<Database tutorial>

Examine tables within the database tutorial:

>>> sorted(tutorial_db.tables.keys())  
['classification',
 'regression1',
 'regression2',
 'twoclass']
>>> example1 = tutorial_db.tables['regression1']
>>> example1
<Table regression1>

Examine the columns on table tutorial_db.regression1:

>>> example1.columns  
{'x0': <Column x0: Type(baseType=BaseType.REAL(-4, 4),     nullable=False)>,
 'x1': <Column x1: Type(baseType=BaseType.REAL(-4, 4),     nullable=False)>,
 'x2': <Column x2: Type(baseType=BaseType.REAL(-4, 4),     nullable=False)>,
 'y':  <Column y:  Type(baseType=BaseType.REAL(-400, 400), nullable=False)>}

Column Types

The Column type gives the metadata for the table attribute. These include the type, which is one of BOOL, INT, REAL, FACTOR, DATE, TEXT or DATETIME, in addition to NULL_* varieties of each. The NULL_* varieties support missing data.

Unlike Python types, all types in the LeapYear system (except BOOL, TEXT and their NULL variants) have publicly available bounds on the data source. These are typically the lower and upper limit of the data in the column. The exceptions to this are the FACTOR and NULL_FACTOR types, which are categorical; their elements are represented by strings and do not support ordering, hence there are no lower or upper bounds. Instead, all possible values that a categorical type can take are explicitly listed in the type. An example of a FACTOR type with categories “A”, “B” and “C” is shown below.

>>> from leapyear.admin import Type
>>> Type.FACTOR({'A', 'B', 'C'})
Type(baseType=BaseType.FACTOR({...}), nullable=False)

The DataSet Class

Once we’ve established a connection to the LeapYear server using the Client class, we can import the DataSet to access and analyze tables.

>>> from leapyear import DataSet

We can access tables either by using the client interface as above:

>>> ds_example1 = DataSet.from_table(example1)

or by directly referencing the table by name:

>>> ds_example1 = DataSet.from_table('tutorial.regression1')

The DataSet class is the primary way of interacting with data in the LeapYear system. A DataSet is associated with a collection of Attributes, which can be used to compute statistics. The DataSet class allows the user to manipulate and analyze the attributes of a data source using a variety of relational operations such as column selection, row selection based on conditions, unions and joins.

An instance of the Attribute class represents either an individual named column in the DataSet or a transformation of one or several such columns via supported operations. Attributes, like columns, have types BOOL, INT, REAL, DATE, TIME, DATETIME, and FACTOR, plus additional NULL_* varieties, which can contain missing data. Attributes can be manipulated using most built-in Python operations, such as +, *, and abs.

>>> ds_example1.schema  
Schema([('x0', Type(baseType=BaseType.REAL(-4, 4), nullable=False)),
        ('x1', Type(baseType=BaseType.REAL(-4, 4), nullable=False)),
        ('x2', Type(baseType=BaseType.REAL(-4, 4), nullable=False)),
        ('y', Type(baseType=BaseType.REAL(-400, 400), nullable=False))])
>>> attr_x0 = ds_example1['x0']
>>> attr_x0
<leapyear.dataset.attribute.Attribute object at 0x...>
>>> attr_x0 + 4
<leapyear.dataset.attribute.Attribute object at 0x...>

In the following example, we’ll take a few attributes from the table tutorial.regression1, adding one to the x1 attribute and multiplying x2 by three. The bounds are altered to reflect the change.

>>> ds1 = ds_example1.map_attributes(
...     {'x1': lambda att: att + 1.0, 'x2': lambda att: att * 3.0}
... )
>>> ds1.schema  
Schema([('x0', Type(baseType=BaseType.REAL(-4, 4), nullable=False)),
        ('x1', Type(baseType=BaseType.REAL(-3, 5), nullable=False)),
        ('x2', Type(baseType=BaseType.REAL(-12, 12), nullable=False)),
        ('y', Type(baseType=BaseType.REAL(-400, 400), nullable=False))])
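
The bound changes in the schema above follow ordinary interval arithmetic. The sketch below reproduces the two results with a small, hypothetical Bounds class; it illustrates the arithmetic only and is not how LeapYear represents types internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    """Hypothetical closed interval [lower, upper] for a numeric column."""

    lower: float
    upper: float

    def shift(self, c):
        # Adding a constant moves both ends by the same amount.
        return Bounds(self.lower + c, self.upper + c)

    def scale(self, c):
        # Multiplying by a negative constant swaps the ends, so re-order them.
        lo, hi = self.lower * c, self.upper * c
        return Bounds(min(lo, hi), max(lo, hi))


x = Bounds(-4.0, 4.0)
x1_bounds = x.shift(1.0)  # matches REAL(-3, 5) for x1 above
x2_bounds = x.scale(3.0)  # matches REAL(-12, 12) for x2 above
```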

We can use DataSet to filter the data to examine subsets of the data, e.g. by applying predicates to the data:

>>> ds2 = ds_example1.where(ds_example1['x1'] > 1)
>>> ds2.schema  
Schema([('x0', Type(baseType=BaseType.REAL(-4, 4), nullable=False)),
        ('x1', Type(baseType=BaseType.REAL(1, 4), nullable=False)),
        ('x2', Type(baseType=BaseType.REAL(-4, 4), nullable=False)),
        ('y', Type(baseType=BaseType.REAL(-400, 400), nullable=False))])
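
Note that the filter tightened the lower bound of x1 from -4 to 1 while leaving the other columns unchanged. A minimal sketch of that narrowing rule, using a hypothetical helper function that is not part of the leapyear library:

```python
def narrow_greater_than(lower, upper, threshold):
    """Bounds of a column after keeping only rows with value > threshold.

    Hypothetical helper for illustration; returns a (lower, upper) tuple.
    """
    if threshold >= upper:
        # No value inside the original bounds can satisfy the predicate.
        raise ValueError("predicate excludes every value in the bounds")
    # The upper bound is untouched; the lower bound can only move up.
    return (max(lower, threshold), upper)


narrowed = narrow_greater_than(-4, 4, 1)     # the x1 > 1 case above
unchanged = narrow_greater_than(-4, 4, -10)  # threshold below the bounds
```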

Data Analysis

Statistics

The LeapYear system is designed to give access to various statistical functions and to develop machine learning models based on the data in a DataSet. An analytics function is not executed until the run() method is called on it. This allows inspection of the overall workflow and early reporting of errors. All analysis functions are located in the leapyear.analytics module.

>>> import leapyear.analytics as analytics
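
The build-then-run pattern can be sketched in plain Python. DeferredMean below is a hypothetical class written for this illustration; it only shows why separating construction from execution lets errors surface before any computation starts, and it does not reflect LeapYear's internals.

```python
class DeferredMean:
    """Hypothetical deferred analysis; not a LeapYear class."""

    def __init__(self, column, rows):
        # Cheap validation happens here, before any computation is run.
        if not rows:
            raise ValueError("no rows to analyze")
        if column not in rows[0]:
            raise KeyError("unknown column: {}".format(column))
        self.column = column
        self.rows = rows

    def run(self):
        # The actual work is only done when run() is called.
        values = [row[self.column] for row in self.rows]
        return sum(values) / len(values)


rows = [{"x0": 1.0}, {"x0": 2.0}, {"x0": 3.0}]
analysis = DeferredMean("x0", rows)  # nothing is computed yet
result = analysis.run()              # the mean is computed here
```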

Many common statistics functions are available. The following example obtains some simple statistics (a mean, a variance and a quantile) from the dataset:

>>> mean_analysis = analytics.mean('x0', ds_example1)
>>> mean_analysis.run()
0.039159280186637294
>>> variance_analysis = analytics.variance('x0', ds_example1)
>>> variance_analysis.run()
1.0477940098374177
>>> quantile_analysis = analytics.quantile(0.25, 'x0', ds_example1)
>>> quantile_analysis.run()
-0.6575000000000001

By combining statistics with the ability to transform and filter data, we can compute statistics for subsets of the data:

>>> analytics.mean('x0', ds_example1).run()
0.039159280186637294
>>> ds2 = ds_example1.where(ds_example1['x1'] > 1)
>>> analytics.mean('x0', ds2).run()
0.14454229785771325

Machine Learning

The leapyear.analytics module also supports various machine learning (ML) models, including

  • regression-based models (linear, logistic, generalized),

  • tree-based models (random forests for classification and regression tasks),

  • unsupervised models (e.g. K-means, PCA),

  • the ability to optimize model hyperparameters via search with cross-validation, and

  • the ability to evaluate model performance based on a variety of common validation metrics.

In this section we will share some examples of the machine learning tools provided by the LeapYear system.

The Effect of L2 Regularization on Model Coefficients

The following example code shows a common theoretical result from ML: as the L2 regularization parameter alpha increases, we see the coefficients of the model gradually approach zero. This is depicted in the graph generated below:

>>> import numpy as np
>>>
>>> n_alphas = 20
>>> alphas = np.logspace(-2,2, n_alphas)
>>>
>>> # example3 has 0 and 1 in the y column. Here, we convert 1 to True and 0 to False
>>> ds_example3 = DataSet\
...    .from_table('tutorial.classification')\
...    .map_attribute('y', lambda att: att.decode({1: True}).coalesce(False))
>>>
>>> models = []
>>> for alpha in alphas:
...     model = analytics.generalized_logreg(
...         ['x0','x1','x2','x3','x4','x5','x6','x7','x8','x9'],
...         'y',
...         ds_example3,
...         affine=False,
...         l1reg=0.001,
...         l2reg=alpha
...     ).run()
...     models.append(model)
>>>
>>> coefs = np.array([m.coefficients + [m.intercept] for m in models]).reshape((n_alphas,11))

Plotting the coefficients with respect to alpha values:

>>> import matplotlib.pyplot as plt
>>> plt.figure()
>>> plt.plot(alphas, coefs)
>>> plt.xscale('log')
>>> plt.xlabel('alpha')
>>> plt.ylabel('weights')
>>> plt.title('coefficients as a function of the regularization')
>>> plt.axis('tight')
>>> plt.show()

Figure: L2 coefficients
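
The same shrinkage effect can be reproduced locally without a LeapYear server. The sketch below swaps in ridge (L2-regularized) linear regression, which has a closed-form solution, in place of the regularized logistic regression used above. The synthetic data and the ridge_coefficients helper are assumptions made for this illustration.

```python
import numpy as np

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=200)


def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution: (X^T X + alpha * I)^{-1} X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


alphas = np.logspace(-2, 4, 7)
# Euclidean norm of the coefficient vector for each regularization strength.
norms = [float(np.linalg.norm(ridge_coefficients(X, y, a))) for a in alphas]
```

For ridge regression the norm of the coefficient vector does not increase as alpha grows, so norms decreases along the list of alphas.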

Training a Simple Logistic Regression Model

This example shows how to compute a logistic regression classifier and evaluate its performance using the receiver operating characteristic (ROC) curve.

>>> ds_train = ds_example3.split(0, [80, 20])
>>> ds_test = ds_example3.split(1, [80, 20])
>>> glm = analytics.generalized_logreg(['x1'], 'y', ds_train, affine=True, l1reg=0, l2reg=0.01).run()
>>> cc = analytics.roc(glm, ['x1'], 'y', ds_test, thresholds=32).run()

Plot the ROC and display the area under the ROC:

>>> plt.figure()
>>> plt.plot(cc.fpr, cc.tpr, label='ROC curve (area = %0.2f)' % cc.auc_roc)
>>> plt.plot([0, 1], [0, 1], 'k--')
>>> plt.xlabel('False Positive Rate')
>>> plt.ylabel('True Positive Rate')
>>> plt.title('Receiver operating characteristic example')
>>> plt.legend(loc="lower right")
>>> plt.show()

Figure: ROC curve
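
For readers unfamiliar with the quantities being plotted, here is a minimal NumPy sketch of how the false positive rate, true positive rate and area under the curve can be computed from scores and labels. The scores, labels and helper names are made up for illustration, and the sketch uses every observed score as a threshold rather than a fixed number of thresholds as in the analytics.roc call above.

```python
import numpy as np


def roc_points(scores, labels):
    """Return (fpr, tpr) arrays, sweeping the threshold from high to low."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    # Start above every score so the curve begins at (0, 0).
    thresholds = np.concatenate(([np.inf], np.sort(scores)[::-1]))
    fpr, tpr = [], []
    for t in thresholds:
        predicted = scores >= t
        tpr.append(np.sum(predicted & labels) / np.sum(labels))
        fpr.append(np.sum(predicted & ~labels) / np.sum(~labels))
    return np.array(fpr), np.array(tpr)


def auc_trapezoid(fpr, tpr):
    """Area under the curve by the trapezoid rule (fpr must be non-decreasing)."""
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
fpr, tpr = roc_points(scores, labels)
auc = auc_trapezoid(fpr, tpr)
```

With these eight points, 13 of the 16 positive/negative pairs are ranked correctly, so the area is 13/16 = 0.8125.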

Training a Random Forest

In this example we train a random forest classifier on a binary classification problem based on two overlapping Gaussian distributions centered at (0,0) and (3,3). Points around (0,0) are labeled as the negative class, while points around (3,3) are labeled as the positive class.

>>> ds_example4 = DataSet.from_table('tutorial.twoclass')
>>> rf = analytics.random_forest(['x1', 'x2'], 'y', ds_example4, 100, 1).run()

>>> plot_colors = "br"
>>> plot_step = 0.1
>>>
>>> x_min, x_max = 1.5-8, 1.5+8
>>> y_min, y_max = 1.5-8, 1.5+8
>>> xx, yy = np.meshgrid(
...     np.arange(x_min, x_max, plot_step),
...     np.arange(y_min, y_max, plot_step)
... )
>>> Z = rf.predict(np.c_[xx.ravel(), yy.ravel()])
>>> Z = Z.reshape(xx.shape)

Plot the decision boundary:

>>> fig, ax = plt.subplots()
>>> plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
>>> # Draw circles centered at the gaussian distributions
>>> ax.add_artist(plt.Circle((0,0), 1.5, color='k', fill=False))
>>> ax.add_artist(plt.Circle((3,3), 1.5, color='k', fill=False))
>>> ax.text(3, 3, '+')
>>> ax.text(0, 0, '-')
>>> plt.xlabel('x')
>>> plt.ylabel('y')
>>> plt.title('Decision Boundary')

Figure: Decision boundary
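
The grid evaluation used for the plot can be tried out without a trained model. The sketch below replaces the random forest with a hypothetical nearest-center rule. Under the assumption, not stated in this tutorial, that both Gaussians have the same isotropic covariance and equal class frequencies, this rule gives the ideal boundary, the line x + y = 3.

```python
import numpy as np

# Class centers from the example: negative at (0, 0), positive at (3, 3).
centers = np.array([[0.0, 0.0], [3.0, 3.0]])


def nearest_center_label(points):
    """True where a point is closer to (3, 3) than to (0, 0)."""
    # Squared distance from each point to each center: shape (n_points, 2).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2[:, 1] < d2[:, 0]


# linspace gives an exact number of grid points per axis.
xs = np.linspace(-6.5, 9.5, 161)
ys = np.linspace(-6.5, 9.5, 161)
xx, yy = np.meshgrid(xs, ys)

# Flatten the grid into an (n_points, 2) array, predict, then restore the grid shape.
grid_points = np.c_[xx.ravel(), yy.ravel()]
Z = nearest_center_label(grid_points).reshape(xx.shape)
```

The resulting Z has the same shape as xx and yy, which is what contourf expects.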

This concludes the user tutorial section, so the connection should be closed.

>>> client.close()
>>> client.connected
False

Management and Administration

Administration tasks use the Client class from the leapyear module and admin classes from the leapyear.admin module, such as the User, Database and Table classes used in the examples below.

These classes provide APIs for various administrator tasks on the LeapYear system. All of the examples in this section require the correct permissions.

Managing the LeapYear Server

Management requires sufficient privileges. The examples below assume the root user is an administrator of the LeapYear deployment system.

>>> client = Client(url, 'root', ROOT_PASSWORD)
>>> client.connected
True

User Management

User objects are the primary API for managing users. In the example below, a user is created, their password is updated, and finally their account is disabled.

>>> from leapyear.admin import Database, Table, User
>>>
>>> # Create the user
>>> user = User('new_user', password)
>>> client.create(user)
>>> 'new_user' in client.users
True
>>>
>>> # Update the user's password
>>> new_password = '{}100'.format(password)
>>> user.update(password=new_password)
<User new_user>
>>>
>>> # Disable the user
>>> user.enabled
True
>>>
>>> user.enabled = False
>>> user.enabled
False

Database Management

Database objects are used to view and manipulate databases on the server.

>>> # create database
>>> client.create(Database('sales'))
>>>
>>> # retrieve a reference to the database
>>> sales_database = client.databases['sales']
>>>
>>> # drop database
>>> client.drop(sales_database)

Table Management

Table objects are used to view and manipulate tables in a database on the server. Below is an example of how to define a data source (table) object on the LeapYear server.

>>> credentials = 'hdfs:///path/to/data.parquet'
>>>
>>> # create a table
>>> accounts = Database('accounts')
>>> table = Table('users', credentials=credentials, database=accounts)
>>>
>>> client.create(accounts)
>>> client.create(table)
>>>
>>> # retrieve a reference to the table
>>> users_table = accounts.tables['users']
>>>
>>> # drop a table
>>> client.drop(users_table)