2.3 ETL Pipeline Tutorial

How to automate data preparation for machine learning algorithms? A simple ETL pipeline tutorial using Kaholo, MySQL and SQL queries.

 

1. Introduction
Why is data preparation so problematic and time-consuming?

Cleaning big data earned the dubious title of “Most Time-Consuming, Least Enjoyable Data Science Task” in a BI analyst’s life, according to the CrowdFlower survey. Why? Let me tell you.

Imagine a handful of datasets from various sources. One is in JSON format, another is a CSV file, and another is stored in an SQL table. Each contains some of the information we need, but the reports generated by our connected applications often include many columns of data that our algorithm doesn’t need. They are not something we currently want to explore, yet they are part of the source data.

For these very practical reasons, we want to make creating tailored datasets as effortless as possible. With automation tools, we can do this faster than by hand: once we connect the appropriate database and complete a short configuration, the Kaholo pipeline can be reused for a wide variety of data transformations with just a few tweaks.

In this short tutorial, I’ll show you how to create one universal automation workflow for many datasets. We’ll build an ETL pipeline using Kaholo, a MySQL database and SQL queries, configure it to work for various datasets, and schedule it to execute regularly as part of a machine learning pipeline. So… let’s go!

 

What are ETL pipelines?

ETL stands for Extract-Transform-Load.

It’s a set of tasks that gets your data into the shape you need. The “Transform” stage is the main focus of this tutorial.

 

What are the steps to get your data ready for a Machine Learning algorithm?

Although we’ll create all three stages of ETL in this pipeline, the middle stage – data transformation – deserves the most attention. This is where different combinations of data transformations are applied. Because most machine learning algorithms require data in a very specific format, datasets usually need preparation before they can provide useful information. If data is missing or incorrect, the algorithm returns less accurate or even misleading results.

For this reason, we will manipulate the data according to the 4 basic principles of data preparation for machine learning algorithms:

  •   Combining data from various sources
  •   Cleaning data
  •   Removing duplicate data
  •   Creating dummy variables to convert categorical data into numerical data. Otherwise, mixing the two data types can be problematic.

We will execute these operations with SQL queries (MSSQL or MySQL) using Kaholo. If you prefer a NoSQL query language, Kaholo can also execute queries against MongoDB or retrieve data from AWS S3.

 

2. What is needed?

You need a running MySQL database. To follow along with this tutorial, you can use the free service db4free.net or set up your own database on localhost.

To create a connection string for a MySQL database, you need the following information:

  •   Database name 
  •   Database user id and password
  •   Server address
  •   Port number

An example connection string looks like this:

 Server=myServerAddress;Database=myDataBase;Uid=myUsername;Pwd=myPassword;
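If your database does not listen on the default port, a Port key can be included as well. Here is a purely illustrative example (3306 is MySQL’s default port), using the same keys as the connection string function later in this tutorial:

Server=myServerAddress;Port=3306;Database=myDataBase;Uid=myUsername;Pwd=myPassword;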

To learn more about generating connection strings, see the MySQL documentation.

 

3. Step by step tutorial with Kaholo

In this example, two datasets are used – “scores” and “genders”. We will download the data from the public Kaholo repository.

The table named “scores” contains the points scored by individuals in a bowling game. The “genders” dataset consists of categorical data in string format – M for male and F for female. The “scores” table is located in the MySQL database, while the “genders” data is in CSV format.
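If you want to get a feel for the raw data before transforming it, you can inspect the source table directly with standard MySQL statements, for example from your SQL client:

-- Inspect the source "scores" table before transforming it
DESCRIBE scores;                  -- column names and types
SELECT * FROM scores LIMIT 10;    -- a sample of the raw rows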

The raw data contains empty values, non-numeric entries, and duplicate rows.

 

 The final effect we are aiming for looks like the following:

 

 

Let’s get started!

First, if you’re not already a Kaholo user, create a free account, then create a new project and a new pipeline.

I. Clone the CSV file

Add the Git plugin onto the canvas. Choose the Clone method, which copies the data from our GitHub repository to the Kaholo agent. It is a public repository, so credentials are not required; you can leave the “Username” and “Password (Vault)” fields blank.

Specify the URL of the repository, https://github.com/getkaholo/ETL, in the “Repository” field, and enter “/data” as the local folder to clone into in the “Clone Path” field. The plugin automatically creates the new directory on the agent. It is best to specify a path that does not yet exist, to avoid conflicts with other pipeline executions.

If you wish to check whether the repository was cloned properly, you can add the following temporary step.

Add the CommandLine plugin onto the canvas. Choose the method called “execute command”. To execute a command in the data folder, type in the “Working Directory” field of the plugin:

/data

In the “Command” field of the plugin type the command below:

ls

Click on the Execute button and you’ll automatically be redirected to the Execution Results page. The output should list the contents of the cloned repository, including the genders.csv file.

You can delete this step after performing the check.

 

II. Create configuration

Go to the “configuration” tab and insert the following code:

{
   "server": "",
   "port": "",
   "userId": "",
   "password": "",
   "database": ""
}

This JSON object specifies the connection details for your database. Configurations let us store different databases so we can execute the same pipeline in various scenarios. Fill in all the empty fields and click Save.

 

III. Create a connection string function

This JavaScript function builds a connection string for the MySQL database from the information you provided in the JSON Configuration. It also lets you link multiple databases to Kaholo: create multiple configurations and switch between them. The list of all configurations can be found next to the “Execute” button in Kaholo.

Go to the main “code” layer and insert the following code which will generate a MySQL connection string dynamically when the pipeline is executed using the Configuration that you select:

function getConnectionString() {
   const config = kaholo.execution.configuration;
   // Build the string on a single line so no stray newlines or indentation end up in it
   return `Server=${config.server};Port=${config.port};Database=${config.database};Uid=${config.userId};Pwd=${config.password};`;
}

 

IV. Connect to a database

Add the MySQL plugin onto the canvas. Switch the Connection String parameter to code by clicking on the Code toggle switch; this allows us to fill the value dynamically at execution time. Then set the parameter to getConnectionString() to invoke the function we created on the main code layer. This makes it easy to update the connection details in one place for all instances of this Action in the pipeline.

 

V. Create a new table named “genders”

Add another MySQL plugin onto the canvas. This step creates an empty table into which the data from the CSV file will be inserted. The structure of the “genders” table needs to match the CSV file. Choose the method called “Execute Query” and, with the “Connection String (Vault)” field switched to Code, enter the getConnectionString() function as in the previous step. Copy the query below and paste it into the “Query String” field of the Action.

CREATE TABLE `genders` ( `PersonID` INT NOT NULL , `Gender` VARCHAR(1) NOT NULL )

 

VI. Insert data into the “genders” table

As the MySQL documentation explains, the LOAD DATA LOCAL INFILE statement reads data from a text file and imports it into a database table very quickly.

Before importing the file, you need to prepare the following:

  •   A CSV file whose data matches the number of columns of the table and the type of data in each column.
  •   An account connecting to the MySQL database server that has the FILE and INSERT privileges (a quick way to check is sketched right after this list).
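A hedged way to verify these prerequisites from your MySQL client – SHOW GRANTS and the local_infile system variable are standard MySQL, but changing local_infile requires administrative rights and your hosting provider may not allow it:

-- Check which privileges the connecting account has (look for FILE and INSERT)
SHOW GRANTS FOR CURRENT_USER();

-- LOAD DATA LOCAL INFILE also requires the local_infile option to be enabled
SHOW VARIABLES LIKE 'local_infile';
-- SET GLOBAL local_infile = 1;   -- uncomment to enable it, if you have admin rights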

 

Add the following SQL query to the field “Query String” of the plugin:

LOAD DATA LOCAL INFILE '/data/genders.csv'
INTO TABLE genders
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;

 

The fields in the file are separated by commas, as indicated by FIELDS TERMINATED BY ','.

Each line of the CSV file ends with a newline character, as indicated by LINES TERMINATED BY '\n'.

Because the first line of the file contains the column headings, which we don’t want imported into the table, we skip it with the IGNORE 1 ROWS option.
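To confirm that the import worked, a quick sanity check like the following (standard MySQL) can be run in your SQL client, or one statement at a time in a cloned “Execute Query” Action:

SELECT COUNT(*) AS imported_rows FROM genders;   -- total rows loaded from the CSV
SELECT * FROM genders LIMIT 5;                   -- a peek at the first few rows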

The effect of this step looks like the following:

 

VII. Data preparation 
1. Combine the data from various sources

In this Action, we’ll create a new table named “scores_clone1”. Because a plain SELECT query only displays the result of a transformation, we store each intermediate result in a cloned table so that the next steps of the pipeline can build on it.

Clone the “Insert Data” MySQL Action by right-clicking it and choosing the “Clone” option, so the new Action uses the same “Execute Query” method and connection string.

Using a JOIN, we’ll consolidate data from the database table with the data previously loaded from the CSV file. To do that, insert the following query in the “Query String” parameter of the new “Combine Datasets” Action.

CREATE TABLE scores_clone1 AS
SELECT a.PersonID, a.PersonName, a.Score, b.Gender
FROM scores a
JOIN genders b ON a.PersonID = b.PersonID;
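Note that an inner JOIN keeps only the people that appear in both tables. If you would rather keep every row from “scores” even when no gender is recorded – and let the cleaning step decide what to do with the missing values – a LEFT JOIN variant could be used in place of the query above (same tables and columns):

CREATE TABLE scores_clone1 AS
SELECT a.PersonID, a.PersonName, a.Score, b.Gender
FROM scores a
LEFT JOIN genders b ON a.PersonID = b.PersonID;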

The effect of this transformation looks like the following:

 

 

 

2. Clean the data – delete rows with empty fields

As is common with real-world datasets, we need to remove the rows with missing data. Otherwise, the ML algorithm could produce incorrect results.

This step of the pipeline removes all the rows that have blank or NULL fields. Clone the previously used MySQL plugin, so it uses the same “Execute Query” method and connection string.

Insert the following query in the “Query String” parameter in the new “Delete rows with empty fields” Action:

 DELETE FROM scores_clone1 WHERE PersonName IS NULL OR PersonName = '' OR Score IS NULL OR Score = '' OR Gender IS NULL OR Gender = '';
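To verify that no incomplete rows remain, you can re-run the same conditions as a count; this small sanity check should return 0:

SELECT COUNT(*) AS remaining_incomplete
FROM scores_clone1
WHERE PersonName IS NULL OR PersonName = ''
   OR Score IS NULL OR Score = ''
   OR Gender IS NULL OR Gender = '';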

 

 

The effect of this transformation looks like the following:

 

 

3. Remove duplicate data

Duplicate data is another problem that can easily distort the results returned by the ML algorithm. By processing duplicates, we may be working with a dataset whose characteristics (mean, median, variance) are skewed away from those of the real dataset, so its results would have no application in the real world.

Using the DISTINCT keyword, we’ll pass only the unique rows into the next clone table, which we’ll name scores_clone2.
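If you first want to see which rows are duplicated, a grouping query like this sketch lists every row that appears more than once:

SELECT PersonID, PersonName, Score, Gender, COUNT(*) AS occurrences
FROM scores_clone1
GROUP BY PersonID, PersonName, Score, Gender
HAVING COUNT(*) > 1;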

Clone the previously used MySQL plugin, so it uses the same “Execute Query” method and connection string. Insert the following query in the “Query String” parameter in the new “Remove Duplicates” Action:

CREATE TABLE scores_clone2 AS SELECT DISTINCT * FROM scores_clone1;

 

The effect of this transformation looks like the following:

 

 

4. Create dummy variables

 A dummy variable is one that takes only the value 0 or 1. Sometimes it is useful to convert a categorical variable to a set of indicators that just show the presence or the absence of the values of the variable.

To do that, we will create a new column in the table for each distinct value and use the IF function. In our example, a male row gets Male = 1 and Female = 0, while a female row gets Male = 0 and Female = 1.

 Dummy variables are widely used in statistics and econometrics, particularly in regression analysis. Encoding all of the categorical variables as numerical dummy variables allows easy interpretation and calculation of the odds ratios, and increases the stability and significance of the coefficients.

Clone the previously used MySQL plugin, so it uses the same “Execute Query” method and connection string.

Insert the following query in the “Query String” parameter of the new “Create dummy variables” Action. It creates a new table called scores_clone3 and converts the gender categories into numbers:

CREATE TABLE scores_clone3 AS
SELECT scores_clone2.*, a.Male, a.Female
FROM scores_clone2
JOIN (
    SELECT PersonID,
           SUM(IF(Gender = 'M', 1, 0)) AS Male,
           SUM(IF(Gender = 'F', 1, 0)) AS Female
    FROM scores_clone2
    GROUP BY PersonID
) a ON scores_clone2.PersonID = a.PersonID;
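The query above derives the indicator columns through a grouped subquery. Assuming each person appears with a single gender value, a simpler sketch computes the dummy columns directly per row – use it instead of, not in addition to, the query above, since both create scores_clone3:

CREATE TABLE scores_clone3 AS
SELECT scores_clone2.*,
       IF(Gender = 'M', 1, 0) AS Male,   -- 1 when the row is male, otherwise 0
       IF(Gender = 'F', 1, 0) AS Female  -- 1 when the row is female, otherwise 0
FROM scores_clone2;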

 

The effect of this transformation looks like the following:

 

VIII. Rename the table

Now we can rename the “scores_clone3” table to “result_table”.

Clone the previously used MySQL plugin, so it uses the same “Execute Query” method and connection string.

Insert the following query in the “Query String” parameter in the new “Rename Table” Action:

RENAME TABLE scores_clone3 TO result_table;

IX. Delete cloned tables

Finally, we’ll delete all the cloned tables created as part of this pipeline that we no longer need.

Clone the previously used MySQL plugin, so it uses the same “Execute Query” method and connection string. Insert the following query in the “Query String” parameter in the new “Delete Tables” Action:

DROP TABLE IF EXISTS scores_clone1,scores_clone2,scores_clone3;
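If you want to confirm that only the final table remains, SHOW TABLES (standard MySQL) lists what is left in the database; after this step you should see result_table and no scores_clone tables:

SHOW TABLES;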

X. Set the Microsoft Teams notification

If you want to be informed every time your pipeline prepares data, you can set up a Microsoft Teams or Slack notification. To do that, find the “Teams” plugin in the Assets panel on the left side and drag and drop it onto the canvas. Choose the method called “Send Incoming Webhook” and provide the URL in the “Webhook URL” field of the Action.

Insert the message you wish to receive in the “Message” field of the Action.

You can read more about webhooks in the Microsoft Documentation.

 

4. Schedule a Pipeline Execution

Using Kaholo, you can easily set a recurring execution of the above pipeline or select a specific date when you want it to run. To do that, go to the “Calendar” tab in the menu on the left and choose which project and pipeline you want to schedule. Choose the “Repeated” option and then click on the cron field to generate a Cron expression.
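For example, assuming the calendar accepts standard five-field cron syntax, an expression like the following would run the pipeline every day at 02:00:

0 2 * * *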

 

5. Results

The result of the data transformation is a dataset that can be freely used for training an ML model, or preparing a report:

 

 

6. Summary

As shown in this tutorial, data preparation for ML training or regular reporting is a time-consuming but easily automated task. Using simple SQL queries, a database, and Kaholo, we created a pipeline that regularly delivers data in the format most convenient for our needs. Given the complexity of IT systems, machine learning is well suited to rapidly analyzing massive amounts of data distributed across multiple data stores, identifying patterns, detecting anomalies and making predictions. The time saved through automation can be spent on deeper analysis of the results or on better decision-making strategies for the company.

After transforming the data, we can also trigger another pipeline that would run an ML training process. If you’re using AWS or GCP, you can do this via the respective CLI plugins or use the AWS S3 plugin to load the prepared data.