Create Your First ETL Pipeline with Python

Data Engineering Fundamentals with Python

When it comes to pursuing a career in Data, and specifically Data Engineering or many other tech-related fields, Python is a powerful tool. As you move ahead in your profession, this programming language will serve you well in many ways.

Without further ado, let’s dive into the fundamentals of Python that are needed to create your first ETL Pipeline!

A Demonstration of the ETL Process using Python

It may be helpful to walk through a bare-bones, real example of building an ETL pipeline to gain a better understanding of the subject. Along the way, you will also see how easy Python is to work with.

Create a file called etl.py in the text editor of your choice and add the following docstring.

"""
Python Extract Transform Load Example
"""

We will begin with a basic ETL pipeline consisting of the essential elements needed to extract the data, transform it, and finally load it into the right place. None of this is as complex as it might seem, even if you are a complete beginner.

As we go along, you will see how easy it is to use Python for building such ETL pipelines.
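
At a high level, the pipeline we are about to build boils down to three functions wired together. Here is a rough sketch of the overall shape (just an outline; the real function bodies follow in the next sections):

def extract():
    # pull raw data from the API
    ...

def transform(data):
    # reshape and filter the data with pandas
    ...

def load(df):
    # write the result into a SQLite database
    ...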

Importing the Right Packages

import requests
import pandas as pd
from sqlalchemy import create_engine

At this step, you will see how resourceful Python is as a tool: it carries an ecosystem of libraries for data and general programming that makes it painless and effective to use. Importing the right libraries is the first step to building anything in Python.

Here we need three libraries. The first is requests, which helps us pull data from an API and handles the extraction step. Next is pandas, a library for transforming and manipulating data. Think of it as Excel on steroids, with the difference that everything is driven by code: pandas can read data in Excel and CSV formats (basically, anything tabular), and we can easily transform it. The last on the list is SQLAlchemy, which lets us create a connection to a database (in this example, a SQLite database).

Step 1: Extract

def extract() -> list:
    """Extract university data from the
    http://universities.hipolabs.com API.
    """
    API_URL = "http://universities.hipolabs.com/search?country=United+States"
    # Call the API and parse the JSON response (a list of university records)
    data = requests.get(API_URL).json()
    return data

The first step of the pipeline is the extract. As shown in the snippet, we are extracting from an API source that is freely available for use. The sample query retrieves information about universities in the United States. When we call this API, it returns the data to us in JSON format, as in the sample shown below.

[{"web_pages": ["http://www.marywood.edu"], "state-province": null, "alpha_two_code": "US", "name": "Marywood University", "country": "United States", "domains": ["marywood.edu"]}, {"web_pages": ["http://www.lindenwood.edu/"], "state-province": null, "alpha_two_code": "US", "name": "Lindenwood University", "country": "United States", "domains": ["lindenwood.edu"]}, {"web_pages": ["https://sullivan.edu/"], "state-province": null, "alpha_two_code": "US", "name": "Sullivan University", "country": "United States", "domains": ["sullivan.edu"]}, {"web_pages": ["https://www.fscj.edu/"], "state-province": null, "alpha_two_code": "US", "name": "Florida State College at Jacksonville", "country": "United States", "domains": ["fscj.edu"]}, {"web_pages": ["https://www.xavier.edu/"], "state-province": null, "alpha_two_code": "US", "name": "Xavier University", "country": "United States", "domains": ["xavier.edu"]}, {"web_pages": ["https://home.tusculum.edu/"], "state-province": null, "alpha_two_code": "US", "name": "Tusculum College", "country": "United States", "domains": ["tusculum.edu"]}, {"web_pages": ["https://cst.edu/"], "state-province": null, "alpha_two_code": "US", "name": "Claremont School of Theology", "country": "United States", "domains": ["cst.edu"]}]

The data we get here is the raw JSON structure returned by the HTTP call. We get this data as output from the URL:

http://universities.hipolabs.com/search?country=United+States

We use the requests library to perform the extraction and obtain the response as JSON, which becomes a list of dictionaries in Python.
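
As a quick, optional sanity check (illustrative only, not part of the pipeline), you can inspect what extract() hands back:

data = extract()
print(type(data))       # <class 'list'> - the JSON array becomes a Python list
print(len(data))        # number of universities returned by the API
print(data[0]["name"])  # e.g. "Marywood University", as in the sample above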

Step 2: Transform

def transform(data: list) -> pd.DataFrame:
    """Transform the dataset into the desired structure and filter it."""
    df = pd.DataFrame(data)
    print(f"Total number of universities from API {len(data)}")
    # Keep only universities whose name contains "California"
    df = df[df["name"].str.contains("California")]
    print(f"Number of universities in California {len(df)}")
    # Flatten the list-valued columns into comma-separated strings
    df['domains'] = [','.join(map(str, l)) for l in df['domains']]
    df['web_pages'] = [','.join(map(str, l)) for l in df['web_pages']]
    df = df.reset_index(drop=True)
    return df[["domains", "country", "web_pages", "name"]]

This step is mainly about transforming the data into the right format and shape. In practice, transformations are driven by particular business conditions and requirements. For this sample, we have assumed a hypothetical requirement: we are searching for universities in California.

First, the data arrives as a list of dictionaries, which we read into a pandas DataFrame. What does pandas give us here? A DataFrame is a tabular data structure: think of it as a CSV with rows and columns plus a lot of added functionality. It is a comprehensive tool when it comes to transforming data.
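
To make that concrete, here is a tiny, self-contained illustration (not part of the pipeline) of how a list of dictionaries becomes a DataFrame:

import pandas as pd

sample = [
    {"name": "Example University", "country": "United States"},
    {"name": "Another College", "country": "United States"},
]
df = pd.DataFrame(sample)  # one row per dictionary, one column per key
print(df.shape)            # (2, 2) - two rows, two columns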

Next, we filter the DataFrame down to the universities whose name contains "California" (the str.contains line in the snippet). As mentioned before, pandas is like Excel on steroids: a simple, readable syntax covers the filtering and transformations we need, as the small breakdown below shows.
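
Broken into smaller pieces (illustrative only; the transform function above does the equivalent in one line), the filter works by building a boolean mask and then indexing with it:

mask = df["name"].str.contains("California")  # boolean Series: True where "California" appears in the name
california_df = df[mask]                      # keep only the rows where the mask is True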

Step 3: Load

def load(df: pd.DataFrame) -> None:
    """Load the DataFrame into a SQLite database."""
    disk_engine = create_engine('sqlite:///my_lite_store.db')
    # Write to the 'cal_uni' table, replacing it if it already exists
    df.to_sql('cal_uni', disk_engine, if_exists='replace')

The last step of the pipeline writes to a SQLite database file on disk (created if it does not already exist); in a real project, this could just as well be a database running on a host or server. We save the DataFrame into a table using df.to_sql, a method of the pandas DataFrame, and tell the SQLite engine that if the table already exists it should 'replace' it (as highlighted in the snippet above).
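
If you want to confirm that the load worked, one simple check (a sketch, assuming the pipeline above has already run) is to read the table straight back out of the same file with pandas:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///my_lite_store.db')
check = pd.read_sql('SELECT name, domains FROM cal_uni LIMIT 5', engine)  # peek at the first few rows
print(check)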

With the load complete, the transformed data now lives in a database. Because everything is wrapped in functions, this pipeline can be reused again and again. Together, these three functions form a complete ETL pipeline.

Running your first ETL pipeline

data = extract()
df = transform(data)
load(df)

Finally, we need to execute these functions so the pipeline actually runs. The extract produces the raw data, the transform gives us a DataFrame with the columns we want, and the load writes it into the SQLite database. All that is left is to run the code, either as a script with python etl.py or cell by cell with Shift + Enter if you are working in a notebook.
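
If you keep everything in etl.py, one common pattern (optional, and not required by the snippet above) is to guard these calls so they only run when the file is executed directly:

if __name__ == "__main__":
    data = extract()      # pull the raw JSON from the API
    df = transform(data)  # filter down to California universities
    load(df)              # write the result to my_lite_store.db

With that in place, running python etl.py from a terminal executes the whole pipeline end to end.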

It takes a moment to call the API and run the code, after which the pipeline prints the total number of universities returned for the United States and the number of those in California, and the transformed data is saved into the SQLite database file on disk. For any future requirement, we can reuse the same ETL pipeline to fetch and filter the data we need.

That, in short, is how the ETL process works using Python to get exactly what we want out of the data.

Conclusion

A language as versatile as Python is considered essential by many data engineers, data scientists, and software engineers. As a beginner in the field of data engineering, a core knowledge of Python is therefore a must-have skill. However, you do not need to know everything under the sun: you will keep learning as you go, so there is no need to panic and try to absorb it all at once.
