Data Engineering Exercise

Description

This project is a solution to a test exercise for data engineers, which also includes some data analytics questions.
The task was used to gain hands-on experience with Metaflow, a tool for building data engineering pipelines that Netflix recently open-sourced.
The solution is structured by applying software development best practices to a data science project, as described in this article.

The task

Specified by this document

The solution

Consists of

Jupyter notebooks containing data exploration work.
A pipeline: a chain of transformations applied to the input data in order to answer the questions of the task.
The actual transformations and their tests.

What is the pipeline?

It is an ETL process: nothing more than a chain (or acyclic graph) of three types of operations:

Extract (data from some source)
Transform (that data)
Load (that data into another destination)
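
The three operations can be illustrated with a minimal, standard-library-only sketch. It is not the project's actual pipeline: the function names, the CSV columns (`name`, `amount`), and the filtering rule are hypothetical and chosen only to show how the stages chain together.

```python
import csv
import io


def extract(source):
    """Extract: read rows from a CSV-formatted text stream."""
    return list(csv.DictReader(source))


def transform(rows):
    """Transform: keep rows with a positive amount and cast it to float."""
    return [
        {"name": row["name"], "amount": float(row["amount"])}
        for row in rows
        if float(row["amount"]) > 0
    ]


def load(rows, target):
    """Load: write the transformed rows to another CSV text stream."""
    writer = csv.DictWriter(target, fieldnames=["name", "amount"])
    writer.writeheader()
    writer.writerows(rows)


def run_etl(source, target):
    """Chain the three stages: extract, then transform, then load."""
    load(transform(extract(source)), target)


# Hypothetical in-memory input and output, standing in for real sources.
raw = io.StringIO("name,amount\nalice,10\nbob,-3\ncarol,2.5\n")
out = io.StringIO()
run_etl(raw, out)
print(out.getvalue())
```

Keeping each stage a separate function makes it possible to test the transform step without touching any real data source.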

Project structure

The project is structured according to the idea given in the article.

Folder/File	Content
data	Contains the input data of the task, stored locally for simplicity.
notebooks	Every data-related project requires data exploration and analysis before the actual implementation starts, and Jupyter notebooks are the common tool for it. Since Jupyter is a separate instrument, different from the usual programming routine, it needs a separate place in the project.
pipeline.py	Our main deliverable: the implementation of the required ETL with Metaflow.
transformations.py	The logic of the data transformations, organized in small reusable functions.
tests	Every transformation has a test proving its correctness.
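
To make the split between small reusable transformations and their tests concrete, here is a hedged sketch of what one such pair might look like. The function `deduplicate` and its test are hypothetical examples, not code taken from this repository's transformations.py or tests folder.

```python
def deduplicate(records, key):
    """Return records with duplicates (by `key`) removed, keeping the first one."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique


def test_deduplicate_keeps_first_occurrence():
    """A plain assertion-based test, in the style a runner such as pytest collects."""
    records = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "c"}]
    expected = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
    assert deduplicate(records, key="id") == expected


# Run the test directly so the sketch is self-checking.
test_deduplicate_keeps_first_occurrence()
```

Because the transformation is a pure function of its input, the test needs no fixtures, files, or pipeline machinery.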

Environment setup

1. Set up the Data Science environment

The easiest way to get all the data science tools in one go is to install Anaconda, which includes conda.

brew install --cask anaconda
export PATH="/usr/local/anaconda3/bin:$PATH"

This installs Anaconda's own version of Python with all the required packages and tools.

Make sure that your system now uses this Python by default: /usr/local/anaconda3/bin/python
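
One way to check which interpreter is active is to ask Python itself. The sketch below assumes Anaconda was installed to the /usr/local/anaconda3 location mentioned above; the helper `uses_anaconda` is a hypothetical name introduced only for this illustration, and the prefix may differ on other machines.

```python
import sys
from pathlib import Path

# Assumption: Anaconda lives at the location used in this README.
ANACONDA_PREFIX = Path("/usr/local/anaconda3")


def uses_anaconda(executable, prefix=ANACONDA_PREFIX):
    """Return True if the interpreter path sits inside the Anaconda prefix."""
    return prefix in Path(executable).parents


# sys.executable is the absolute path of the interpreter running this code.
print(sys.executable)
print("Anaconda Python:", uses_anaconda(sys.executable))
```

If the second line prints False, the PATH export above has probably not taken effect in the current shell.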

The easiest way for a developer to work in the new environment is to use VSCode with the Python extension installed. Recent versions of this extension run Jupyter notebooks out of the box. Whenever you execute Python code, make sure you are using Anaconda's Python.

2. Set up Metaflow

Using Anaconda's Python, simply run:

pip install metaflow
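
To confirm that the package landed in the interpreter you intend to use, a small check along these lines may help. It relies only on the standard library; `is_installed` is a hypothetical helper name, not part of Metaflow.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


# Run this with the same Python that will execute pipeline.py.
print("metaflow installed:", is_installed("metaflow"))
```

The check looks the module up without importing it, so it is safe to run even if the installation is broken.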

Running the solution

With the setup above, the notebook cells should run directly from VSCode.

Execute pipeline

From the project's root:
python pipeline.py run

Additional: Redraw the graph of the pipeline

Make sure you have Graphviz installed in your Anaconda environment.

From the project's root:
python pipeline.py output-dot | dot -Grankdir=TB -Tpng -o graph.png
