Azure End-to-End Data Engineering Netflix Data Streaming Project

Table of Contents

  1. Project Description
  2. Technical Components
  3. Data Architecture
  4. Azure Data Factory (Ingestion)
  5. Azure Databricks
  6. Power BI and Azure Synapse

Project Description

This project focuses on building a comprehensive, end-to-end Azure Data Engineering solution that seamlessly integrates streaming and batch data ingestion, transformation, and analytics. It follows the principles of the Medallion Architecture, ensuring a structured and scalable approach to data processing. By implementing this architecture, the solution facilitates an efficient and organized data flow, transitioning from raw ingestion to progressively refined and enriched datasets. These optimized datasets will ultimately support real-time and batch analytics, enabling stakeholders to derive meaningful insights and make data-driven decisions with confidence.

Technical Components

  • GitHub: Data source.
  • Azure Data Lake: Centralized storage for raw and transformed data.
  • Azure Data Factory (ADF): Data ingestion.
  • Databricks with Delta Live Tables: Scalable data transformation with real-time processing and automated batch/streaming management.
  • Azure Synapse Analytics: Data warehouse.
  • Power BI: Reporting.

Data Architecture

image

Azure Data Factory (Ingestion)

Objective

The purpose of this section is to explain the pipeline implemented in Azure Data Factory, detailing its structure and functionality for data ingestion, and to highlight how the architecture is designed to be efficient.

1- PL_Extract_Data:

This pipeline extracts all the Netflix files from GitHub except the titles file, which is already stored in the 00-raw container. The extraction uses a dynamic Copy activity whose source URL and sink destination are supplied through a parameter; a forEach activity iterates over the parameter entries and loads each file. Before the loop runs, a Validation activity checks that the titles file exists in the 00-raw container.

image

Steps:
  • Creation of a Dynamic Copy Activity:

    1- Creation of source connection:

    image image image image

    2- Creation of sink connection:

    image image image
  • Creation of a parameter inside the pipeline:

    1- Create a JSON file: A JSON file is used to define dynamic parameters that automate the extraction and loading of data. Its structure is broken down below:

    • folder_name: Target folder.

    • file_name: Target file name and format for source and sink.

      The file is then uploaded to the Data Lake, into the parameters folder (a minimal sketch of this JSON appears after this list). Format of JSON

    image
  • Creation of a forEach Activity with the Dynamic Copy inside: (extracts the values from the parameter JSON)

    image
  • Creation of a Validation Activity: (checks whether the titles folder exists in the 00-raw container)

    image image
  • PL_Extract_Data results:

    Bronze Folder:

    image
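A minimal sketch of what the parameter file described above might look like, expressed here as the Python that would generate it. The keys (folder_name, file_name) follow the structure listed in the steps; the specific folder and file names are placeholders, not the project's actual values.

```python
import json

# Hypothetical contents of the parameter file read by the forEach activity.
# The keys match the structure described above; folder/file names are
# illustrative placeholders only.
parameters = [
    {"folder_name": "netflix_cast",      "file_name": "netflix_cast.csv"},
    {"folder_name": "netflix_category",  "file_name": "netflix_category.csv"},
    {"folder_name": "netflix_countries", "file_name": "netflix_countries.csv"},
    {"folder_name": "netflix_directors", "file_name": "netflix_directors.csv"},
]

# The resulting JSON is uploaded to the Data Lake (parameters folder); a Lookup
# activity reads it and the forEach activity iterates over its entries.
with open("parameters.json", "w") as f:
    json.dump(parameters, f, indent=2)
```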

Azure Databricks

Architecture:

In this project, we have implemented a Medallion Architecture (Bronze → Silver → Gold) using two different methodologies to process the data:

  • Workflows for the Bronze and Silver layers
  • Delta Live Tables (DLT) for the Gold layer

Although this project demonstrates both methodologies and allows for a comparison of their advantages and differences, in a real-world scenario, the ideal approach would be to implement DLT across all layers (Bronze, Silver, and Gold) because:

  • Automates execution and eliminates the need to manually manage dependencies.
  • Natively supports both streaming and batch processing within the same structure.
  • Includes built-in data quality rules (EXPECT), removing the need for external validations.
  • Integrates with Unity Catalog for better governance and security.
  • Automatically optimizes performance with AUTO OPTIMIZE and AUTO COMPACT.

Unity Catalog - Objective:

To ensure secure and efficient data governance, Unity Catalog is utilized for managing credentials and access controls across different data layers. Unity Catalog provides a centralized approach to defining permissions, enabling fine-grained access control for users, groups, and service principals. Through its integration with cloud identity providers, it allows organizations to establish secure authentication mechanisms and enforce role-based access (RBAC). Additionally, Unity Catalog simplifies credential management by enabling secure connections to storage accounts, ensuring that only authorized entities can read or write data while maintaining compliance with enterprise security policies.

Steps:

  • Creation of an Access Connector for Azure Databricks:

    image image
  • Creation of a credential:

    image
  • Creation of external tables:

    image image image image image
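As a rough sketch of the governance objects created in these steps, the statements below (run from a Databricks notebook via spark.sql) register an external location on top of a storage credential and define an external table over it. The credential name, storage account, catalog, and paths are illustrative assumptions, not the project's actual values.

```python
# Sketch of Unity Catalog objects, assuming a storage credential named
# "nf_credential" (backed by the Databricks access connector) already exists.
# All names and paths below are placeholders.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS netflix_bronze
    URL 'abfss://bronze@netflixdata.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL nf_credential)
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS netflix_catalog.bronze.netflix_titles
    USING DELTA
    LOCATION 'abfss://bronze@netflixdata.dfs.core.windows.net/netflix_titles/'
""")
```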

Ingestion - Objective:

The 00-raw container constantly receives new Netflix titles files. To handle this, incremental data loading with Auto Loader is implemented, using a checkpoint to track which files have and have not been loaded. The checkpoint is stored in a dedicated container, separate from the data layers, to ensure data consistency and avoid unintended deletions due to lifecycle policies. This setup guarantees reliable tracking of processed files without interfering with the raw, silver, or gold layers, as recommended by Microsoft's best practices.

After checking the average size of the files to be loaded, a duration of 2 minutes has been set for the process. A workflow is then created, along with its trigger, to automate and manage the data loading process efficiently.

Steps:

  • Test how much time is needed to process a file

    image
  • Creation of a workflow

    image
  • Creation of a trigger

    image
  • Workflow Result:

    image

The following notebook reads and writes data as a stream using Apache Spark, specifically working with CSV files stored in Azure Data Lake Storage (ADLS).

01_Bronze_AutoLoader
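The notebook itself is linked above; the snippet below is only a minimal sketch of this kind of Auto Loader ingestion: it reads new CSV files from the raw container and appends them to the bronze layer, keeping the checkpoint in a separate container as described. The paths and the 2-minute trigger interval are assumptions based on the description above.

```python
# Minimal Auto Loader sketch (paths and option values are illustrative).
raw_path        = "abfss://00-raw@netflixdata.dfs.core.windows.net/netflix_titles/"
bronze_path     = "abfss://bronze@netflixdata.dfs.core.windows.net/netflix_titles/"
checkpoint_path = "abfss://checkpoint-state@netflixdata.dfs.core.windows.net/netflix_titles/"

# Incrementally discover new CSV files; the inferred schema is tracked next to
# the checkpoint so later runs only pick up files not processed yet.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", checkpoint_path)
      .option("header", "true")
      .load(raw_path))

# Write to the bronze layer; the checkpoint records which files were loaded.
(df.writeStream
   .format("delta")
   .option("checkpointLocation", checkpoint_path)
   .trigger(processingTime="2 minutes")
   .start(bronze_path))
```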

  • Storage result:

    Bronze Folder:

    image

    Checkpoint-state Folder:

    image

Transformations - Objective:

The silver layer has been implemented in 2 independent workflows:

1- Workflow: All files are loaded except for the titles files; the reference notebooks are as follows:

02_01_Silver_LookUp

02_02_Silver_forEach

Steps:

2- Workflow: In this case, an independent workflow has been created for the titles files because of their incremental loading; the reference notebooks are as follows:

02_03_Silver_LookUp

02_04_Silver_forEach

02_05_Silver_False_Notebook

Steps:
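As a hedged sketch of how this lookup/forEach pattern can be wired in a Databricks workflow: a lookup notebook publishes the list of datasets for the job's For Each task, and the per-item notebook receives the dataset name through a widget and writes the cleaned result to the silver layer. Dataset names, paths, and the cleaning logic are placeholders, not the project's actual transformations.

```python
# 02_xx_Silver_LookUp (sketch): publish the datasets the For Each task iterates over.
datasets = ["netflix_cast", "netflix_category", "netflix_countries", "netflix_directors"]
dbutils.jobs.taskValues.set(key="datasets", value=datasets)

# 02_xx_Silver_forEach (sketch): one run per dataset, name passed in as a widget.
dbutils.widgets.text("dataset", "")
name = dbutils.widgets.get("dataset")

df = spark.read.format("delta").load(f"abfss://bronze@netflixdata.dfs.core.windows.net/{name}/")
df = df.dropDuplicates().na.drop(how="all")   # illustrative cleaning only
(df.write.format("delta")
   .mode("overwrite")
   .save(f"abfss://silver@netflixdata.dfs.core.windows.net/{name}/"))
```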

Delta Live Tables - Objective:

Delta Live Tables (DLT) is a declarative framework in Databricks that simplifies the development, execution, and monitoring of ETL pipelines. Unlike traditional approaches, DLT automates pipeline orchestration, manages dependencies, and ensures data quality and reliability.
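The gold-layer notebook (03_Gold_DLT, created in the steps below) follows this declarative style. The snippet below is a minimal sketch of such a DLT definition, not the project's actual code: the table name, source path, and EXPECT rule are illustrative only.

```python
import dlt
from pyspark.sql import functions as F

# Minimal DLT sketch for a gold-layer table (names, path, and rule are placeholders).
@dlt.table(comment="Curated Netflix titles for reporting")
@dlt.expect_or_drop("valid_show_id", "show_id IS NOT NULL")   # built-in data quality rule
def gold_netflix_titles():
    return (
        spark.read.format("delta")
             .load("abfss://silver@netflixdata.dfs.core.windows.net/netflix_titles/")
             .withColumn("ingest_date", F.current_date())
    )
```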

Steps:

  • Create a DLT_notebook: 03_Gold_DLT

  • Create a DLT Pipeline:

    image image image image
  • DLT Pipeline Result:

    image image image image

Power BI and Azure Synapse

Objective:

Finally, after all processing steps, the Gold layer data is exported to Power BI or Azure Synapse for report generation and visualization. The connection to these two tools is made directly from Databricks.

Steps:

  • Create a Power BI connection: Power_BI image

    image
