Azure End-to-End Data Engineering Netflix Data Streaming Project

Table of Contents

  1. Project Description
  2. Technical Components
  3. Data Architecture
  4. Azure Data Factory (Ingestion)
  5. Azure Databricks
  6. Power BI and Azure Synapse

Project Description

This project focuses on building a comprehensive, end-to-end Azure Data Engineering solution that seamlessly integrates streaming and batch data ingestion, transformation, and analytics. It follows the principles of the Medallion Architecture, ensuring a structured and scalable approach to data processing. By implementing this architecture, the solution facilitates an efficient and organized data flow, transitioning from raw ingestion to progressively refined and enriched datasets. These optimized datasets will ultimately support real-time and batch analytics, enabling stakeholders to derive meaningful insights and make data-driven decisions with confidence.

Technical Components

  • GitHub: Data source.
  • Azure Data Lake: Centralized storage for raw and transformed data.
  • Azure Data Factory (ADF): Data ingestion.
  • Databricks with Delta Live Tables: Scalable data transformation with real-time processing and automated batch/streaming management.
  • Azure Synapse Analytics: Data warehouse.
  • Power BI: Reporting.

Data Architecture

image

Azure Data Factory (Ingestion)

Objective

The purpose of this section is to explain the pipeline implemented in Azure Data Factory, detailing its structure and functionality for data ingestion, and to highlight how the architecture is designed to be efficient.

1- PL_Extract_Data:

This pipeline extracts all the Netflix files from GitHub except the titles file, which is already stored in the 00-raw container. The extraction uses a dynamic Copy activity whose source URL and sink destination are supplied through a parameter; a forEach activity iterates over the parameter entries and loads each file. Before the loop runs, a Validation activity checks that the titles file exists in the 00-raw container.

image

Steps:
  • Creation of a Dynamic Copy Activity:

    1- Creation of source connection:

    image image image image

    2- Creation of sink connection:

    image image image
  • Creation of a parameter inside the pipeline:

    1- Create a JSON file: A JSON file is used to define dynamic parameters that automate the extraction and loading of data. Its structure is broken down below:

    • folder_name: Target folder.

    • file_name: Target file name and format for source and sink.

      The file is then uploaded to the Data Lake, into the parameters folder (a minimal sketch of this JSON appears after this list). Format of JSON

    image
  • Creation of a forEach Activity with the Dynamic Copy inside: (extracts the values from the parameter JSON)

    image
  • Creation of a Validation Activity: (checks whether the titles folder exists in the 00-raw container)

    image image
  • PL_Extract_Data results:

    Bronze Folder:

    image
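A minimal sketch of what the parameter file described above might look like, expressed here as the Python that would generate it. The keys (folder_name, file_name) follow the structure listed in the steps; the specific folder and file names are placeholders, not the project's actual values.

```python
import json

# Hypothetical contents of the parameter file read by the forEach activity.
# The keys match the structure described above; folder/file names are
# illustrative placeholders only.
parameters = [
    {"folder_name": "netflix_cast",      "file_name": "netflix_cast.csv"},
    {"folder_name": "netflix_category",  "file_name": "netflix_category.csv"},
    {"folder_name": "netflix_countries", "file_name": "netflix_countries.csv"},
    {"folder_name": "netflix_directors", "file_name": "netflix_directors.csv"},
]

# The resulting JSON is uploaded to the Data Lake (parameters folder); a Lookup
# activity reads it and the forEach activity iterates over its entries.
with open("parameters.json", "w") as f:
    json.dump(parameters, f, indent=2)
```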

Azure Databricks

Architecture:

In this project, we have implemented a Medallion Architecture (Bronze → Silver → Gold) using two different methodologies to process the data:

  • Workflows for the Bronze and Silver layers
  • Delta Live Tables (DLT) for the Gold layer

Although this project demonstrates both methodologies and allows for a comparison of their advantages and differences, in a real-world scenario, the ideal approach would be to implement DLT across all layers (Bronze, Silver, and Gold) because:

  • Automates execution and eliminates the need to manually manage dependencies.
  • Natively supports both streaming and batch processing within the same structure.
  • Includes built-in data quality rules (EXPECT), removing the need for external validations.
  • Integrates with Unity Catalog for better governance and security.
  • Automatically optimizes performance with AUTO OPTIMIZE and AUTO COMPACT.

Unity Catalog - Objective:

To ensure secure and efficient data governance, Unity Catalog is utilized for managing credentials and access controls across different data layers. Unity Catalog provides a centralized approach to defining permissions, enabling fine-grained access control for users, groups, and service principals. Through its integration with cloud identity providers, it allows organizations to establish secure authentication mechanisms and enforce role-based access (RBAC). Additionally, Unity Catalog simplifies credential management by enabling secure connections to storage accounts, ensuring that only authorized entities can read or write data while maintaining compliance with enterprise security policies.

Steps:

  • Creation of an Access Connector for Azure Databricks:

    image image
  • Creation of a credential:

    image
  • Creation of external tables:

    image image image image image
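As a rough sketch of the governance objects created in these steps, the statements below (run from a Databricks notebook via spark.sql) register an external location on top of a storage credential and define an external table over it. The credential name, storage account, catalog, and paths are illustrative assumptions, not the project's actual values.

```python
# Sketch of Unity Catalog objects, assuming a storage credential named
# "nf_credential" (backed by the Databricks access connector) already exists.
# All names and paths below are placeholders.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS netflix_bronze
    URL 'abfss://bronze@netflixdata.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL nf_credential)
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS netflix_catalog.bronze.netflix_titles
    USING DELTA
    LOCATION 'abfss://bronze@netflixdata.dfs.core.windows.net/netflix_titles/'
""")
```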

Ingestion - Objective:

The 00-raw container constantly receives new Netflix titles files. To handle this, incremental data loading with Auto Loader is implemented, using a checkpoint to track which files have and have not been loaded. The checkpoint is stored in a dedicated container, separate from the data layers, to ensure data consistency and avoid unintended deletions due to lifecycle policies. This setup guarantees reliable tracking of processed files without interfering with the raw, silver, or gold layers, as recommended by Microsoft's best practices.

After checking the average size of the files to be loaded, a duration of 2 minutes has been set for the process. A workflow is then created, along with its trigger, to automate and manage the data loading process efficiently.

Steps:

  • Test how much time is needed to process a file

    image
  • Creation of a workflow

    image
  • Creation of a trigger

    image
  • Workflow Result:

    image

The following notebook reads and writes data as a stream using Apache Spark, specifically working with CSV files stored in Azure Data Lake Storage (ADLS).

01_Bronze_AutoLoader
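The notebook itself is linked above; the snippet below is only a minimal sketch of this kind of Auto Loader ingestion: it reads new CSV files from the raw container and appends them to the bronze layer, keeping the checkpoint in a separate container as described. The paths and the 2-minute trigger interval are assumptions based on the description above.

```python
# Minimal Auto Loader sketch (paths and option values are illustrative).
raw_path        = "abfss://00-raw@netflixdata.dfs.core.windows.net/netflix_titles/"
bronze_path     = "abfss://bronze@netflixdata.dfs.core.windows.net/netflix_titles/"
checkpoint_path = "abfss://checkpoint-state@netflixdata.dfs.core.windows.net/netflix_titles/"

# Incrementally discover new CSV files; the inferred schema is tracked next to
# the checkpoint so later runs only pick up files not processed yet.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", checkpoint_path)
      .option("header", "true")
      .load(raw_path))

# Write to the bronze layer; the checkpoint records which files were loaded.
(df.writeStream
   .format("delta")
   .option("checkpointLocation", checkpoint_path)
   .trigger(processingTime="2 minutes")
   .start(bronze_path))
```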

  • Storage result:

    Bronze Folder:

    image

    Checkpoint-state Folder:

    image

Transformations - Objective:

The silver layer has been implemented in 2 independent workflows:

1- Workflow: All files are loaded except for the titles files; the reference notebooks are as follows:

02_01_Silver_LookUp

02_02_Silver_forEach

Steps:

2- Workflow: In this case, an independent workflow has been created for the titles files because of their incremental loading; the reference notebooks are as follows:

02_03_Silver_LookUp

02_04_Silver_forEach

02_05_Silver_False_Notebook

Steps:
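As a hedged sketch of how this lookup/forEach pattern can be wired in a Databricks workflow: a lookup notebook publishes the list of datasets for the job's For Each task, and the per-item notebook receives the dataset name through a widget and writes the cleaned result to the silver layer. Dataset names, paths, and the cleaning logic are placeholders, not the project's actual transformations.

```python
# 02_xx_Silver_LookUp (sketch): publish the datasets the For Each task iterates over.
datasets = ["netflix_cast", "netflix_category", "netflix_countries", "netflix_directors"]
dbutils.jobs.taskValues.set(key="datasets", value=datasets)

# 02_xx_Silver_forEach (sketch): one run per dataset, name passed in as a widget.
dbutils.widgets.text("dataset", "")
name = dbutils.widgets.get("dataset")

df = spark.read.format("delta").load(f"abfss://bronze@netflixdata.dfs.core.windows.net/{name}/")
df = df.dropDuplicates().na.drop(how="all")   # illustrative cleaning only
(df.write.format("delta")
   .mode("overwrite")
   .save(f"abfss://silver@netflixdata.dfs.core.windows.net/{name}/"))
```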

Delta Live Tables - Objective:

Delta Live Tables (DLT) is a declarative framework in Databricks that simplifies the development, execution, and monitoring of ETL pipelines. Unlike traditional approaches, DLT automates pipeline orchestration, manages dependencies, and ensures data quality and reliability.
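The gold-layer notebook (03_Gold_DLT, created in the steps below) follows this declarative style. The snippet below is a minimal sketch of such a DLT definition, not the project's actual code: the table name, source path, and EXPECT rule are illustrative only.

```python
import dlt
from pyspark.sql import functions as F

# Minimal DLT sketch for a gold-layer table (names, path, and rule are placeholders).
@dlt.table(comment="Curated Netflix titles for reporting")
@dlt.expect_or_drop("valid_show_id", "show_id IS NOT NULL")   # built-in data quality rule
def gold_netflix_titles():
    return (
        spark.read.format("delta")
             .load("abfss://silver@netflixdata.dfs.core.windows.net/netflix_titles/")
             .withColumn("ingest_date", F.current_date())
    )
```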

Steps:

  • Create a DLT_notebook: 03_Gold_DLT

  • Create a DLT Pipeline:

    image image image image
  • DLT Pipeline Result:

    image image image image

Power BI and Azure Synapse

Objective:

Finally, after all processing steps, the Gold layer data is exported to Power BI or Azure Synapse for report generation and visualization. The connection to these two tools is made directly from Databricks.

Steps:

  • Create a Power BI connection: Power_BI image

    image
