
Airflow Data Pipelines

Production-grade data orchestration pipelines using Apache Airflow to manage Airbyte ELT jobs and dbt transformations on Google Cloud Platform.

Automates secure data extraction, loading, and transformation for analytics and reporting.

Use Case

This repository orchestrates end-to-end data pipelines. Rather than running extracts and transformations manually, data engineers can use it to schedule and manage workflows dynamically. By integrating Apache Airflow with Airbyte (for EL) and dbt (for T), it ensures pipelines run reliably and efficiently, with detailed logging and alerting (via Microsoft Teams integration).

Minimum Requirements

To build and execute this project, the following minimum requirements must be met:

  • Python: 3.8 or higher
  • Orchestration: Apache Airflow 2.x
  • Cloud Access & Credentials:
    • GCP: Service account with permissions for Google Cloud Storage (GCS), BigQuery, Cloud Build, and Compute Engine.
    • Databases: Active connections to internal/external databases.
    • System: SSH keys to securely execute dbt transformations on remote VMs.

Architecture

flowchart TD
    Airflow([Apache Airflow DAGs])
    
    subgraph DataSources ["Data Sources"]
        GCS[("Raw GCS Files")]
        Webhooks[("Power Automate / Webhooks")]
        Firestore[("Firestore")]
    end
    
    subgraph GCP ["Google Cloud Platform"]
        CloudBuild["Cloud Build (CI/CD)"]
        ComputeAirflow["Airflow VM"]
        ComputeDBT["dbt VM"]
        BigQuery[("BigQuery (Data Warehouse)")]
    end
    
    subgraph AirbyteCloud ["Airbyte"]
        AirbyteSync["Airbyte Sync Jobs"]
    end

    CloudBuild -->|Deploys DAGs| ComputeAirflow
    Airflow -->|Triggers Data Sync| AirbyteSync
    DataSources -->|Extracted by| AirbyteSync
    AirbyteSync -->|Loaded into| BigQuery
    Airflow -->|Executes via SSH| ComputeDBT
    ComputeDBT -->|Transforms Data in| BigQuery
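
The flow above maps onto a simple DAG pattern: trigger an Airbyte sync, wait for it to finish, then run dbt over SSH on the transformation VM. The following is a minimal sketch of that pattern, not code taken from this repository; the DAG id, Airbyte connection UUID, connection IDs, and dbt path are placeholders.

# Minimal sketch of the Airbyte -> dbt orchestration pattern shown in the diagram.
# Connection IDs, the Airbyte connection UUID, and the dbt project path are
# placeholders -- adjust them to your environment.
from datetime import datetime

from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="example_airbyte_dbt_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # EL: trigger the Airbyte sync and wait for it to complete.
    airbyte_sync = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airbyte_prod",
        connection_id="00000000-0000-0000-0000-000000000000",  # placeholder UUID
        asynchronous=False,
        timeout=3600,
    )

    # T: run dbt on the remote transformation VM over SSH.
    dbt_run = SSHOperator(
        task_id="dbt_run",
        ssh_conn_id="ssh_dbt_vm",
        command="cd /opt/dbt/project && dbt run",
    )

    airbyte_sync >> dbt_run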

Tech Stack

  • Framework: Apache Airflow
  • Language: Python
  • Tools: Airbyte, dbt
  • Deployment: Google Cloud Build, Google Compute Engine
  • Cloud Integrations: Google Cloud Storage, BigQuery, Pub/Sub

Features

  • Automated end-to-end orchestration of ELT workflows
  • Trigger-based DAGs that validate GCS file updates against state tracked in BigQuery (a sketch follows this list)
  • Remote execution of dbt transformations via SSH on target VMs
  • Microsoft Teams webhook alerting for failed DAGs
  • CI/CD deployment via Cloud Build
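
The trigger-based validation above can be implemented as a short-circuit check that compares a GCS object's last-update time against a watermark recorded in BigQuery, skipping downstream tasks when nothing new has landed. The sketch below is illustrative only; the bucket, object, table, and connection names are assumptions, not values from this repository.

# Illustrative sketch: skip the pipeline unless the raw GCS file is newer than
# the last load recorded in BigQuery. All names below are placeholders.
from airflow.operators.python import ShortCircuitOperator
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
from airflow.providers.google.cloud.hooks.gcs import GCSHook

def gcs_file_is_newer_than_warehouse():
    gcs = GCSHook(gcp_conn_id="gcp_prod")
    bq = BigQueryHook(gcp_conn_id="gcp_prod", use_legacy_sql=False)

    # When the raw file in GCS was last updated.
    gcs_updated = gcs.get_blob_update_time(
        bucket_name="example-raw-bucket", object_name="exports/latest.csv"
    )

    # Load watermark kept in the warehouse.
    row = bq.get_first("SELECT MAX(loaded_at) FROM `example_project.meta.load_state`")
    last_loaded = row[0] if row else None

    # Continue only when the file is newer than what has already been loaded.
    return last_loaded is None or (gcs_updated is not None and gcs_updated > last_loaded)

# Inside a DAG definition:
check_new_data = ShortCircuitOperator(
    task_id="check_new_data",
    python_callable=gcs_file_is_newer_than_warehouse,
)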

Project Structure

.
├── cloudbuild.yaml                             # Google Cloud Build CI/CD pipeline configuration
├── README.md                                   # Project documentation
├── include/
│   └── teams_alert.py                          # Alerting module for MS Teams notifications
├── firestore_master.py                         # DAG for Firestore -> GCS -> BigQuery pipelines
├── nach_pull_airbyte_dbt.py                    # DAG for NACH pull orchestration
├── power_automate_airbyte_insurance.py         # DAG triggered by Power Automate webhooks
├── ssh_dbt_mgpayment_staging.py                # DAG for executing MGPayment transformations
└── teleaccess_master.py                        # Master DAG for Teleaccess data processing
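
include/teams_alert.py provides the Teams notification logic referenced in the Features section. The repository's actual implementation may differ; the sketch below shows one common shape for such a module, assuming the webhook URL is stored in an Airflow Variable named teams_webhook_url.

# Sketch of an MS Teams failure callback (the real include/teams_alert.py may
# differ). Assumes the webhook URL is stored in the Airflow Variable
# "teams_webhook_url".
import requests

from airflow.models import Variable

def teams_failure_alert(context):
    """on_failure_callback that posts a failed-task summary to Microsoft Teams."""
    ti = context["task_instance"]
    message = (
        f"DAG `{ti.dag_id}` failed on task `{ti.task_id}` "
        f"(run: {context['run_id']}). Log: {ti.log_url}"
    )
    requests.post(Variable.get("teams_webhook_url"), json={"text": message}, timeout=10)

A DAG opts into this alerting by passing the callback in its default_args, for example default_args={"on_failure_callback": teams_failure_alert}.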

Step-by-Step Execution Guide

Follow these steps to deploy or run the DAGs in your local/dev Airflow environment:

1. Clone the Repository

Open your terminal and clone the repository, then navigate into the project directory:

git clone https://github.com/mpandey95/airflow-data-pipelines.git
cd airflow-data-pipelines

2. Configure Airflow Connections

Create the following connections in the Airflow UI (Admin → Connections):

  • gcp_prod: Google Cloud connection for GCS and BigQuery.
  • airbyte_prod: Connection pointing to your Airbyte instance.
  • ssh_dbt_vm: SSH connection configured for your remote dbt transformation server.
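
If you prefer not to create these by hand in the UI, they can also be seeded programmatically through Airflow's metadata session. The snippet below is a rough sketch; hosts, ports, key paths, and the exact extra fields depend on your environment and provider versions.

# Rough sketch: seed the three required connections programmatically.
# All hosts, paths, and credentials below are placeholders; the exact `extra`
# keys expected depend on your google/ssh provider versions.
import json

from airflow import settings
from airflow.models import Connection

connections = [
    Connection(
        conn_id="gcp_prod",
        conn_type="google_cloud_platform",
        extra=json.dumps({"key_path": "/path/to/service-account.json",
                          "project": "your-gcp-project"}),
    ),
    Connection(conn_id="airbyte_prod", conn_type="airbyte",
               host="your-airbyte-host", port=8001),
    Connection(conn_id="ssh_dbt_vm", conn_type="ssh",
               host="your-dbt-vm-ip", login="dbt-user",
               extra=json.dumps({"key_file": "/path/to/ssh/private_key"})),
]

session = settings.Session()
for conn in connections:
    # Only add connections that do not already exist.
    if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
        session.add(conn)
session.commit()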

3. Deploy DAGs Locally

Copy the Python files into your $AIRFLOW_HOME/dags directory so the Airflow scheduler can pick them up:

cp *.py $AIRFLOW_HOME/dags/
cp -r include/ $AIRFLOW_HOME/dags/
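
After copying the files, you can confirm the scheduler will import them cleanly with a quick DagBag check; adjust the path to your own AIRFLOW_HOME.

# Sanity check: parse the dags folder and report any import errors.
from airflow.models import DagBag

dag_bag = DagBag(dag_folder="/path/to/airflow/dags", include_examples=False)
print("DAGs found:", sorted(dag_bag.dag_ids))
print("Import errors:", dag_bag.import_errors or "none")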

Deployment (GCP)

This repository leverages Google Cloud Build for automated CI/CD.

Method 1: Automated Deployment via Cloud Build

The cloudbuild.yaml pipeline implements a copy-based CI/CD workflow that pushes DAG files to the Airflow VM.

  1. Ensure your Cloud Build trigger defines the following Substitution Variables in GCP:
    • _VM_NAME: Target Airflow VM.
    • _VM_ZONE: GCP zone.
  2. When triggered, the pipeline:
    • Cleans the staging directory on the VM (/root/airflow/cicd-staging).
    • Uses gcloud compute scp to copy new DAGs to the remote machine.
    • Moves files from staging to the active /root/airflow/dags directory.

Method 2: Manual Trigger via Google Cloud CLI

You can manually run the Cloud Build pipeline via CLI:

gcloud builds submit --config cloudbuild.yaml . --substitutions=_VM_NAME="your-vm-name",_VM_ZONE="your-vm-zone"

Manish Pandey — Senior DevOps/Platform Engineer

🛠️ Technology Stack

☁️ Cloud & Platforms

GCP · AWS

⚙️ Platform & DevOps

Kubernetes · Docker · Terraform · Helm · Ansible · CI/CD

🔐 Security & Ops

IAM · Networking · Monitoring · Secrets Management

🧑‍💻 Programming

Python · Bash · YAML

💾 Database

SQL · MongoDB

Connect With Me

License

See LICENSE | Support: GitHub · LinkedIn
