Production-grade data orchestration pipelines using Apache Airflow to manage Airbyte ELT jobs and dbt transformations on Google Cloud Platform.
Automates secure data extraction, loading, and transformation for analytics and reporting.
This repository is designed for orchestrating end-to-end data pipelines. Rather than manually running extracts and transformations, data engineers can use this system to schedule, monitor, and manage workflows. By integrating Apache Airflow with Airbyte (for extract and load) and dbt (for transform), it keeps pipelines running reliably and efficiently, with detailed logging and alerting via Microsoft Teams integration.
To build and execute this project, the following minimum requirements must be met:
- Python: 3.8 or higher
- Orchestration: Apache Airflow 2.x
- Cloud Access & Credentials:
- GCP: Service account with permissions for Google Cloud Storage (GCS), BigQuery, Cloud Build, and Compute Engine.
- Databases: Active connections to internal/external databases.
- System: SSH keys to securely execute dbt transformations on remote VMs.
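As a rough sketch of the GCP setup, the service account and its role bindings can be provisioned along the following lines. The project ID, account name, and the exact role set are illustrative assumptions; adapt them to your org's IAM policy:

```bash
# Illustrative provisioning; role choices are assumptions, not the repo's actual policy.
PROJECT_ID="your-gcp-project"
gcloud iam service-accounts create airflow-pipelines --project "$PROJECT_ID"

SA="airflow-pipelines@${PROJECT_ID}.iam.gserviceaccount.com"
for ROLE in roles/storage.objectAdmin roles/bigquery.dataEditor \
            roles/bigquery.jobUser roles/cloudbuild.builds.editor \
            roles/compute.instanceAdmin.v1; do
  gcloud projects add-iam-policy-binding "$PROJECT_ID" \
      --member "serviceAccount:$SA" --role "$ROLE"
done
```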
```mermaid
flowchart TD
    Airflow([Apache Airflow DAGs])

    subgraph DataSources ["Data Sources"]
        GCS[("Raw GCS Files")]
        Webhooks[("Power Automate / Webhooks")]
        Firestore[("Firestore")]
    end

    subgraph GCP ["Google Cloud Platform"]
        CloudBuild["Cloud Build (CI/CD)"]
        ComputeAirflow["Airflow VM"]
        ComputeDBT["dbt VM"]
        BigQuery[("BigQuery (Data Warehouse)")]
    end

    subgraph AirbyteCloud ["Airbyte"]
        AirbyteSync["Airbyte Sync Jobs"]
    end

    CloudBuild -->|Deploys DAGs| ComputeAirflow
    Airflow -->|Triggers Data Sync| AirbyteSync
    DataSources -->|Extracted by| AirbyteSync
    AirbyteSync -->|Loaded into| BigQuery
    Airflow -->|Executes via SSH| ComputeDBT
    ComputeDBT -->|Transforms Data in| BigQuery
```
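The diagram maps onto a small amount of DAG code. Below is a minimal, hypothetical sketch of the Airbyte-then-dbt pattern these DAGs follow; the connection IDs, the Airbyte `connection_id`, and the dbt command are placeholders, not the repository's actual values:

```python
# Minimal sketch of the EL (Airbyte) -> T (dbt over SSH) pattern.
from datetime import datetime

from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="example_airbyte_dbt",  # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # EL: trigger the Airbyte sync that loads raw data into BigQuery
    sync_raw_data = AirbyteTriggerSyncOperator(
        task_id="sync_raw_data",
        airbyte_conn_id="airbyte_prod",
        connection_id="00000000-0000-0000-0000-000000000000",  # placeholder UUID
        asynchronous=False,
    )

    # T: run dbt on the remote VM over SSH once the load completes
    run_dbt = SSHOperator(
        task_id="run_dbt",
        ssh_conn_id="ssh_dbt_vm",
        command="cd /opt/dbt/project && dbt run --target prod",  # placeholder path
    )

    sync_raw_data >> run_dbt
```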
- Framework: Apache Airflow
- Language: Python
- Tools: Airbyte, dbt
- Deployment: Google Cloud Build, Google Compute Engine
- Cloud Integrations: Google Cloud Storage, BigQuery, Pub/Sub
- Automated provisioning of ELT workflows
- Trigger-based runs that validate GCS file updates against BigQuery load state before kicking off a sync
- Remote execution of dbt transformations via SSH on target VMs
- Microsoft Teams webhook alerting for failed DAGs (see the callback sketch after this list)
- CI/CD deployment via Cloud Build
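The Teams alerting in `include/teams_alert.py` plugs into Airflow as a failure callback. The sketch below shows one way such a callback can be written, assuming the webhook URL lives in an Airflow Variable; the variable name and message format are illustrative, not the module's actual contents:

```python
# Hypothetical Teams failure callback; not the repo's actual teams_alert.py.
import requests

from airflow.models import Variable


def teams_alert(context):
    """on_failure_callback: post a message to a Teams incoming webhook."""
    ti = context["task_instance"]
    message = (
        f"DAG `{ti.dag_id}` failed on task `{ti.task_id}` "
        f"(run {context['run_id']}). Logs: {ti.log_url}"
    )
    webhook_url = Variable.get("teams_webhook_url")  # assumed Variable name
    requests.post(webhook_url, json={"text": message}, timeout=10)
```

Wiring it up is a one-liner per DAG, e.g. `default_args={"on_failure_callback": teams_alert}`.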
```
.
├── cloudbuild.yaml                       # Google Cloud Build CI/CD pipeline configuration
├── README.md                             # Project documentation
├── include/
│   └── teams_alert.py                    # Alerting module for MS Teams notifications
├── firestore_master.py                   # DAG for Firestore -> GCS -> BigQuery pipelines
├── nach_pull_airbyte_dbt.py              # DAG for NACH pull orchestration
├── power_automate_airbyte_insurance.py   # DAG triggered by Power Automate webhooks
├── ssh_dbt_mgpayment_staging.py          # DAG for executing MGPayment transformations
└── teleaccess_master.py                  # Master DAG for Teleaccess data processing
```
Follow these steps to deploy or run the DAGs in your local/dev Airflow environment:
Open your terminal and clone the repository, then navigate into the project directory:
```bash
git clone https://github.com/mpandey95/airflow-data-pipelines.git
cd airflow-data-pipelines
```

Create the following connections in your Airflow UI:

- `gcp_prod`: Google Cloud connection for GCS and BigQuery.
- `airbyte_prod`: Connection pointing to your Airbyte instance.
- `ssh_dbt_vm`: SSH connection configured for your remote dbt transformation server.
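If you prefer the CLI, the same connections can be created with `airflow connections add`; hosts, ports, and key paths below are placeholders (and the exact extras keys for the GCP connection vary slightly across Airflow 2.x versions):

```bash
# Placeholder hosts/paths; adjust to your environment.
airflow connections add gcp_prod \
    --conn-type google_cloud_platform \
    --conn-extra '{"extra__google_cloud_platform__key_path": "/path/to/sa.json", "extra__google_cloud_platform__project": "your-gcp-project"}'

airflow connections add airbyte_prod \
    --conn-type airbyte \
    --conn-host airbyte.internal.example.com \
    --conn-port 8001

airflow connections add ssh_dbt_vm \
    --conn-type ssh \
    --conn-host dbt-vm.internal.example.com \
    --conn-login airflow \
    --conn-extra '{"key_file": "/path/to/ssh/key"}'
```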
Copy the Python files to your `$AIRFLOW_HOME/dags` directory to make them visible to the Airflow scheduler:

```bash
cp *.py $AIRFLOW_HOME/dags/
cp -r include/ $AIRFLOW_HOME/dags/
```

This repository leverages Google Cloud Build for automated CI/CD. The `cloudbuild.yaml` defines a pull/copy deployment workflow.
- Ensure your Cloud Build trigger defines the following substitution variables in GCP:
  - `_VM_NAME`: Target Airflow VM.
  - `_VM_ZONE`: GCP zone of the VM.
- When triggered, the pipeline:
  - Cleans the staging directory on the VM (`/root/airflow/cicd-staging`).
  - Uses `gcloud compute scp` to copy new DAGs to the remote machine.
  - Moves files from staging to the active `/root/airflow/dags` directory.
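A minimal `cloudbuild.yaml` implementing that clean/copy/promote sequence could look like the sketch below; the step ordering mirrors the list above, but the image, flags, and shell commands are illustrative rather than a copy of the repository's actual config:

```yaml
steps:
  # 1. Clean (and recreate) the staging directory on the Airflow VM
  - name: gcr.io/google.com/cloudsdktool/cloud-sdk
    entrypoint: gcloud
    args:
      - compute
      - ssh
      - ${_VM_NAME}
      - --zone=${_VM_ZONE}
      - --command=rm -rf /root/airflow/cicd-staging && mkdir -p /root/airflow/cicd-staging

  # 2. Copy the repo's DAG files into staging
  - name: gcr.io/google.com/cloudsdktool/cloud-sdk
    entrypoint: gcloud
    args:
      - compute
      - scp
      - --recurse
      - --zone=${_VM_ZONE}
      - .
      - ${_VM_NAME}:/root/airflow/cicd-staging

  # 3. Promote staged files into the live DAGs directory
  - name: gcr.io/google.com/cloudsdktool/cloud-sdk
    entrypoint: gcloud
    args:
      - compute
      - ssh
      - ${_VM_NAME}
      - --zone=${_VM_ZONE}
      - --command=cp -r /root/airflow/cicd-staging/* /root/airflow/dags/
```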
You can manually run the Cloud Build pipeline via the CLI:

```bash
gcloud builds submit --config cloudbuild.yaml . \
    --substitutions=_VM_NAME="your-vm-name",_VM_ZONE="your-vm-zone"
```

Manish Pandey — Senior DevOps/Platform Engineer
- GitHub: @mpandey95
- LinkedIn: manish-pandey95
- Email: mnshkmrpnd@gmail.com