mihirkudale/Airbnb_Snowflake_DBT_Data_Engineer_Project

🏠 Airbnb End-to-End Data Engineering Project

📋 Overview

This project implements a complete end-to-end data engineering pipeline for Airbnb data using modern cloud technologies. The solution demonstrates best practices in data warehousing, transformation, and analytics using Snowflake, dbt (Data Build Tool), and AWS.

The pipeline processes Airbnb listings, bookings, and hosts data through a medallion architecture (Bronze → Silver → Gold), implementing incremental loading, slowly changing dimensions (SCD Type 2), and creating analytics-ready datasets.

πŸ—οΈ Architecture

Data Flow

Source Data (CSV) → AWS S3 → Snowflake (Staging) → Bronze Layer → Silver Layer → Gold Layer
                                                        ↓              ↓             ↓
                                                   Raw Tables    Cleaned Data    Analytics

Technology Stack

  • Cloud Data Warehouse: Snowflake
  • Transformation Layer: dbt (Data Build Tool)
  • Cloud Storage: AWS S3 (implied)
  • Version Control: Git
  • Python: 3.12+
  • Key dbt Features:
    • Incremental models
    • Snapshots (SCD Type 2)
    • Custom macros
    • Jinja templating
    • Testing and documentation

📊 Data Model

Medallion Architecture

🥉 Bronze Layer (Raw Data)

Raw data ingested from staging with minimal transformations:

  • bronze_bookings - Raw booking transactions
  • bronze_hosts - Raw host information
  • bronze_listings - Raw property listings

🥈 Silver Layer (Cleaned Data)

Cleaned and standardized data:

  • silver_bookings - Validated booking records
  • silver_hosts - Enhanced host profiles with quality metrics
  • silver_listings - Standardized listing information with price categorization

🥇 Gold Layer (Analytics-Ready)

Business-ready datasets optimized for analytics:

  • obt (One Big Table) - Denormalized fact table joining bookings, listings, and hosts
  • fact - Fact table for dimensional modeling
  • Ephemeral models for intermediate transformations

Snapshots (SCD Type 2)

Slowly Changing Dimensions to track historical changes:

  • dim_bookings - Historical booking changes
  • dim_hosts - Historical host profile changes
  • dim_listings - Historical listing changes

πŸ“ Project Structure

AWS_DBT_Snowflake/
├── README.md                           # This file
├── pyproject.toml                      # Python dependencies
├── main.py                             # Main execution script
│
├── SourceData/                         # Raw CSV data files
│   ├── bookings.csv
│   ├── hosts.csv
│   └── listings.csv
│
├── DDL/                                # Database schema definitions
│   ├── ddl.sql                         # Table creation scripts
│   └── resources.sql
│
└── aws_dbt_snowflake_project/          # Main dbt project
    ├── dbt_project.yml                 # dbt project configuration
    ├── ExampleProfiles.yml             # Snowflake connection profile
    │
    ├── models/                         # dbt models
    │   ├── sources/
    │   │   └── sources.yml             # Source definitions
    │   ├── bronze/                     # Raw data layer
    │   │   ├── bronze_bookings.sql
    │   │   ├── bronze_hosts.sql
    │   │   └── bronze_listings.sql
    │   ├── silver/                     # Cleaned data layer
    │   │   ├── silver_bookings.sql
    │   │   ├── silver_hosts.sql
    │   │   └── silver_listings.sql
    │   └── gold/                       # Analytics layer
    │       ├── fact.sql
    │       ├── obt.sql
    │       └── ephemeral/              # Ephemeral models (inlined as CTEs)
    │           ├── bookings.sql
    │           ├── hosts.sql
    │           └── listings.sql
    │
    ├── macros/                         # Reusable SQL functions
    │   ├── generate_schema_name.sql    # Custom schema naming
    │   ├── multiply.sql                # Math operations
    │   ├── tag.sql                     # Categorization logic
    │   └── trimmer.sql                 # String utilities
    │
    ├── analyses/                       # Ad-hoc analysis queries
    │   ├── explore.sql
    │   ├── if_else.sql
    │   └── loop.sql
    │
    ├── snapshots/                      # SCD Type 2 configurations
    │   ├── dim_bookings.yml
    │   ├── dim_hosts.yml
    │   └── dim_listings.yml
    │
    ├── tests/                          # Data quality tests
    │   └── source_tests.sql
    │
    └── seeds/                          # Static reference data

🚀 Getting Started

Prerequisites

  1. Snowflake Account (create one if you don't already have it)

  2. Python Environment

    • Python 3.12 or higher
    • pip or uv package manager
  3. AWS Account (for S3 storage; create one if you don't already have it)

Installation

  1. Clone the Repository

    git clone <repository-url>
    cd AWS_DBT_Snowflake
  2. Create Virtual Environment

    python -m venv .venv
    .venv\Scripts\Activate.ps1  # Windows PowerShell
    # or
    source .venv/bin/activate    # Linux/Mac
  3. Install Dependencies

    pip install -r requirements.txt
    # or using pyproject.toml
    pip install -e .

    Core Dependencies:

    • dbt-core>=1.11.2
    • dbt-snowflake>=1.11.0
    • sqlfmt>=0.0.3
  4. Configure Snowflake Connection

    Create ~/.dbt/profiles.yml:

    aws_dbt_snowflake_project:
      outputs:
        dev:
          account: <your-account-identifier>
          database: AIRBNB
          password: <your-password>
          role: ACCOUNTADMIN
          schema: dbt_schema
          threads: 4
          type: snowflake
          user: <your-username>
          warehouse: COMPUTE_WH
      target: dev
  5. Set Up Snowflake Database

    Run the DDL scripts to create tables:

    # Execute DDL/ddl.sql in Snowflake to create staging tables
  6. Load Source Data

    Load CSV files from SourceData/ to Snowflake staging schema:

    • bookings.csv → AIRBNB.STAGING.BOOKINGS
    • hosts.csv → AIRBNB.STAGING.HOSTS
    • listings.csv → AIRBNB.STAGING.LISTINGS
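One way to perform this load is with a Snowflake external stage and COPY INTO. The sketch below is illustrative only; the bucket path, stage, and file-format names are assumptions, not taken from the repo, and the stage would also need a STORAGE_INTEGRATION or credentials to reach S3:

```sql
-- Illustrative load for one table (stage/file-format names assumed)
CREATE OR REPLACE FILE FORMAT AIRBNB.STAGING.CSV_FORMAT
    TYPE = CSV
    FIELD_OPTIONALLY_ENCLOSED_BY = '"'
    SKIP_HEADER = 1;

CREATE OR REPLACE STAGE AIRBNB.STAGING.AIRBNB_STAGE
    URL = 's3://<your-bucket>/airbnb/'
    FILE_FORMAT = AIRBNB.STAGING.CSV_FORMAT;
    -- add STORAGE_INTEGRATION = <your-integration> or credentials

COPY INTO AIRBNB.STAGING.BOOKINGS
    FROM @AIRBNB.STAGING.AIRBNB_STAGE/bookings.csv;
```

Repeat the COPY INTO for hosts.csv and listings.csv, or load the files through the Snowsight UI instead.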

🔧 Usage

Running dbt Commands

  1. Test Connection

    cd aws_dbt_snowflake_project
    dbt debug
  2. Install Dependencies

    dbt deps
  3. Run All Models

    dbt run
  4. Run Specific Layer

    dbt run --select bronze.*      # Run bronze models only
    dbt run --select silver.*      # Run silver models only
    dbt run --select gold.*        # Run gold models only
  5. Run Tests

    dbt test
  6. Run Snapshots

    dbt snapshot
  7. Generate Documentation

    dbt docs generate
    dbt docs serve
  8. Build Everything

    dbt build  # Runs seeds, models, snapshots, and tests in dependency order

🎯 Key Features

1. Incremental Loading

Bronze and silver models use incremental materialization to process only new/changed data:

{{ config(materialized='incremental') }}
{% if is_incremental() %}
    WHERE CREATED_AT > (SELECT COALESCE(MAX(CREATED_AT), '1900-01-01') FROM {{ this }})
{% endif %}
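Put together, a minimal incremental bronze model might look like the sketch below. Column names and the source definition are assumptions for illustration; the repo's actual models may differ:

```sql
-- models/bronze/bronze_bookings.sql (illustrative sketch; column names assumed)
{{ config(materialized='incremental', unique_key='BOOKING_ID') }}

SELECT
    BOOKING_ID,
    LISTING_ID,
    BOOKING_DATE,
    TOTAL_PRICE,
    CREATED_AT
FROM {{ source('staging', 'BOOKINGS') }}

{% if is_incremental() %}
-- On incremental runs, only pull rows newer than what the target already holds
WHERE CREATED_AT > (SELECT COALESCE(MAX(CREATED_AT), '1900-01-01') FROM {{ this }})
{% endif %}
```

On the first run (or with `--full-refresh`) the `is_incremental()` block is skipped and the full table is built; subsequent runs append only new rows.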

2. Custom Macros

Reusable business logic:

  • tag() macro: Categorizes prices into 'low', 'medium', 'high'
    {{ tag('CAST(PRICE_PER_NIGHT AS INT)') }} AS PRICE_PER_NIGHT_TAG
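A macro like this could be defined as shown below. The thresholds here are assumptions for illustration; the repo's tag.sql may use different cut-offs:

```sql
-- macros/tag.sql (illustrative sketch; price thresholds assumed)
{% macro tag(price_expr) %}
    CASE
        WHEN {{ price_expr }} < 100 THEN 'low'
        WHEN {{ price_expr }} < 250 THEN 'medium'
        ELSE 'high'
    END
{% endmacro %}
```

Because the macro expands to a plain CASE expression at compile time, it can be reused in any model's SELECT list.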

3. Dynamic SQL Generation

The OBT (One Big Table) model uses Jinja loops for maintainable joins:

{% set configs = [...] %}
SELECT {% for config in configs %}...{% endfor %}
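A concrete version of that pattern might look like this. The column list, aliases, and join keys are assumptions for illustration, not the repo's actual obt.sql:

```sql
-- models/gold/obt.sql (illustrative sketch; columns and join keys assumed)
{% set columns = [
    'b.BOOKING_ID', 'b.BOOKING_DATE', 'b.TOTAL_PRICE',
    'l.LISTING_NAME', 'l.PRICE_PER_NIGHT_TAG',
    'h.HOST_NAME', 'h.IS_SUPERHOST'
] %}

SELECT
    {%- for col in columns %}
    {{ col }}{{ "," if not loop.last }}
    {%- endfor %}
FROM {{ ref('silver_bookings') }} b
JOIN {{ ref('silver_listings') }} l ON b.LISTING_ID = l.LISTING_ID
JOIN {{ ref('silver_hosts') }} h ON l.HOST_ID = h.HOST_ID
```

Adding a column to the denormalized table then means editing one list rather than hand-maintaining a long SELECT clause.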

4. Slowly Changing Dimensions

Track historical changes with timestamp-based snapshots:

  • Valid from/to dates automatically maintained
  • Historical data preserved for point-in-time analysis
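Since the repo defines snapshots as YAML files, one of them might look like the sketch below (this YAML snapshot format requires dbt 1.9+; the key and timestamp column names are assumptions):

```yaml
# snapshots/dim_listings.yml (illustrative sketch; column names assumed)
snapshots:
  - name: dim_listings
    relation: ref('silver_listings')
    config:
      schema: SNAPSHOTS
      unique_key: LISTING_ID
      strategy: timestamp
      updated_at: UPDATED_AT
```

dbt maintains `dbt_valid_from` / `dbt_valid_to` columns automatically, so a row's full change history is queryable for point-in-time analysis.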

5. Schema Organization

Automatic schema separation by layer:

  • Bronze models → AIRBNB.BRONZE.*
  • Silver models → AIRBNB.SILVER.*
  • Gold models → AIRBNB.GOLD.*
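By default dbt would prefix custom schemas with the target schema (e.g. dbt_schema_BRONZE). The generate_schema_name.sql macro in this project most likely overrides that behavior; a common version of the override looks like this:

```sql
-- macros/generate_schema_name.sql (a common override; the repo's exact logic may differ)
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```

Paired with `+schema: BRONZE` (and SILVER/GOLD) settings in dbt_project.yml, this yields clean layer schemas like AIRBNB.BRONZE instead of AIRBNB.dbt_schema_BRONZE.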

📈 Data Quality

Testing Strategy

  • Source data validation tests
  • Unique key constraints
  • Not null checks
  • Referential integrity tests
  • Custom business rule tests
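In dbt, most of these checks are declared in YAML next to the source or model definitions. The fragment below is a hedged sketch of what sources.yml could contain; the column names are assumptions:

```yaml
# models/sources/sources.yml (illustrative sketch; column names assumed)
version: 2

sources:
  - name: staging
    database: AIRBNB
    schema: STAGING
    tables:
      - name: BOOKINGS
        columns:
          - name: BOOKING_ID
            tests:
              - unique
              - not_null
          - name: LISTING_ID
            tests:
              - relationships:
                  to: source('staging', 'LISTINGS')
                  field: LISTING_ID
```

Running `dbt test` then compiles each declaration into a SQL query that fails if any offending rows are found; custom business rules live as singular SQL tests in the tests/ directory.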

Data Lineage

dbt automatically tracks data lineage, showing:

  • Upstream dependencies
  • Downstream impacts
  • Model relationships
  • Source to consumption flow

πŸ” Security & Best Practices

  1. Credentials Management

    • Never commit profiles.yml with credentials
    • Use environment variables for sensitive data
    • Implement role-based access control (RBAC) in Snowflake
  2. Code Quality

    • SQL formatting with sqlfmt
    • Version control with Git
    • Code reviews for model changes
  3. Performance Optimization

    • Incremental models for large datasets
    • Ephemeral models for intermediate transformations
    • Appropriate clustering keys in Snowflake

📚 Additional Resources

  • dbt documentation: https://docs.getdbt.com
  • Snowflake documentation: https://docs.snowflake.com

👤 Author

Mihir Kudale (GitHub: mihirkudale)
Project: Airbnb Data Engineering Pipeline
Technologies: Snowflake, dbt, AWS, Python

πŸ› Troubleshooting

Common Issues

  1. Connection Error

    • Verify Snowflake credentials in profiles.yml
    • Check network connectivity
    • Ensure warehouse is running
  2. Compilation Error

    • Run dbt debug to check configuration
    • Verify model dependencies
    • Check Jinja syntax
  3. Incremental Load Issues

    • Run dbt run --full-refresh to rebuild from scratch
    • Verify source data timestamps

📊 Future Enhancements

  • Add data quality dashboards
  • Implement CI/CD pipeline
  • Add more complex business metrics
  • Integrate with BI tools (Tableau/Power BI)
  • Add alerting and monitoring
  • Implement data masking for PII
  • Add more comprehensive testing suite
