This project implements a complete end-to-end data engineering pipeline for Airbnb data using modern cloud technologies. The solution demonstrates best practices in data warehousing, transformation, and analytics using Snowflake, dbt (Data Build Tool), and AWS.
The pipeline processes Airbnb listings, bookings, and hosts data through a medallion architecture (Bronze → Silver → Gold), implementing incremental loading, slowly changing dimensions (SCD Type 2), and creating analytics-ready datasets.
```
Source Data (CSV) → AWS S3 → Snowflake (Staging) → Bronze Layer → Silver Layer → Gold Layer
                                                   (Raw Tables)   (Cleaned Data)  (Analytics)
```
- Cloud Data Warehouse: Snowflake
- Transformation Layer: dbt (Data Build Tool)
- Cloud Storage: AWS S3 (implied)
- Version Control: Git
- Python: 3.12+
- Key dbt Features:
- Incremental models
- Snapshots (SCD Type 2)
- Custom macros
- Jinja templating
- Testing and documentation
Raw data ingested from staging with minimal transformations:
- `bronze_bookings` - Raw booking transactions
- `bronze_hosts` - Raw host information
- `bronze_listings` - Raw property listings
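Bronze models read from source definitions declared in `models/sources/sources.yml` via dbt's `source()` function. A minimal sketch of such a declaration, assuming the staging tables created in the DDL step; the project's actual file may differ:

```yaml
# Illustrative sketch of models/sources/sources.yml (names assumed, not verified)
version: 2

sources:
  - name: staging
    database: AIRBNB
    schema: STAGING
    tables:
      - name: bookings
      - name: hosts
      - name: listings
```

A bronze model can then reference a table as `{{ source('staging', 'bookings') }}` instead of hard-coding the database and schema.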
Cleaned and standardized data:
- `silver_bookings` - Validated booking records
- `silver_hosts` - Enhanced host profiles with quality metrics
- `silver_listings` - Standardized listing information with price categorization
Business-ready datasets optimized for analytics:
- `obt` (One Big Table) - Denormalized fact table joining bookings, listings, and hosts
- `fact` - Fact table for dimensional modeling
- Ephemeral models for intermediate transformations
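The ephemeral models mentioned above use dbt's `ephemeral` materialization, so they compile into CTEs of the models that reference them rather than becoming objects in Snowflake. A minimal illustration (not one of the project's actual files):

```sql
-- Illustrative only: an ephemeral model is never created in the warehouse;
-- dbt inlines it as a CTE wherever it is ref()'d downstream.
{{ config(materialized='ephemeral') }}

select * from {{ ref('silver_bookings') }}
```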
Slowly Changing Dimensions to track historical changes:
- `dim_bookings` - Historical booking changes
- `dim_hosts` - Historical host profile changes
- `dim_listings` - Historical listing changes
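Since the snapshots live in `snapshots/*.yml`, they presumably use dbt's YAML-defined snapshot syntax (dbt 1.9+). A hedged sketch of what one could look like; the unique key, target schema, and `updated_at` column here are assumptions:

```yaml
# Illustrative sketch of snapshots/dim_listings.yml (keys and columns assumed)
snapshots:
  - name: dim_listings
    relation: ref('silver_listings')
    config:
      schema: gold
      unique_key: listing_id
      strategy: timestamp
      updated_at: updated_at
```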
```
AWS_DBT_Snowflake/
├── README.md                          # This file
├── pyproject.toml                     # Python dependencies
├── main.py                            # Main execution script
│
├── SourceData/                        # Raw CSV data files
│   ├── bookings.csv
│   ├── hosts.csv
│   └── listings.csv
│
├── DDL/                               # Database schema definitions
│   ├── ddl.sql                        # Table creation scripts
│   └── resources.sql
│
└── aws_dbt_snowflake_project/         # Main dbt project
    ├── dbt_project.yml                # dbt project configuration
    ├── ExampleProfiles.yml            # Snowflake connection profile
    │
    ├── models/                        # dbt models
    │   ├── sources/
    │   │   └── sources.yml            # Source definitions
    │   ├── bronze/                    # Raw data layer
    │   │   ├── bronze_bookings.sql
    │   │   ├── bronze_hosts.sql
    │   │   └── bronze_listings.sql
    │   ├── silver/                    # Cleaned data layer
    │   │   ├── silver_bookings.sql
    │   │   ├── silver_hosts.sql
    │   │   └── silver_listings.sql
    │   └── gold/                      # Analytics layer
    │       ├── fact.sql
    │       ├── obt.sql
    │       └── ephemeral/             # Temporary models
    │           ├── bookings.sql
    │           ├── hosts.sql
    │           └── listings.sql
    │
    ├── macros/                        # Reusable SQL functions
    │   ├── generate_schema_name.sql   # Custom schema naming
    │   ├── multiply.sql               # Math operations
    │   ├── tag.sql                    # Categorization logic
    │   └── trimmer.sql                # String utilities
    │
    ├── analyses/                      # Ad-hoc analysis queries
    │   ├── explore.sql
    │   ├── if_else.sql
    │   └── loop.sql
    │
    ├── snapshots/                     # SCD Type 2 configurations
    │   ├── dim_bookings.yml
    │   ├── dim_hosts.yml
    │   └── dim_listings.yml
    │
    ├── tests/                         # Data quality tests
    │   └── source_tests.sql
    │
    └── seeds/                         # Static reference data
```
- **Snowflake Account** (create one if you don't already have one)
- **Python Environment**
  - Python 3.12 or higher
  - pip or uv package manager
- **AWS Account** (for S3 storage; create one if you don't already have one)
- **Clone the Repository**

  ```bash
  git clone <repository-url>
  cd AWS_DBT_Snowflake
  ```

- **Create Virtual Environment**

  ```bash
  python -m venv .venv
  .venv\Scripts\Activate.ps1   # Windows PowerShell
  # or
  source .venv/bin/activate    # Linux/Mac
  ```

- **Install Dependencies**

  ```bash
  pip install -r requirements.txt
  # or using pyproject.toml
  pip install -e .
  ```

  Core dependencies:
  - `dbt-core>=1.11.2`
  - `dbt-snowflake>=1.11.0`
  - `sqlfmt>=0.0.3`
- **Configure Snowflake Connection**

  Create `~/.dbt/profiles.yml`:

  ```yaml
  aws_dbt_snowflake_project:
    outputs:
      dev:
        account: <your-account-identifier>
        database: AIRBNB
        password: <your-password>
        role: ACCOUNTADMIN
        schema: dbt_schema
        threads: 4
        type: snowflake
        user: <your-username>
        warehouse: COMPUTE_WH
    target: dev
  ```
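  To keep the password out of the file, dbt's built-in `env_var()` function can be used instead of a literal value (a common pattern, not something this project mandates):

  ```yaml
  # Read the password from an environment variable instead of hard-coding it
  password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
  ```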
- **Set Up Snowflake Database**

  Run the DDL scripts to create the staging tables:

  ```bash
  # Execute DDL/ddl.sql in Snowflake to create staging tables
  ```

- **Load Source Data**

  Load the CSV files from `SourceData/` into the Snowflake staging schema:
  - `bookings.csv` → `AIRBNB.STAGING.BOOKINGS`
  - `hosts.csv` → `AIRBNB.STAGING.HOSTS`
  - `listings.csv` → `AIRBNB.STAGING.LISTINGS`
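  One way to perform this load (a sketch only; the external stage name and file format options below are assumptions, and the files can equally be loaded through the Snowflake web UI):

  ```sql
  -- Illustrative load from an S3-backed external stage (stage name is hypothetical)
  COPY INTO AIRBNB.STAGING.BOOKINGS
  FROM @AIRBNB.STAGING.AIRBNB_S3_STAGE/bookings.csv
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"');
  ```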
- **Test Connection**

  ```bash
  cd aws_dbt_snowflake_project
  dbt debug
  ```

- **Install Dependencies**

  ```bash
  dbt deps
  ```

- **Run All Models**

  ```bash
  dbt run
  ```

- **Run Specific Layer**

  ```bash
  dbt run --select bronze.*   # Run bronze models only
  dbt run --select silver.*   # Run silver models only
  dbt run --select gold.*     # Run gold models only
  ```

- **Run Tests**

  ```bash
  dbt test
  ```

- **Run Snapshots**

  ```bash
  dbt snapshot
  ```

- **Generate Documentation**

  ```bash
  dbt docs generate
  dbt docs serve
  ```

- **Build Everything**

  ```bash
  dbt build   # Runs models, tests, and snapshots
  ```
Bronze and silver models use incremental materialization to process only new/changed data:
```sql
{{ config(materialized='incremental') }}

{% if is_incremental() %}
WHERE CREATED_AT > (SELECT COALESCE(MAX(CREATED_AT), '1900-01-01') FROM {{ this }})
{% endif %}
```

Reusable business logic:
- `tag()` macro: Categorizes prices into 'low', 'medium', 'high'

  ```sql
  {{ tag('CAST(PRICE_PER_NIGHT AS INT)') }} AS PRICE_PER_NIGHT_TAG
  ```
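As a rough idea of how such a macro works, here is a hypothetical sketch; the real `macros/tag.sql` and its price thresholds may differ:

```sql
-- Hypothetical version of macros/tag.sql; thresholds are illustrative only
{% macro tag(price_expr) %}
    case
        when {{ price_expr }} < 100 then 'low'
        when {{ price_expr }} < 250 then 'medium'
        else 'high'
    end
{% endmacro %}
```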
The OBT (One Big Table) model uses Jinja loops for maintainable joins:

```sql
{% set configs = [...] %}
SELECT {% for config in configs %}...{% endfor %}
```
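Expanded slightly, the pattern looks roughly like the sketch below. The model names, aliases, and join keys here are assumptions for illustration, not the project's actual `obt.sql`:

```sql
-- Illustrative sketch of looping over a join config list
{% set joins = [
    {'model': 'silver_listings', 'alias': 'l', 'key': 'listing_id'},
    {'model': 'silver_hosts',    'alias': 'h', 'key': 'host_id'}
] %}

select
    b.*
    {% for j in joins %}
    , {{ j.alias }}.{{ j.key }} as {{ j.model }}_{{ j.key }}
    {% endfor %}
from {{ ref('silver_bookings') }} as b
{% for j in joins %}
left join {{ ref(j.model) }} as {{ j.alias }}
    on b.{{ j.key }} = {{ j.alias }}.{{ j.key }}
{% endfor %}
```

Adding another joined model then becomes a one-line change to the `joins` list rather than edits in several places.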
Track historical changes with timestamp-based snapshots:
- Valid from/to dates automatically maintained
- Historical data preserved for point-in-time analysis
Automatic schema separation by layer:
- Bronze models → `AIRBNB.BRONZE.*`
- Silver models → `AIRBNB.SILVER.*`
- Gold models → `AIRBNB.GOLD.*`
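The layer-based schemas come from overriding dbt's built-in `generate_schema_name` macro (see `macros/generate_schema_name.sql`); by default dbt would prefix custom schemas with the target schema. A typical form of the override, shown as a sketch since the project's version may differ:

```sql
-- Common generate_schema_name override: use the custom schema name as-is
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```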
- Source data validation tests
- Unique key constraints
- Not null checks
- Referential integrity tests
- Custom business rule tests
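In dbt, generic tests such as `unique` and `not_null` are declared in YAML next to the models, while singular tests like `tests/source_tests.sql` are plain SQL files that return the rows violating a rule. An illustrative YAML snippet; the model and column names are assumptions:

```yaml
# Illustrative generic tests; the project's actual test definitions may differ
version: 2

models:
  - name: silver_bookings
    columns:
      - name: booking_id
        tests:
          - unique
          - not_null
      - name: listing_id
        tests:
          - relationships:
              to: ref('silver_listings')
              field: listing_id
```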
dbt automatically tracks data lineage, showing:
- Upstream dependencies
- Downstream impacts
- Model relationships
- Source to consumption flow
- **Credentials Management**
  - Never commit `profiles.yml` with credentials
  - Use environment variables for sensitive data
  - Implement role-based access control (RBAC) in Snowflake
- **Code Quality**
  - SQL formatting with `sqlfmt`
  - Version control with Git
  - Code reviews for model changes
- **Performance Optimization**
  - Incremental models for large datasets
  - Ephemeral models for intermediate transformations
  - Appropriate clustering keys in Snowflake (see the config sketch below)
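For the clustering point above, dbt-snowflake exposes a `cluster_by` model config; a minimal sketch with an assumed column name:

```sql
-- Illustrative: cluster the incremental table on the column used for filtering
{{ config(
    materialized='incremental',
    cluster_by=['created_at']
) }}
```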
- dbt Documentation: https://docs.getdbt.com/
- Snowflake Documentation: https://docs.snowflake.com/
- dbt Best Practices: https://docs.getdbt.com/guides/best-practices
Project: Airbnb Data Engineering Pipeline
Technologies: Snowflake, dbt, AWS, Python
- **Connection Error**
  - Verify Snowflake credentials in `profiles.yml`
  - Check network connectivity
  - Ensure warehouse is running
- **Compilation Error**
  - Run `dbt debug` to check configuration
  - Verify model dependencies
  - Check Jinja syntax
- **Incremental Load Issues**
  - Run `dbt run --full-refresh` to rebuild from scratch
  - Verify source data timestamps
- Add data quality dashboards
- Implement CI/CD pipeline
- Add more complex business metrics
- Integrate with BI tools (Tableau/Power BI)
- Add alerting and monitoring
- Implement data masking for PII
- Add more comprehensive testing suite