mihirkudale/Airbnb_Snowflake_DBT_Data_Engineer_Project

🏠 Airbnb End-to-End Data Engineering Project

📋 Overview

This project implements a complete end-to-end data engineering pipeline for Airbnb data using modern cloud technologies. The solution demonstrates best practices in data warehousing, transformation, and analytics using Snowflake, dbt (Data Build Tool), and AWS.

The pipeline processes Airbnb listings, bookings, and hosts data through a medallion architecture (Bronze → Silver → Gold), implementing incremental loading, slowly changing dimensions (SCD Type 2), and creating analytics-ready datasets.

πŸ—οΈ Architecture

Data Flow

Source Data (CSV) → AWS S3 → Snowflake (Staging) → Bronze Layer → Silver Layer → Gold Layer
                                                        ↓              ↓             ↓
                                                   Raw Tables    Cleaned Data    Analytics

Technology Stack

  • Cloud Data Warehouse: Snowflake
  • Transformation Layer: dbt (Data Build Tool)
  • Cloud Storage: AWS S3 (implied)
  • Version Control: Git
  • Python: 3.12+
  • Key dbt Features:
    • Incremental models
    • Snapshots (SCD Type 2)
    • Custom macros
    • Jinja templating
    • Testing and documentation

📊 Data Model

Medallion Architecture

🥉 Bronze Layer (Raw Data)

Raw data ingested from staging with minimal transformations:

  • bronze_bookings - Raw booking transactions
  • bronze_hosts - Raw host information
  • bronze_listings - Raw property listings

🥈 Silver Layer (Cleaned Data)

Cleaned and standardized data:

  • silver_bookings - Validated booking records
  • silver_hosts - Enhanced host profiles with quality metrics
  • silver_listings - Standardized listing information with price categorization

🥇 Gold Layer (Analytics-Ready)

Business-ready datasets optimized for analytics:

  • obt (One Big Table) - Denormalized fact table joining bookings, listings, and hosts
  • fact - Fact table for dimensional modeling
  • Ephemeral models for intermediate transformations

Snapshots (SCD Type 2)

Slowly Changing Dimensions to track historical changes:

  • dim_bookings - Historical booking changes
  • dim_hosts - Historical host profile changes
  • dim_listings - Historical listing changes

πŸ“ Project Structure

AWS_DBT_Snowflake/
├── README.md                           # This file
├── pyproject.toml                      # Python dependencies
├── main.py                             # Main execution script
│
├── SourceData/                         # Raw CSV data files
│   ├── bookings.csv
│   ├── hosts.csv
│   └── listings.csv
│
├── DDL/                                # Database schema definitions
│   ├── ddl.sql                         # Table creation scripts
│   └── resources.sql
│
└── aws_dbt_snowflake_project/          # Main dbt project
    ├── dbt_project.yml                 # dbt project configuration
    ├── ExampleProfiles.yml             # Snowflake connection profile
    │
    ├── models/                         # dbt models
    │   ├── sources/
    │   │   └── sources.yml             # Source definitions
    │   ├── bronze/                     # Raw data layer
    │   │   ├── bronze_bookings.sql
    │   │   ├── bronze_hosts.sql
    │   │   └── bronze_listings.sql
    │   ├── silver/                     # Cleaned data layer
    │   │   ├── silver_bookings.sql
    │   │   ├── silver_hosts.sql
    │   │   └── silver_listings.sql
    │   └── gold/                       # Analytics layer
    │       ├── fact.sql
    │       ├── obt.sql
    │       └── ephemeral/              # Ephemeral models (inlined as CTEs)
    │           ├── bookings.sql
    │           ├── hosts.sql
    │           └── listings.sql
    │
    ├── macros/                         # Reusable SQL functions
    │   ├── generate_schema_name.sql    # Custom schema naming
    │   ├── multiply.sql                # Math operations
    │   ├── tag.sql                     # Categorization logic
    │   └── trimmer.sql                 # String utilities
    │
    ├── analyses/                       # Ad-hoc analysis queries
    │   ├── explore.sql
    │   ├── if_else.sql
    │   └── loop.sql
    │
    ├── snapshots/                      # SCD Type 2 configurations
    │   ├── dim_bookings.yml
    │   ├── dim_hosts.yml
    │   └── dim_listings.yml
    │
    ├── tests/                          # Data quality tests
    │   └── source_tests.sql
    │
    └── seeds/                          # Static reference data

🚀 Getting Started

Prerequisites

  1. Snowflake Account (create one if you don't already have it)

  2. Python Environment

    • Python 3.12 or higher
    • pip or uv package manager
  3. AWS Account (for S3 storage; create one if you don't already have it)

Installation

  1. Clone the Repository

    git clone <repository-url>
    cd AWS_DBT_Snowflake
  2. Create Virtual Environment

    python -m venv .venv
    .venv\Scripts\Activate.ps1  # Windows PowerShell
    # or
    source .venv/bin/activate    # Linux/Mac
  3. Install Dependencies

    pip install -r requirements.txt
    # or using pyproject.toml
    pip install -e .

    Core Dependencies:

    • dbt-core>=1.11.2
    • dbt-snowflake>=1.11.0
    • sqlfmt>=0.0.3
  4. Configure Snowflake Connection

    Create ~/.dbt/profiles.yml:

    aws_dbt_snowflake_project:
      outputs:
        dev:
          account: <your-account-identifier>
          database: AIRBNB
          password: <your-password>
          role: ACCOUNTADMIN
          schema: dbt_schema
          threads: 4
          type: snowflake
          user: <your-username>
          warehouse: COMPUTE_WH
      target: dev
  5. Set Up Snowflake Database

    Run the DDL scripts to create tables:

    # Execute DDL/ddl.sql in Snowflake to create staging tables
  6. Load Source Data

    Load CSV files from SourceData/ to Snowflake staging schema:

    • bookings.csv → AIRBNB.STAGING.BOOKINGS
    • hosts.csv → AIRBNB.STAGING.HOSTS
    • listings.csv → AIRBNB.STAGING.LISTINGS
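One way to perform this load is with a Snowflake external stage and COPY INTO. The sketch below is illustrative only; the bucket path, stage, and file-format names are assumptions, not taken from the repo, and the stage would also need a STORAGE_INTEGRATION or credentials to reach S3:

```sql
-- Illustrative load for one table (stage/file-format names assumed)
CREATE OR REPLACE FILE FORMAT AIRBNB.STAGING.CSV_FORMAT
    TYPE = CSV
    FIELD_OPTIONALLY_ENCLOSED_BY = '"'
    SKIP_HEADER = 1;

CREATE OR REPLACE STAGE AIRBNB.STAGING.AIRBNB_STAGE
    URL = 's3://<your-bucket>/airbnb/'
    FILE_FORMAT = AIRBNB.STAGING.CSV_FORMAT;
    -- add STORAGE_INTEGRATION = <your-integration> or credentials

COPY INTO AIRBNB.STAGING.BOOKINGS
    FROM @AIRBNB.STAGING.AIRBNB_STAGE/bookings.csv;
```

Repeat the COPY INTO for hosts.csv and listings.csv, or load the files through the Snowsight UI instead.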

🔧 Usage

Running dbt Commands

  1. Test Connection

    cd aws_dbt_snowflake_project
    dbt debug
  2. Install Dependencies

    dbt deps
  3. Run All Models

    dbt run
  4. Run Specific Layer

    dbt run --select bronze.*      # Run bronze models only
    dbt run --select silver.*      # Run silver models only
    dbt run --select gold.*        # Run gold models only
  5. Run Tests

    dbt test
  6. Run Snapshots

    dbt snapshot
  7. Generate Documentation

    dbt docs generate
    dbt docs serve
  8. Build Everything

    dbt build  # Runs seeds, models, snapshots, and tests in dependency order

🎯 Key Features

1. Incremental Loading

Bronze and silver models use incremental materialization to process only new/changed data:

{{ config(materialized='incremental') }}
{% if is_incremental() %}
    WHERE CREATED_AT > (SELECT COALESCE(MAX(CREATED_AT), '1900-01-01') FROM {{ this }})
{% endif %}
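Put together, a minimal incremental bronze model might look like the sketch below. Column names and the source definition are assumptions for illustration; the repo's actual models may differ:

```sql
-- models/bronze/bronze_bookings.sql (illustrative sketch; column names assumed)
{{ config(materialized='incremental', unique_key='BOOKING_ID') }}

SELECT
    BOOKING_ID,
    LISTING_ID,
    BOOKING_DATE,
    TOTAL_PRICE,
    CREATED_AT
FROM {{ source('staging', 'BOOKINGS') }}

{% if is_incremental() %}
-- On incremental runs, only pull rows newer than what the target already holds
WHERE CREATED_AT > (SELECT COALESCE(MAX(CREATED_AT), '1900-01-01') FROM {{ this }})
{% endif %}
```

On the first run (or with `--full-refresh`) the `is_incremental()` block is skipped and the full table is built; subsequent runs append only new rows.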

2. Custom Macros

Reusable business logic:

  • tag() macro: Categorizes prices into 'low', 'medium', 'high'
    {{ tag('CAST(PRICE_PER_NIGHT AS INT)') }} AS PRICE_PER_NIGHT_TAG
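A macro like this could be defined as shown below. The thresholds here are assumptions for illustration; the repo's tag.sql may use different cut-offs:

```sql
-- macros/tag.sql (illustrative sketch; price thresholds assumed)
{% macro tag(price_expr) %}
    CASE
        WHEN {{ price_expr }} < 100 THEN 'low'
        WHEN {{ price_expr }} < 250 THEN 'medium'
        ELSE 'high'
    END
{% endmacro %}
```

Because the macro expands to a plain CASE expression at compile time, it can be reused in any model's SELECT list.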

3. Dynamic SQL Generation

The OBT (One Big Table) model uses Jinja loops for maintainable joins:

{% set configs = [...] %}
SELECT {% for config in configs %}...{% endfor %}
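A concrete version of that pattern might look like this. The column list, aliases, and join keys are assumptions for illustration, not the repo's actual obt.sql:

```sql
-- models/gold/obt.sql (illustrative sketch; columns and join keys assumed)
{% set columns = [
    'b.BOOKING_ID', 'b.BOOKING_DATE', 'b.TOTAL_PRICE',
    'l.LISTING_NAME', 'l.PRICE_PER_NIGHT_TAG',
    'h.HOST_NAME', 'h.IS_SUPERHOST'
] %}

SELECT
    {%- for col in columns %}
    {{ col }}{{ "," if not loop.last }}
    {%- endfor %}
FROM {{ ref('silver_bookings') }} b
JOIN {{ ref('silver_listings') }} l ON b.LISTING_ID = l.LISTING_ID
JOIN {{ ref('silver_hosts') }} h ON l.HOST_ID = h.HOST_ID
```

Adding a column to the denormalized table then means editing one list rather than hand-maintaining a long SELECT clause.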

4. Slowly Changing Dimensions

Track historical changes with timestamp-based snapshots:

  • Valid from/to dates automatically maintained
  • Historical data preserved for point-in-time analysis
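Since the repo defines snapshots as YAML files, one of them might look like the sketch below (this YAML snapshot format requires dbt 1.9+; the key and timestamp column names are assumptions):

```yaml
# snapshots/dim_listings.yml (illustrative sketch; column names assumed)
snapshots:
  - name: dim_listings
    relation: ref('silver_listings')
    config:
      schema: SNAPSHOTS
      unique_key: LISTING_ID
      strategy: timestamp
      updated_at: UPDATED_AT
```

dbt maintains `dbt_valid_from` / `dbt_valid_to` columns automatically, so a row's full change history is queryable for point-in-time analysis.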

5. Schema Organization

Automatic schema separation by layer:

  • Bronze models → AIRBNB.BRONZE.*
  • Silver models → AIRBNB.SILVER.*
  • Gold models → AIRBNB.GOLD.*
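By default dbt would prefix custom schemas with the target schema (e.g. dbt_schema_BRONZE). The generate_schema_name.sql macro in this project most likely overrides that behavior; a common version of the override looks like this:

```sql
-- macros/generate_schema_name.sql (a common override; the repo's exact logic may differ)
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```

Paired with `+schema: BRONZE` (and SILVER/GOLD) settings in dbt_project.yml, this yields clean layer schemas like AIRBNB.BRONZE instead of AIRBNB.dbt_schema_BRONZE.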

📈 Data Quality

Testing Strategy

  • Source data validation tests
  • Unique key constraints
  • Not null checks
  • Referential integrity tests
  • Custom business rule tests
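In dbt, most of these checks are declared in YAML next to the source or model definitions. The fragment below is a hedged sketch of what sources.yml could contain; the column names are assumptions:

```yaml
# models/sources/sources.yml (illustrative sketch; column names assumed)
version: 2

sources:
  - name: staging
    database: AIRBNB
    schema: STAGING
    tables:
      - name: BOOKINGS
        columns:
          - name: BOOKING_ID
            tests:
              - unique
              - not_null
          - name: LISTING_ID
            tests:
              - relationships:
                  to: source('staging', 'LISTINGS')
                  field: LISTING_ID
```

Running `dbt test` then compiles each declaration into a SQL query that fails if any offending rows are found; custom business rules live as singular SQL tests in the tests/ directory.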

Data Lineage

dbt automatically tracks data lineage, showing:

  • Upstream dependencies
  • Downstream impacts
  • Model relationships
  • Source to consumption flow

πŸ” Security & Best Practices

  1. Credentials Management

    • Never commit profiles.yml with credentials
    • Use environment variables for sensitive data
    • Implement role-based access control (RBAC) in Snowflake
  2. Code Quality

    • SQL formatting with sqlfmt
    • Version control with Git
    • Code reviews for model changes
  3. Performance Optimization

    • Incremental models for large datasets
    • Ephemeral models for intermediate transformations
    • Appropriate clustering keys in Snowflake

📚 Additional Resources

  • dbt documentation: https://docs.getdbt.com
  • Snowflake documentation: https://docs.snowflake.com

👤 Author

Mihir Kudale (GitHub: mihirkudale)
Project: Airbnb Data Engineering Pipeline
Technologies: Snowflake, dbt, AWS, Python

πŸ› Troubleshooting

Common Issues

  1. Connection Error

    • Verify Snowflake credentials in profiles.yml
    • Check network connectivity
    • Ensure warehouse is running
  2. Compilation Error

    • Run dbt debug to check configuration
    • Verify model dependencies
    • Check Jinja syntax
  3. Incremental Load Issues

    • Run dbt run --full-refresh to rebuild from scratch
    • Verify source data timestamps

📊 Future Enhancements

  • Add data quality dashboards
  • Implement CI/CD pipeline
  • Add more complex business metrics
  • Integrate with BI tools (Tableau/Power BI)
  • Add alerting and monitoring
  • Implement data masking for PII
  • Add more comprehensive testing suite
