Open Source, Local-First Analytics Engineering using DBT Studio

You don't need a warehouse to do analytics engineering on S3 data.

If your data starts life as Parquet or CSV files in S3, spinning up a full "warehouse + orchestration + BI stack" is usually unnecessary friction: too much infrastructure, slow setup, and high cost for a simple problem.

A local-first analytics stack — DuckDB + dbt — is often the cleanest solution. It lets you go from raw files to trusted, business-ready models in minutes, not weeks, without running any servers.

This project demonstrates a small-but-real analytics pipeline that proves this approach works in practice:

S3 files → DuckDB (local) → dbt layers (ingest → staging → business) → export back to S3

TL;DR (What You'll Build)

  • Input: s3://rosetta-dbt-studio-demo/raw/device_events/*.parquet
  • Engine: DuckDB (local file)
  • dbt layers: ingest → staging → business → export
  • Models you'll produce:
    • daily_device_metrics (daily averages + WARN/FAIL counts)
    • daily_device_health (GREEN/YELLOW/RED health flag)
  • Output back to S3: gold/device_events/daily_device_health.parquet
  • Time: ~15 minutes

If your team has raw Parquet or CSV files in S3 and needs trusted business tables quickly, this workflow removes most of the usual friction. You can query files directly from S3 using DuckDB locally, iterate fast with dbt, and export curated marts back to S3 — without standing up a warehouse.

This approach is ideal for:

  • Small teams
  • Prototyping
  • Demos
  • CI runs where you want reproducible dbt builds without heavy infrastructure

Why DuckDB + dbt is a Great Fit

DuckDB is ideal when you want fast analytics without running a server. It runs locally, it's extremely quick on Parquet, and it's perfect for "data files → models" workflows.
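To see what "no server" means in practice, you can query the demo bucket straight from a local DuckDB shell. This is a minimal sketch, assuming the httpfs extension and read access to the bucket:

-- Query raw Parquet in S3 directly from a local DuckDB session
INSTALL httpfs;
LOAD httpfs;

SELECT device_id, COUNT(*) AS event_count
FROM read_parquet('s3://rosetta-dbt-studio-demo/raw/device_events/*.parquet')
GROUP BY device_id;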

dbt adds the analytics engineering discipline: repeatable models, clear layering, tests, docs, and lineage — so this doesn't turn into a pile of ad-hoc SQL.

Rosetta DBT Studio ties all of this together with a desktop GUI that makes dbt Core workflows feel like an IDE.

Project Structure

This project follows a classic layered pattern:

models/
├── ingest/          # Load raw files from S3 into DuckDB tables
│   └── device_events.sql
├── raw/             # Source definitions (sources.yml)
│   └── main_device_events.yaml
├── staging/         # 1:1 cleanup (types, naming, basic rules)
│   └── main_device_events.sql
├── business/        # Business models / marts
│   ├── daily_device_metrics.sql
│   └── daily_device_health.sql
└── export/          # Write transformed data back to S3
    └── daily_device_health__parquet.sql

This structure makes your project maintainable as it grows — without needing a heavy platform upfront.

What We're Building (Concrete Example)

We have IoT device measurements stored as Parquet in S3:

  • Source: s3://rosetta-dbt-studio-demo/raw/device_events/*.parquet
  • Schema: Each row contains device_id, ts, temperature_c, humidity_pct, status

Our goal is to generate two business models:

  1. daily_device_metrics — daily averages + WARN/FAIL counts per device
  2. daily_device_health — a daily health flag (GREEN/YELLOW/RED) per device

Prerequisites

  • Python 3.8+
  • dbt-core with the DuckDB adapter (dbt-duckdb)
  • AWS credentials configured (for S3 access)
  • Rosetta DBT Studio (optional, but recommended for the best experience)

Setup

1. Configure AWS Credentials

Set your AWS credentials as environment variables:

export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_SESSION_TOKEN=your_session_token  # Optional, for temporary credentials

Or configure them in your profiles.yml (recommended for production).
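DuckDB 0.10+ can also hold S3 credentials as a session secret, which keeps them out of shell history. A sketch using the same placeholder values:

-- Run inside a DuckDB session before querying S3
CREATE SECRET aws_s3 (
    TYPE S3,
    KEY_ID 'your_access_key',
    SECRET 'your_secret_key',
    REGION 'eu-central-1'
);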

2. Configure dbt Profile

The project uses a local DuckDB file at rosetta/iot.duckdb. Update profiles.yml if needed:

config:
  send_anonymous_usage_stats: false
  partial_parse: true
dbtstudio_duckdb_s3_analytics:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: rosetta/iot.duckdb
      schema: main
      threads: 4

3. Update S3 Paths (If Needed)

If you're using your own S3 bucket, update the paths in:

  • models/ingest/device_events.sql (input path)
  • models/export/daily_device_health__parquet.sql (output path)

Running the Pipeline

Using Rosetta DBT Studio

  1. Open Rosetta DBT Studio
  2. Create a new connection to the DuckDB file (rosetta/iot.duckdb)
  3. Open this project
  4. Run models from the UI:
    • Run ingest.device_events to load data from S3
    • Run staging.main_device_events to create the staging layer
    • Run business.daily_device_metrics and business.daily_device_health to create business models
    • Run export.daily_device_health__parquet to export back to S3
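
Using the dbt CLI

The same pipeline also runs with plain dbt Core from a terminal. The path selectors below are a sketch that assumes the folder layout shown in Project Structure:

# Build the whole DAG in dependency order
dbt run

# Or build one layer at a time
dbt run --select models/ingest
dbt run --select models/staging
dbt run --select models/business
dbt run --select models/export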

Model Documentation

Ingest Layer

models/ingest/device_events.sql

  • Loads raw Parquet files directly from S3 into a DuckDB table
  • Uses DuckDB's read_parquet() function with S3 paths
  • Installs and loads required extensions (httpfs, parquet)
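A minimal version of the model could look like this (an illustrative sketch, not the exact file contents; the pre_hook is one way to load extensions from dbt-duckdb):

-- models/ingest/device_events.sql (illustrative sketch)
{{ config(
    materialized='table',
    pre_hook=["INSTALL httpfs", "LOAD httpfs"]
) }}

SELECT *
FROM read_parquet('s3://rosetta-dbt-studio-demo/raw/device_events/*.parquet')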

Staging Layer

models/staging/main_device_events.sql

  • Cleans and standardizes the raw data
  • References the ingested table via {{ source('main', 'device_events') }}
  • Ensures consistent column names and types for downstream models
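In sketch form, the staging model is a typed 1:1 pass-through over the source; the exact casts and renames are assumptions based on the schema above:

-- models/staging/main_device_events.sql (illustrative sketch)
SELECT
    CAST(device_id AS VARCHAR)    AS device_id,
    CAST(ts AS TIMESTAMP)         AS event_ts,
    CAST(temperature_c AS DOUBLE) AS temperature_c,
    CAST(humidity_pct AS DOUBLE)  AS humidity_pct,
    UPPER(status)                 AS status
FROM {{ source('main', 'device_events') }}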

Business Layer

models/business/daily_device_metrics.sql

  • Aggregates device events to daily metrics
  • Calculates:
    • Average temperature per device per day
    • Average humidity per device per day
    • Count of FAIL events per device per day
    • Count of WARN events per device per day
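Sketched out, the aggregation might look like the following; column names are inferred from the example queries later in this README:

-- models/business/daily_device_metrics.sql (illustrative sketch)
SELECT
    device_id,
    CAST(event_ts AS DATE)                  AS event_date,
    AVG(temperature_c)                      AS avg_temperature_c,
    AVG(humidity_pct)                       AS avg_humidity_pct,
    COUNT(*) FILTER (WHERE status = 'WARN') AS warn_count,
    COUNT(*) FILTER (WHERE status = 'FAIL') AS fail_count
FROM {{ ref('main_device_events') }}
GROUP BY device_id, CAST(event_ts AS DATE)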

models/business/daily_device_health.sql

  • Builds on daily_device_metrics to assign health flags
  • Health logic:
    • RED: fail_count > 3
    • YELLOW: warn_count > 10 (and fail_count ≤ 3)
    • GREEN: Otherwise
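The flag itself is a single CASE expression over the metrics model; because CASE evaluates top-down, the fail_count check takes priority, which is what makes the fail_count ≤ 3 condition for YELLOW implicit. A sketch:

-- models/business/daily_device_health.sql (illustrative sketch)
SELECT
    *,
    CASE
        WHEN fail_count > 3  THEN 'RED'
        WHEN warn_count > 10 THEN 'YELLOW'
        ELSE 'GREEN'
    END AS health_flag
FROM {{ ref('daily_device_metrics') }}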

Export Layer

models/export/daily_device_health__parquet.sql

  • Exports the curated business model back to S3 as Parquet
  • Uses dbt-duckdb's external materialization (a COPY ... TO under the hood)
  • Output location: s3://rosetta-dbt-studio-demo/gold/device_events/daily_device_health.parquet
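A sketch of the model (the location matches the output path above; option names follow dbt-duckdb's conventions, and the actual file may differ):

-- models/export/daily_device_health__parquet.sql (illustrative sketch)
{{ config(
    materialized='external',
    location='s3://rosetta-dbt-studio-demo/gold/device_events/daily_device_health.parquet',
    format='parquet'
) }}

SELECT * FROM {{ ref('daily_device_health') }}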

Example Queries

Once your models are built, you can query them directly in DuckDB or use them in BI tools:

-- Daily average temperature and humidity for all devices
SELECT
    event_date,
    AVG(avg_temperature_c) AS overall_avg_temperature,
    AVG(avg_humidity_pct) AS overall_avg_humidity
FROM daily_device_health
GROUP BY event_date
ORDER BY event_date;

-- Count of devices by health flag per day
SELECT
    event_date,
    health_flag,
    COUNT(device_id) AS device_count
FROM daily_device_health
GROUP BY event_date, health_flag
ORDER BY event_date, health_flag;

-- Top 5 devices with highest fail counts
SELECT
    device_id,
    SUM(fail_count) AS total_fail_count
FROM daily_device_health
GROUP BY device_id
ORDER BY total_fail_count DESC
LIMIT 5;

When This Approach is Ideal

This local-first pipeline shines when:

  • You're prototyping new datasets
  • You want fast iteration without standing up infrastructure
  • You're a small data team (or even one analyst/engineer)
  • You want reproducible outputs and dbt discipline with minimal ops

Once you have many concurrent BI users, multi-TB data, or strict governance needs, you'll likely move execution into a shared warehouse/lakehouse. The good news: your dbt models and layer structure usually transfer cleanly.

Benefits

This approach gives teams real, practical ROI:

  • Lower cost to start: No always-on warehouse required just to model files in S3
  • Faster time-to-insight: Local DuckDB is fast; engineers iterate quickly and ship business models sooner
  • Reduced operational overhead: Fewer moving parts than a full cloud analytics stack
  • More standardization: dbt layering and tests create maintainable "analytics code" rather than one-off SQL
  • Portability: Outputs as Parquet back to S3 means other tools can consume them (Spark, Athena, DuckDB, Trino, etc.)
  • Great for demos & enablement: Excellent for onboarding analysts/engineers and proving value before scaling

For many teams, this becomes the "bootstrap path": start local-first, prove the models, then scale execution later if needed.

Troubleshooting

S3 Access Issues

  • Verify AWS credentials are set correctly
  • Check S3 bucket permissions
  • Ensure the S3 region matches your bucket region (default: eu-central-1)
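If a region mismatch is the culprit, you can pin the region explicitly in a DuckDB session. A one-line sketch:

-- Point DuckDB's S3 client at the bucket's region
SET s3_region = 'eu-central-1';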

Path Issues

  • Verify the DuckDB file path in profiles.yml exists or can be created
  • Check S3 paths are correct and accessible

Conclusion

With DuckDB + dbt, you can turn S3 files into clean, testable business models without immediately committing to a full warehouse stack. Add Rosetta DBT Studio, and the workflow becomes even more accessible: create models, run them, inspect results, and iterate quickly from a desktop IDE.

The result is a lightweight, open-source pipeline that still follows strong analytics engineering practices:

  • Layered models
  • Reproducible builds
  • Business-ready outputs
  • An easy path to export curated data back to S3

If your team is trying to move from "raw files" to "trusted business models" quickly — this local-first approach is one of the fastest ways to get there.

License

This project is open source and available for use and modification.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
