Open Source, Local-First Analytics Engineering using DBT Studio

You don't need a warehouse to do analytics engineering on S3 data.

If your data starts life as Parquet or CSV files in S3, spinning up a full "warehouse + orchestration + BI stack" is usually unnecessary friction: too much infrastructure, slow setup, and high cost for a simple problem.

A local-first analytics stack — DuckDB + dbt — is often the cleanest solution. It lets you go from raw files to trusted, business-ready models in minutes, not weeks, without running any servers.

This project demonstrates a small-but-real analytics pipeline that proves this approach works in practice:

S3 files → DuckDB (local) → dbt layers (ingest → staging → business) → export back to S3

TL;DR (What You'll Build)

  • Input: s3://rosetta-dbt-studio-demo/raw/device_events/*.parquet
  • Engine: DuckDB (local file)
  • dbt layers: ingest → staging → business → export
  • Models you'll produce:
    • daily_device_metrics (daily averages + WARN/FAIL counts)
    • daily_device_health (GREEN/YELLOW/RED health flag)
  • Output back to S3: gold/device_events/daily_device_health.parquet
  • Time: ~15 minutes

If your team has raw Parquet or CSV files in S3 and needs trusted business tables quickly, this workflow removes most of the usual friction. You can query files directly from S3 using DuckDB locally, iterate fast with dbt, and export curated marts back to S3 — without standing up a warehouse.

This approach is ideal for:

  • Small teams
  • Prototyping
  • Demos
  • CI runs where you want reproducible dbt builds without heavy infrastructure

Why DuckDB + dbt is a Great Fit

DuckDB is ideal when you want fast analytics without running a server. It runs locally, it's extremely quick on Parquet, and it's perfect for "data files → models" workflows.
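To see what "no server" means in practice, you can query the demo bucket straight from a local DuckDB shell. This is a minimal sketch, assuming the httpfs extension and read access to the bucket:

-- Query raw Parquet in S3 directly from a local DuckDB session
INSTALL httpfs;
LOAD httpfs;

SELECT device_id, COUNT(*) AS event_count
FROM read_parquet('s3://rosetta-dbt-studio-demo/raw/device_events/*.parquet')
GROUP BY device_id;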

dbt adds the analytics engineering discipline: repeatable models, clear layering, tests, docs, and lineage — so this doesn't turn into a pile of ad-hoc SQL.

Rosetta DBT Studio ties all of this together with a desktop GUI that makes dbt Core workflows feel like an IDE.

Project Structure

This project follows a classic layered pattern:

models/
├── ingest/          # Load raw files from S3 into DuckDB tables
│   └── device_events.sql
├── raw/             # Source definitions (sources.yml)
│   └── main_device_events.yaml
├── staging/         # 1:1 cleanup (types, naming, basic rules)
│   └── main_device_events.sql
├── business/        # Business models / marts
│   ├── daily_device_metrics.sql
│   └── daily_device_health.sql
└── export/          # Write transformed data back to S3
    └── daily_device_health__parquet.sql

This structure makes your project maintainable as it grows — without needing a heavy platform upfront.

What We're Building (Concrete Example)

We have IoT device measurements stored as Parquet in S3:

  • Source: s3://rosetta-dbt-studio-demo/raw/device_events/*.parquet
  • Schema: Each row contains device_id, ts, temperature_c, humidity_pct, status

Our goal is to generate two business models:

  1. daily_device_metrics — daily averages + WARN/FAIL counts per device
  2. daily_device_health — a daily health flag (GREEN/YELLOW/RED) per device

Prerequisites

  • Python 3.8+
  • dbt-core with the DuckDB adapter (dbt-duckdb)
  • AWS credentials configured (for S3 access)
  • Rosetta DBT Studio (optional, but recommended for the best experience)

Setup

1. Configure AWS Credentials

Set your AWS credentials as environment variables:

export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_SESSION_TOKEN=your_session_token  # Optional, for temporary credentials

Or configure them in your profiles.yml (recommended for production).
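DuckDB 0.10+ can also hold S3 credentials as a session secret, which keeps them out of shell history. A sketch using the same placeholder values:

-- Run inside a DuckDB session before querying S3
CREATE SECRET aws_s3 (
    TYPE S3,
    KEY_ID 'your_access_key',
    SECRET 'your_secret_key',
    REGION 'eu-central-1'
);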

2. Configure dbt Profile

The project uses a local DuckDB file at rosetta/iot.duckdb. Update profiles.yml if needed:

config:
  send_anonymous_usage_stats: false
  partial_parse: true
dbtstudio_duckdb_s3_analytics:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: rosetta/iot.duckdb
      schema: main
      threads: 4

3. Update S3 Paths (If Needed)

If you're using your own S3 bucket, update the paths in:

  • models/ingest/device_events.sql (input path)
  • models/export/daily_device_health__parquet.sql (output path)

Running the Pipeline

Using Rosetta DBT Studio

  1. Open Rosetta DBT Studio
  2. Create a new connection to the DuckDB file (rosetta/iot.duckdb)
  3. Open this project
  4. Run models from the UI:
    • Run ingest.device_events to load data from S3
    • Run staging.main_device_events to create the staging layer
    • Run business.daily_device_metrics and business.daily_device_health to create business models
    • Run export.daily_device_health__parquet to export back to S3
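
Using the dbt CLI

The same pipeline also runs with plain dbt Core from a terminal. The path selectors below are a sketch that assumes the folder layout shown in Project Structure:

# Build the whole DAG in dependency order
dbt run

# Or build one layer at a time
dbt run --select models/ingest
dbt run --select models/staging
dbt run --select models/business
dbt run --select models/export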

Model Documentation

Ingest Layer

models/ingest/device_events.sql

  • Loads raw Parquet files directly from S3 into a DuckDB table
  • Uses DuckDB's read_parquet() function with S3 paths
  • Installs and loads required extensions (httpfs, parquet)
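A minimal version of the model could look like this (an illustrative sketch, not the exact file contents; the pre_hook is one way to load extensions from dbt-duckdb):

-- models/ingest/device_events.sql (illustrative sketch)
{{ config(
    materialized='table',
    pre_hook=["INSTALL httpfs", "LOAD httpfs"]
) }}

SELECT *
FROM read_parquet('s3://rosetta-dbt-studio-demo/raw/device_events/*.parquet')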

Staging Layer

models/staging/main_device_events.sql

  • Cleans and standardizes the raw data
  • References the ingested table via {{ source('main', 'device_events') }}
  • Ensures consistent column names and types for downstream models
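In sketch form, the staging model is a typed 1:1 pass-through over the source; the exact casts and renames are assumptions based on the schema above:

-- models/staging/main_device_events.sql (illustrative sketch)
SELECT
    CAST(device_id AS VARCHAR)    AS device_id,
    CAST(ts AS TIMESTAMP)         AS event_ts,
    CAST(temperature_c AS DOUBLE) AS temperature_c,
    CAST(humidity_pct AS DOUBLE)  AS humidity_pct,
    UPPER(status)                 AS status
FROM {{ source('main', 'device_events') }}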

Business Layer

models/business/daily_device_metrics.sql

  • Aggregates device events to daily metrics
  • Calculates:
    • Average temperature per device per day
    • Average humidity per device per day
    • Count of FAIL events per device per day
    • Count of WARN events per device per day
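Sketched out, the aggregation might look like the following; column names are inferred from the example queries later in this README:

-- models/business/daily_device_metrics.sql (illustrative sketch)
SELECT
    device_id,
    CAST(event_ts AS DATE)                  AS event_date,
    AVG(temperature_c)                      AS avg_temperature_c,
    AVG(humidity_pct)                       AS avg_humidity_pct,
    COUNT(*) FILTER (WHERE status = 'WARN') AS warn_count,
    COUNT(*) FILTER (WHERE status = 'FAIL') AS fail_count
FROM {{ ref('main_device_events') }}
GROUP BY device_id, CAST(event_ts AS DATE)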

models/business/daily_device_health.sql

  • Builds on daily_device_metrics to assign health flags
  • Health logic:
    • RED: fail_count > 3
    • YELLOW: warn_count > 10 (and fail_count ≤ 3)
    • GREEN: Otherwise
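The flag itself is a single CASE expression over the metrics model; because CASE evaluates top-down, the fail_count check takes priority, which is what makes the fail_count ≤ 3 condition for YELLOW implicit. A sketch:

-- models/business/daily_device_health.sql (illustrative sketch)
SELECT
    *,
    CASE
        WHEN fail_count > 3  THEN 'RED'
        WHEN warn_count > 10 THEN 'YELLOW'
        ELSE 'GREEN'
    END AS health_flag
FROM {{ ref('daily_device_metrics') }}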

Export Layer

models/export/daily_device_health__parquet.sql

  • Exports the curated business model back to S3 as Parquet
  • Uses dbt-duckdb's external materialization (a COPY ... TO under the hood)
  • Output location: s3://rosetta-dbt-studio-demo/gold/device_events/daily_device_health.parquet
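A sketch of the model (the location matches the output path above; option names follow dbt-duckdb's conventions, and the actual file may differ):

-- models/export/daily_device_health__parquet.sql (illustrative sketch)
{{ config(
    materialized='external',
    location='s3://rosetta-dbt-studio-demo/gold/device_events/daily_device_health.parquet',
    format='parquet'
) }}

SELECT * FROM {{ ref('daily_device_health') }}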

Example Queries

Once your models are built, you can query them directly in DuckDB or use them in BI tools:

-- Daily average temperature and humidity for all devices
SELECT
    event_date,
    AVG(avg_temperature_c) AS overall_avg_temperature,
    AVG(avg_humidity_pct) AS overall_avg_humidity
FROM daily_device_health
GROUP BY event_date
ORDER BY event_date;

-- Count of devices by health flag per day
SELECT
    event_date,
    health_flag,
    COUNT(device_id) AS device_count
FROM daily_device_health
GROUP BY event_date, health_flag
ORDER BY event_date, health_flag;

-- Top 5 devices with highest fail counts
SELECT
    device_id,
    SUM(fail_count) AS total_fail_count
FROM daily_device_health
GROUP BY device_id
ORDER BY total_fail_count DESC
LIMIT 5;

When This Approach is Ideal

This local-first pipeline shines when:

  • You're prototyping new datasets
  • You want fast iteration without standing up infrastructure
  • You're a small data team (or even one analyst/engineer)
  • You want reproducible outputs and dbt discipline with minimal ops

Once you have many concurrent BI users, multi-TB data, or strict governance needs, you'll likely move execution into a shared warehouse/lakehouse. The good news: your dbt models and layer structure usually transfer cleanly.

Benefits

This approach gives teams real, practical ROI:

  • Lower cost to start: No always-on warehouse required just to model files in S3
  • Faster time-to-insight: Local DuckDB is fast; engineers iterate quickly and ship business models sooner
  • Reduced operational overhead: Fewer moving parts than a full cloud analytics stack
  • More standardization: dbt layering and tests create maintainable "analytics code" rather than one-off SQL
  • Portability: Outputs as Parquet back to S3 means other tools can consume them (Spark, Athena, DuckDB, Trino, etc.)
  • Great for demos & enablement: Excellent for onboarding analysts/engineers and proving value before scaling

For many teams, this becomes the "bootstrap path": start local-first, prove the models, then scale execution later if needed.

Troubleshooting

S3 Access Issues

  • Verify AWS credentials are set correctly
  • Check S3 bucket permissions
  • Ensure the S3 region matches your bucket region (default: eu-central-1)
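If a region mismatch is the culprit, you can pin the region explicitly in a DuckDB session. A one-line sketch:

-- Point DuckDB's S3 client at the bucket's region
SET s3_region = 'eu-central-1';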

Path Issues

  • Verify the DuckDB file path in profiles.yml exists or can be created
  • Check S3 paths are correct and accessible

Conclusion

With DuckDB + dbt, you can turn S3 files into clean, testable business models without immediately committing to a full warehouse stack. Add Rosetta DBT Studio, and the workflow becomes even more accessible: create models, run them, inspect results, and iterate quickly from a desktop IDE.

The result is a lightweight, open-source pipeline that still follows strong analytics engineering practices:

  • Layered models
  • Reproducible builds
  • Business-ready outputs
  • An easy path to export curated data back to S3

If your team is trying to move from "raw files" to "trusted business models" quickly — this local-first approach is one of the fastest ways to get there.

License

This project is open source and available for use and modification.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
