You don't need a warehouse to do analytics engineering on S3 data.
If your data starts life as Parquet or CSV files in S3, spinning up a full "warehouse + orchestration + BI stack" is usually unnecessary friction: too much infrastructure, slow setup, and high cost for a simple problem.
A local-first analytics stack — DuckDB + dbt — is often the cleanest solution. It lets you go from raw files to trusted, business-ready models in minutes, not weeks, without running any servers.
This project demonstrates a small-but-real analytics pipeline that proves this approach works in practice:
S3 files → DuckDB (local) → dbt layers (staging → business) → export back to S3
- Input: `s3://rosetta-dbt-studio-demo/raw/device_events/*.parquet`
- Engine: DuckDB (local file)
- dbt layers: staging → business → export
- Models you'll produce: `daily_device_metrics` (daily averages + WARN/FAIL counts) and `daily_device_health` (GREEN/YELLOW/RED health flag)
- Output back to S3: `gold/device_events/daily_device_health.parquet`
- Time: ~15 minutes
If your team has raw Parquet or CSV files in S3 and needs trusted business tables quickly, this workflow removes most of the usual friction. You can query files directly from S3 using DuckDB locally, iterate fast with dbt, and export curated marts back to S3 — without standing up a warehouse.
This approach is ideal for:
- Small teams
- Prototyping
- Demos
- CI runs where you want reproducible dbt builds without heavy infrastructure
DuckDB is ideal when you want fast analytics without running a server. It runs locally, it's extremely quick on Parquet, and it's perfect for "data files → models" workflows.
dbt adds the analytics engineering discipline: repeatable models, clear layering, tests, docs, and lineage — so this doesn't turn into a pile of ad-hoc SQL.
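As a small taste of what "tests" means here: dbt lets you add a singular test, a SQL file under tests/ that returns rows whenever something is wrong. The test below is not part of this repo, just a sketch of the kind of guardrail dbt makes cheap, using the health model introduced later:

```sql
-- tests/assert_valid_health_flags.sql (illustrative sketch, not in the repo)
-- dbt marks the test as failed if this query returns any rows.
SELECT *
FROM {{ ref('daily_device_health') }}
WHERE health_flag NOT IN ('GREEN', 'YELLOW', 'RED')
```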
Rosetta DBT Studio ties all of this together with a desktop GUI that makes dbt Core workflows feel like an IDE.
This project follows a classic layered pattern:
models/
├── ingest/ # Load raw files from S3 into DuckDB tables
│ └── device_events.sql
├── raw/ # Source definitions (sources.yml)
│ └── main_device_events.yaml
├── staging/ # 1:1 cleanup (types, naming, basic rules)
│ └── main_device_events.sql
├── business/ # Business models / marts
│ ├── daily_device_metrics.sql
│ └── daily_device_health.sql
└── export/ # Write transformed data back to S3
└── daily_device_health__parquet.sql
This structure makes your project maintainable as it grows — without needing a heavy platform upfront.
We have IoT device measurements stored as Parquet in S3:
- Source: `s3://rosetta-dbt-studio-demo/raw/device_events/*.parquet`
- Schema: each row contains `device_id`, `ts`, `temperature_c`, `humidity_pct`, `status`

Our goal is to generate two business models:

- `daily_device_metrics` — daily averages + WARN/FAIL counts per device
- `daily_device_health` — a daily health flag (GREEN/YELLOW/RED) per device

To follow along, you'll need:
- Python 3.8+
- dbt-core with DuckDB adapter
- AWS credentials configured (for S3 access)
- Rosetta DBT Studio (optional, but recommended for the best experience)
Set your AWS credentials as environment variables:
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_SESSION_TOKEN=your_session_token  # Optional, for temporary credentials

Or configure them in your profiles.yml (recommended for production).
The project uses a local DuckDB file at rosetta/iot.duckdb. Update profiles.yml if needed:
config:
  send_anonymous_usage_stats: false
  partial_parse: true

dbtstudio_duckdb_s3_analytics:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: rosetta/iot.duckdb
      schema: main
      threads: 4

If you're using your own S3 bucket, update the paths in:

- `models/ingest/device_events.sql` (input path)
- `models/export/daily_device_health__parquet.sql` (output path)
To run the project from Rosetta DBT Studio:

- Open Rosetta DBT Studio
- Create a new connection to the DuckDB file (`rosetta/iot.duckdb`)
- Open this project
- Run models from the UI:
  - Run `ingest.device_events` to load data from S3
  - Run `staging.main_device_events` to create the staging layer
  - Run `business.daily_device_metrics` and `business.daily_device_health` to create business models
  - Run `export.daily_device_health__parquet` to export back to S3
models/ingest/device_events.sql
- Loads raw Parquet files directly from S3 into a DuckDB table
- Uses DuckDB's `read_parquet()` function with S3 paths
- Installs and loads the required extensions (httpfs, parquet)
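The repo file itself isn't reproduced here, but a minimal sketch of this kind of ingest model looks roughly like this (the pre-hook list and region are illustrative assumptions):

```sql
-- models/ingest/device_events.sql (illustrative sketch, not the exact repo file)
{{ config(
    materialized='table',
    pre_hook=[
        "INSTALL httpfs", "LOAD httpfs",
        "INSTALL parquet", "LOAD parquet",
        "SET s3_region='eu-central-1'"
    ]
) }}

SELECT *
FROM read_parquet('s3://rosetta-dbt-studio-demo/raw/device_events/*.parquet')
```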
models/staging/main_device_events.sql
- Cleans and standardizes the raw data
- References the ingested table via `{{ source('main', 'device_events') }}`
- Ensures consistent column names and types for downstream models
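A rough sketch of what that staging step can look like, with casts that are assumptions based on the schema described earlier:

```sql
-- models/staging/main_device_events.sql (illustrative sketch)
SELECT
    CAST(device_id AS VARCHAR)    AS device_id,
    CAST(ts AS TIMESTAMP)         AS ts,
    CAST(temperature_c AS DOUBLE) AS temperature_c,
    CAST(humidity_pct AS DOUBLE)  AS humidity_pct,
    UPPER(status)                 AS status
FROM {{ source('main', 'device_events') }}
```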
models/business/daily_device_metrics.sql
- Aggregates device events to daily metrics
- Calculates:
- Average temperature per device per day
- Average humidity per device per day
- Count of FAIL events per device per day
- Count of WARN events per device per day
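A minimal sketch of that aggregation; the output column names follow the example queries later in the post, and the date truncation is an assumption:

```sql
-- models/business/daily_device_metrics.sql (illustrative sketch)
SELECT
    device_id,
    CAST(ts AS DATE)                         AS event_date,
    AVG(temperature_c)                       AS avg_temperature_c,
    AVG(humidity_pct)                        AS avg_humidity_pct,
    COUNT(*) FILTER (WHERE status = 'WARN')  AS warn_count,
    COUNT(*) FILTER (WHERE status = 'FAIL')  AS fail_count
FROM {{ ref('main_device_events') }}
GROUP BY device_id, CAST(ts AS DATE)
```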
models/business/daily_device_health.sql
- Builds on `daily_device_metrics` to assign health flags
- Health logic:
  - RED: `fail_count > 3`
  - YELLOW: `warn_count > 10` (and `fail_count` ≤ 3)
  - GREEN: otherwise
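In SQL, the health flag is a single CASE expression over the daily metrics. A minimal sketch (carrying the metric columns through is an assumption, but it matches the example queries below):

```sql
-- models/business/daily_device_health.sql (illustrative sketch)
SELECT
    device_id,
    event_date,
    avg_temperature_c,
    avg_humidity_pct,
    warn_count,
    fail_count,
    CASE
        WHEN fail_count > 3  THEN 'RED'
        WHEN warn_count > 10 THEN 'YELLOW'
        ELSE 'GREEN'
    END AS health_flag
FROM {{ ref('daily_device_metrics') }}
```

Because CASE evaluates branches in order, the YELLOW branch only fires when fail_count is 3 or less, which matches the rule above.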
models/export/daily_device_health__parquet.sql
- Exports the curated business model back to S3 as Parquet
- Uses DuckDB's external table feature
- Output location: `s3://rosetta-dbt-studio-demo/gold/device_events/daily_device_health.parquet`
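With the dbt-duckdb adapter this is typically expressed as an external materialization, which writes the model's result to a file instead of a table. A rough sketch (the exact config in the repo may differ):

```sql
-- models/export/daily_device_health__parquet.sql (illustrative sketch)
{{ config(
    materialized='external',
    location='s3://rosetta-dbt-studio-demo/gold/device_events/daily_device_health.parquet',
    format='parquet'
) }}

SELECT *
FROM {{ ref('daily_device_health') }}
```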
Once your models are built, you can query them directly in DuckDB or use them in BI tools:
-- Daily average temperature and humidity for all devices
SELECT
event_date,
AVG(avg_temperature_c) AS overall_avg_temperature,
AVG(avg_humidity_pct) AS overall_avg_humidity
FROM daily_device_health
GROUP BY event_date
ORDER BY event_date;
-- Count of devices by health flag per day
SELECT
event_date,
health_flag,
COUNT(device_id) AS device_count
FROM daily_device_health
GROUP BY event_date, health_flag
ORDER BY event_date, health_flag;
-- Top 5 devices with highest fail counts
SELECT
device_id,
SUM(fail_count) AS total_fail_count
FROM daily_device_health
GROUP BY device_id
ORDER BY total_fail_count DESC
LIMIT 5;

This local-first pipeline shines when:
- You're prototyping new datasets
- You want fast iteration without standing up infrastructure
- You're a small data team (or even one analyst/engineer)
- You want reproducible outputs and dbt discipline with minimal ops
Once you have many concurrent BI users, multi-TB data, or strict governance needs, you'll likely move execution into a shared warehouse/lakehouse. The good news: your dbt models and layer structure usually transfer cleanly.
This approach gives teams real, practical ROI:
- Lower cost to start: No always-on warehouse required just to model files in S3
- Faster time-to-insight: Local DuckDB is fast; engineers iterate quickly and ship business models sooner
- Reduced operational overhead: Fewer moving parts than a full cloud analytics stack
- More standardization: dbt layering and tests create maintainable "analytics code", not one-off SQL
- Portability: Outputs written back to S3 as Parquet can be consumed by other tools (Spark, Athena, DuckDB, Trino, etc.)
- Great for demos & enablement: Excellent for onboarding analysts/engineers and proving value before scaling
For many teams, this becomes the "bootstrap path": start local-first, prove the models, then scale execution later if needed.
If runs fail, work through these checks:

- Verify AWS credentials are set correctly
- Check S3 bucket permissions
- Ensure the S3 region matches your bucket region (default: `eu-central-1`)
- Verify the DuckDB file path in `profiles.yml` exists or can be created
- Check S3 paths are correct and accessible
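A quick way to separate S3 problems from dbt problems is to run a standalone query in a plain DuckDB session. A minimal sketch, assuming credentials are supplied via DuckDB's s3_* settings (or a secret):

```sql
-- Standalone S3 sanity check, run in a plain DuckDB session (outside dbt)
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'eu-central-1';
-- SET s3_access_key_id = '...';
-- SET s3_secret_access_key = '...';

SELECT COUNT(*) AS row_count
FROM read_parquet('s3://rosetta-dbt-studio-demo/raw/device_events/*.parquet');
```

If this query fails, the issue is credentials, region, or the path; if it succeeds, look at profiles.yml and the model configs instead.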
With DuckDB + dbt, you can turn S3 files into clean, testable business models without immediately committing to a full warehouse stack. Add Rosetta DBT Studio, and the workflow becomes even more accessible: create models, run them, inspect results, and iterate quickly from a desktop IDE.
The result is a lightweight, open-source pipeline that still follows strong analytics engineering practices:
- Layered models
- Reproducible builds
- Business-ready outputs
- An easy path to export curated data back to S3
If your team is trying to move from "raw files" to "trusted business models" quickly — this local-first approach is one of the fastest ways to get there.
This project is open source and available for use and modification.
Contributions are welcome! Please feel free to submit a Pull Request.