
Commit dc87f6b

docs: add BigQuery errors table schema and update README
Add BigQuery schema:
- Create errors table with partitioning and clustering
- Metadata table for tracking loaded files
- Example queries for debugging

Update README:
- Error Store (Dead Letter Queue) section
- Storage structure and usage examples
- BigQuery setup and query examples
- Updated roadmap (error store complete)

Features documented:
- Two error types (validation vs processing)
- 30-day retention with GCS lifecycle
- Full event context for debugging
- BigQuery integration for analysis
1 parent 6f7df6d commit dc87f6b

2 files changed

Lines changed: 131 additions & 1 deletion

File tree

README.md
scripts/bigquery/create_errors_table.sql

README.md

Lines changed: 57 additions & 1 deletion
@@ -340,6 +340,62 @@ python -m scripts.run_bigquery_loader

See `scripts/bigquery/README.md` and `specs/gcs-bigquery-storage/` for full details.
### Error Store (Dead Letter Queue)
All failed events are stored in a GCS-based dead letter queue for debugging and retry:

**Two Error Types:**
- **Validation Errors**: Missing required fields, invalid schema
- **Processing Errors**: Storage failures, unexpected exceptions
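For illustration, a stored error record might look like the following Python dict. The field names follow the `errors` table schema in `create_errors_table.sql` below; all values here are hypothetical:

```python
# Hypothetical error record; field names match the BigQuery errors table
# schema in this commit, values are illustrative only.
error_record = {
    "error_id": "abc123",
    "timestamp": "2026-01-15T10:00:00Z",
    "error_type": "validation_error",  # or "processing_error"
    "error_message": "missing required field: user_id",
    "error_details": {"field": "user_id"},
    "stream": "page_views",
    "original_payload": {"event": "page_view", "url": "/home"},
    "retry_count": 0,
    "retry_after": None,
    "date": "2026-01-15",
}
```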
**Storage Structure:**
```
gs://bucket/errors/
  date=2026-01-15/
    error_type=validation/
      error-20260115-100000-abc123.parquet
    error_type=processing/
      error-20260115-100500-def456.parquet
```
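A minimal sketch of how such an object path can be derived. The helper name is hypothetical; only the Hive-style `date=`/`error_type=` layout and the filename pattern come from the structure above:

```python
import uuid
from datetime import datetime, timezone

def error_blob_path(error_type: str) -> str:
    """Build a Hive-partitioned object path matching the layout above."""
    now = datetime.now(timezone.utc)
    return (
        f"errors/date={now:%Y-%m-%d}/error_type={error_type}/"
        f"error-{now:%Y%m%d-%H%M%S}-{uuid.uuid4().hex[:6]}.parquet"
    )

print(error_blob_path("validation"))
# e.g. errors/date=2026-01-15/error_type=validation/error-20260115-100000-abc123.parquet
```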
**Create BigQuery Errors Table:**
```bash
cd scripts/bigquery
export PROJECT_ID=my-project DATASET=events
cat create_errors_table.sql | sed "s/{PROJECT_ID}/$PROJECT_ID/g" | sed "s/{DATASET}/$DATASET/g" | bq query --use_legacy_sql=false
```
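To confirm the table was created, one option (reusing the `$PROJECT_ID`/`$DATASET` placeholders exported above) is the standard `bq show` check:

```bash
# Inspect the created table's schema; placeholders as exported above.
bq show --schema --format=prettyjson $PROJECT_ID:$DATASET.errors
```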
**Query Errors:**
```sql
-- Find validation errors in last 24 hours
SELECT
  error_message,
  stream,
  COUNT(*) as count
FROM `project.dataset.errors`
WHERE date >= CURRENT_DATE() - 1
  AND error_type = 'validation_error'
GROUP BY error_message, stream
ORDER BY count DESC;

-- Get processing errors with stack traces
SELECT
  timestamp,
  error_message,
  JSON_EXTRACT_SCALAR(error_details, '$.exception_type') as exception,
  JSON_EXTRACT_SCALAR(error_details, '$.stack_trace') as stack_trace
FROM `project.dataset.errors`
WHERE error_type = 'processing_error'
ORDER BY timestamp DESC
LIMIT 10;
```
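The same queries can be run programmatically. A minimal sketch using the `google-cloud-bigquery` client (`pip install google-cloud-bigquery`); project and dataset names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
query = """
    SELECT error_message, stream, COUNT(*) AS error_count
    FROM `my-project.events.errors`
    WHERE date >= CURRENT_DATE() - 1
      AND error_type = 'validation_error'
    GROUP BY error_message, stream
    ORDER BY error_count DESC
"""
# Run the query and iterate over result rows.
for row in client.query(query).result():
    print(row.error_message, row.stream, row.error_count)
```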
**Key Features:**
- Never loses events: all failures are stored for debugging
- Automatic 30-day retention (GCS lifecycle rules; see the sketch after this list)
- Full event context (payload, error, timestamp, stream)
- Queryable via BigQuery for pattern analysis
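The 30-day retention can be enforced with a GCS lifecycle rule. A sketch of one way to apply it; the bucket name is a placeholder, and the `matchesPrefix` condition assumes the `errors/` layout shown above:

```bash
# Delete objects under errors/ once they are 30 days old.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 30, "matchesPrefix": ["errors/"]}
    }
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://bucket
```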
### Custom Storage

Implement the `EventStore` protocol for any backend:
@@ -486,7 +542,7 @@ uv run ruff format src/
 - [x] Prometheus metrics
 - [x] EventSubscriptionCoordinator (dual-path architecture)
 - [x] Hash-based sequencer for consistent ordering
-- [ ] Error handling and dead letter queue (ErrorStore protocol exists, needs implementation)
+- [x] Error store with dead letter queue (GCS-based)
 - [ ] Performance benchmarks (10k+ events/sec)

 ### v1.0
scripts/bigquery/create_errors_table.sql

Lines changed: 74 additions & 0 deletions

@@ -0,0 +1,74 @@
-- Create errors table for dead letter queue
--
-- Usage:
--   export PROJECT_ID=your-project
--   export DATASET=events
--   cat create_errors_table.sql | sed "s/{PROJECT_ID}/$PROJECT_ID/g" | sed "s/{DATASET}/$DATASET/g" | bq query --use_legacy_sql=false

CREATE TABLE IF NOT EXISTS `{PROJECT_ID}.{DATASET}.errors` (
  error_id STRING NOT NULL,
  timestamp TIMESTAMP NOT NULL,
  error_type STRING NOT NULL,  -- validation_error | processing_error
  error_message STRING NOT NULL,
  error_details JSON,
  stream STRING NOT NULL,
  original_payload JSON NOT NULL,
  retry_count INT64 DEFAULT 0,
  retry_after TIMESTAMP,
  date DATE NOT NULL
)
PARTITION BY date
CLUSTER BY error_type, stream
OPTIONS(
  description="Dead letter queue for failed events. All events that fail validation or processing are stored here with full context for debugging.",
  partition_expiration_days=30
);

-- Create metadata table for tracking loaded error files
CREATE TABLE IF NOT EXISTS `{PROJECT_ID}.{DATASET}._loaded_error_files` (
  file_path STRING NOT NULL,
  loaded_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP(),
  row_count INT64,
  error_type STRING
)
PARTITION BY DATE(loaded_at)
OPTIONS(
  description="Metadata tracking which error files have been loaded to prevent duplicates"
);

-- Example queries for debugging

-- Find validation errors in last 24 hours
-- SELECT
--   error_type,
--   error_message,
--   stream,
--   COUNT(*) as count
-- FROM `{PROJECT_ID}.{DATASET}.errors`
-- WHERE date >= CURRENT_DATE() - 1
--   AND error_type = 'validation_error'
-- GROUP BY error_type, error_message, stream
-- ORDER BY count DESC;

-- Find processing errors with stack traces
-- SELECT
--   timestamp,
--   error_message,
--   JSON_EXTRACT_SCALAR(error_details, '$.exception_type') as exception,
--   JSON_EXTRACT_SCALAR(error_details, '$.stack_trace') as stack_trace,
--   original_payload
-- FROM `{PROJECT_ID}.{DATASET}.errors`
-- WHERE error_type = 'processing_error'
-- ORDER BY timestamp DESC
-- LIMIT 10;

-- Error rate by stream
-- SELECT
--   stream,
--   error_type,
--   COUNT(*) as error_count,
--   ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY stream), 2) as pct
-- FROM `{PROJECT_ID}.{DATASET}.errors`
-- WHERE date >= CURRENT_DATE() - 7
-- GROUP BY stream, error_type
-- ORDER BY stream, error_count DESC;
