PostgreSQL 15 | Cloud SQL | Database:
govbrnews
Partially normalized schema with 5 tables:
- Master data:
agencies,themes(normalized) - Main table:
news(FKs + denormalized fields for performance) - Auxiliary:
sync_log,schema_version
Government agencies master data (~158 records).
CREATE TABLE agencies (
id SERIAL PRIMARY KEY,
key VARCHAR(100) UNIQUE NOT NULL,
name VARCHAR(500) NOT NULL,
type VARCHAR(100),
parent_key VARCHAR(100),
url VARCHAR(1000),
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);Indexes:
idx_agencies_keyonkeyidx_agencies_parentonparent_key
Key fields:
key: Unique identifier (e.g., "mec", "saude")parent_key: Self-reference to parent agency
Hierarchical taxonomy with 3 levels (L1 → L2 → L3).
CREATE TABLE themes (
id SERIAL PRIMARY KEY,
code VARCHAR(20) UNIQUE NOT NULL,
label VARCHAR(500) NOT NULL,
full_name VARCHAR(600),
level SMALLINT NOT NULL CHECK (level IN (1, 2, 3)),
parent_code VARCHAR(20),
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
CONSTRAINT fk_parent_theme
FOREIGN KEY (parent_code) REFERENCES themes(code) ON DELETE SET NULL
);Indexes:
idx_themes_codeoncodeidx_themes_levelonlevelidx_themes_parentonparent_code
Key fields:
code: Hierarchical code (e.g., "01.01.01")level: 1 (top), 2 (mid), or 3 (leaf)parent_code: Self-reference for hierarchy
Main news storage (~300k records). Partially normalized with FKs to agencies/themes + denormalized fields for performance.
CREATE TABLE news (
id SERIAL PRIMARY KEY,
unique_id VARCHAR(32) UNIQUE NOT NULL,
-- Foreign keys
agency_id INTEGER NOT NULL REFERENCES agencies(id),
theme_l1_id INTEGER REFERENCES themes(id),
theme_l2_id INTEGER REFERENCES themes(id),
theme_l3_id INTEGER REFERENCES themes(id),
most_specific_theme_id INTEGER REFERENCES themes(id),
-- Core content
title TEXT NOT NULL,
url TEXT,
image_url TEXT,
category VARCHAR(500),
tags TEXT[],
content TEXT,
editorial_lead TEXT,
subtitle TEXT,
-- AI-generated
summary TEXT,
-- Timestamps
published_at TIMESTAMP WITH TIME ZONE NOT NULL,
updated_datetime TIMESTAMP WITH TIME ZONE,
extracted_at TIMESTAMP WITH TIME ZONE,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
synced_to_hf_at TIMESTAMP WITH TIME ZONE,
-- Denormalized (performance)
agency_key VARCHAR(100),
agency_name VARCHAR(500)
);Primary indexes:
idx_news_unique_id(UNIQUE) onunique_ididx_news_published_atonpublished_at DESCidx_news_agency_idonagency_ididx_news_most_specific_themeonmost_specific_theme_id
Performance indexes:
idx_news_agency_keyonagency_key(denormalized)idx_news_agency_dateon(agency_id, published_at DESC)idx_news_synced_to_hf(partial) onsynced_to_hf_at WHERE synced_to_hf_at IS NULL
Full-text search:
idx_news_fts(GIN) onto_tsvector('portuguese', title || ' ' || content)
Key fields:
unique_id: MD5(agency + published_at + title)most_specific_theme_id: Most granular theme (L3 > L2 > L1)synced_to_hf_at: Last sync to HuggingFace Datasetagency_key,agency_name: Denormalized from agencies for performance
Tracks sync operations (HuggingFace, Typesense, etc.).
CREATE TABLE sync_log (
id SERIAL PRIMARY KEY,
operation VARCHAR(50) NOT NULL,
status VARCHAR(20) NOT NULL,
records_processed INTEGER DEFAULT 0,
records_failed INTEGER DEFAULT 0,
started_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
completed_at TIMESTAMP WITH TIME ZONE,
error_message TEXT,
metadata JSONB
);Indexes:
idx_sync_log_operationon(operation, started_at DESC)
Key fields:
operation: e.g., "hf_export", "typesense_index"status: "started", "completed", "failed"metadata: JSONB for additional operation data
Tracks schema migrations.
CREATE TABLE schema_version (
version VARCHAR(20) PRIMARY KEY,
applied_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
description TEXT
);Current version: 1.0 (initial schema)
CREATE TRIGGER update_news_updated_at
BEFORE UPDATE ON news
FOR EACH ROW EXECUTE FUNCTION update_updated_at_column();
CREATE TRIGGER update_agencies_updated_at
BEFORE UPDATE ON agencies
FOR EACH ROW EXECUTE FUNCTION update_updated_at_column();Automatically populates agency_key and agency_name in news table from agencies.
CREATE TRIGGER denormalize_news_agency
BEFORE INSERT OR UPDATE OF agency_id ON news
FOR EACH ROW EXECUTE FUNCTION denormalize_agency_info();News with denormalized theme hierarchy for easy querying.
CREATE VIEW news_with_themes AS
SELECT
n.id,
n.unique_id,
n.title,
n.url,
n.agency_name,
n.published_at,
n.summary,
t1.label as theme_l1,
t2.label as theme_l2,
t3.label as theme_l3,
COALESCE(t3.label, t2.label, t1.label) as most_specific_theme
FROM news n
LEFT JOIN themes t1 ON n.theme_l1_id = t1.id
LEFT JOIN themes t2 ON n.theme_l2_id = t2.id
LEFT JOIN themes t3 ON n.theme_l3_id = t3.id;Last 100 sync operations with duration.
CREATE VIEW recent_syncs AS
SELECT
operation,
status,
records_processed,
records_failed,
started_at,
completed_at,
EXTRACT(EPOCH FROM (completed_at - started_at)) as duration_seconds,
error_message
FROM sync_log
ORDER BY started_at DESC
LIMIT 100;┌──────────────┐ ┌───────────────┐
│ agencies │ │ themes │
│──────────────│ │───────────────│
│ id (PK) │ │ id (PK) │
│ key (UNIQUE) │◄─┐ │ code (UNIQUE) │◄─┐
│ name │ │ │ label │ │
│ type │ │ │ level (1-3) │ │
│ parent_key │ │ │ parent_code │──┘
└──────────────┘ │ └───────────────┘
│ ▲ ▲ ▲ ▲
│ │ │ │ │
│ ┌─────┘ │ │ └─────┐
│ │ │ │ │
┌─────┴──────┴────────┴──┴────────┴─────┐
│ news │
│───────────────────────────────────────│
│ id (PK) │
│ unique_id (UNIQUE) │
│ agency_id (FK) │
│ theme_l1_id, theme_l2_id, │
│ theme_l3_id, most_specific_theme_id │
│ title, content, summary, ... │
│ agency_key, agency_name (denorm) │
│ published_at, synced_to_hf_at │
└───────────────────────────────────────┘
SELECT * FROM news
WHERE agency_key = 'mec'
AND published_at >= NOW() - INTERVAL '30 days'
ORDER BY published_at DESC;SELECT * FROM news_with_themes
WHERE published_at >= '2024-01-01'
ORDER BY published_at DESC
LIMIT 100;SELECT title, published_at, agency_name
FROM news
WHERE to_tsvector('portuguese', title || ' ' || COALESCE(content, ''))
@@ to_tsquery('portuguese', 'educação')
ORDER BY published_at DESC;SELECT COUNT(*) FROM news WHERE synced_to_hf_at IS NULL;The news table denormalizes agency_key and agency_name to avoid JOINs on high-frequency queries. This is auto-maintained by the denormalize_news_agency trigger.
- Date queries: Use
idx_news_published_at(DESC for recent-first) - Agency filtering: Use
idx_news_agency_key(denormalized, faster than JOIN) - Theme filtering: Use
idx_news_most_specific_theme - Composite: Use
idx_news_agency_datefor agency + date range
idx_news_synced_to_hf: Only indexes records wheresynced_to_hf_at IS NULLto speed up sync job queries
See also: