cwida · ila · Apr 2, 2026 · Mar 23, 2026 · Mar 24, 2026 · Mar 24, 2026
diff --git a/.claude/settings.json b/.claude/settings.json
@@ -0,0 +1,17 @@
+{
+  "hooks": {
+    "PostToolUse": [
+      {
+        "matcher": "Edit",
+        "hooks": [
+          {
+            "type": "command",
+            "command": "make format-fix 2>/dev/null || true",
+            "timeout": 30,
+            "statusMessage": "Running format-fix..."
+          }
+        ]
+      }
+    ]
+  }
+}
diff --git a/.claude/skills/explain-dp/SKILL.md b/.claude/skills/explain-dp/SKILL.md
@@ -0,0 +1,82 @@
+---
+name: explain-dp
+description: Reference material for differential privacy concepts. Auto-loaded when discussing privacy, attacks, sensitivity, or clipping.
+---
+
+## Differential Privacy (DP)
+
+### Definition
+
+A randomized mechanism M satisfies (ε,δ)-differential privacy if for all
+neighboring datasets D, D' (differing in one individual) and all outputs S:
+
+    P[M(D) ∈ S] ≤ e^ε · P[M(D') ∈ S] + δ
+
+Smaller ε = stronger privacy. δ is a small additive slack: informally, the probability with which the ε bound may fail.
+
+### Key concepts
+
+- **Sensitivity**: Maximum change in query output when one individual is
+  added/removed. For SUM with values in [L,U]: sensitivity = max(|L|, |U|) (U-L if a record is replaced instead).
+- **Laplace mechanism**: Add Laplace(0, sensitivity/ε) noise. Standard for counting queries.
+- **Gaussian mechanism**: Add N(0, sensitivity²·2ln(1.25/δ)/ε²) noise. Better for composition.
+- **Composition**: Running k queries on the same data costs k·ε total (basic),
+  or roughly O(√(k·ln(1/δ'))·ε) with advanced composition, at the cost of a small extra δ'.
+- **Post-processing**: Any function of a DP output is still DP. Free to clip/transform after noise.
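+
+A worked calibration may make these scales concrete. The numbers are illustrative assumptions, not project defaults: a SUM over values clipped to [0, 100] (so sensitivity s = 100), ε = 0.5, δ = 10⁻⁵, and k = 10 queries.
+
+```latex
+\begin{align*}
+b_{\text{Laplace}} &= \frac{s}{\varepsilon} = \frac{100}{0.5} = 200,
+  & \text{std} &= \sqrt{2}\, b \approx 282.8 \\
+\sigma_{\text{Gauss}} &= \frac{s \sqrt{2 \ln(1.25/\delta)}}{\varepsilon}
+  = \frac{100 \sqrt{2 \ln(125000)}}{0.5} \approx 969 \\
+\varepsilon_{\text{total}} &= k \cdot \varepsilon = 10 \cdot 0.5 = 5
+  \quad \text{(basic composition)}
+\end{align*}
+```
+
+For a single query the Laplace noise is the smaller of the two in this example; the Gaussian mechanism tends to pay off only once many queries are composed.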
+
+### Membership Inference Attack (MIA)
+
+The adversary's game: given a query result, determine whether a specific individual
+is in the dataset. Attack accuracy = fraction of correct guesses across trials.
+50% = random guessing (DP working, assuming a balanced prior). >50% = information leakage.
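+
+Under pure ε-DP (δ = 0), a standard Bayes-odds argument bounds the adversary's posterior belief, which gives a rough ceiling on attack accuracy. As an illustration with prior p = 50% and ε = 0.25 (values chosen only for the example):
+
+```latex
+\[
+\Pr[\text{member} \mid \text{output}]
+  \le \frac{p\, e^{\varepsilon}}{p\, e^{\varepsilon} + (1 - p)}
+  = \frac{0.5 \cdot e^{0.25}}{0.5 \cdot e^{0.25} + 0.5}
+  \approx \frac{1.284}{2.284} \approx 0.562
+\]
+```
+
+So at ε = 0.25 an attack accuracy of up to about 56% is still compatible with the guarantee.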
+
+### Bounded user contribution (Wilson et al. 2019)
+
+Standard approach for DP SQL:
+1. GROUP BY user_id → compute per-user contribution
+2. Clip each user's contribution to [L, U]
+3. Sum clipped contributions
+4. Add noise calibrated to the clipping bound: max(|L|, |U|) when a user is added/removed
+
+This handles both single-large-value outliers and many-small-values users.
+Reference: "Differentially Private SQL with Bounded User Contribution" (Google).
+
+### How PAC differs from DP
+
+| | DP | PAC |
+|---|---|---|
+| **Guarantee type** | Input-independent (worst-case) | Instance-dependent (distribution D) |
+| **Noise calibration** | Sensitivity s → noise ∝ s/ε | Variance σ² → noise ∝ σ²/(2β) |
+| **White-boxing** | Required (analyze algorithm) | Not needed (black-box simulation) |
+| **Composition** | k queries → k·ε (basic) | k queries → Σ MIᵢ (linear, Theorem 2) |
+| **Privacy metric** | ε (log-likelihood ratio) | MI (mutual information, in nats) |
+| **Conversion** | MI=1/128 ≈ ε=0.25 for prior=50% | See Table 3.2 in thesis |
+| **Stable algorithms** | Same noise regardless | Less noise automatically |
+| **Outlier impact** | Sensitivity explodes | Variance explodes (same practical problem) |
+
+Key insight: PAC guarantees are **loose** — the theoretical bound on MIA success
+rate is conservative. Empirical attacks achieve lower success than the bound
+predicts. This means the bounds are hard to violate.
+
+### Input clipping (Winsorization)
+
+Clip individual values to [μ-tσ, μ+tσ] before aggregation. Reduces sensitivity.
+Well-established in DP literature. Limitations: doesn't catch users with many
+small values (need per-user contribution clipping instead).
+
+### Privacy-conscious design
+
+Rather than post-hoc privatization (build algorithm, then add noise), PAC enables
+**privacy-conscious design**: optimize algorithm parameters jointly with
+the privacy budget.
+
+Key result: For a privatized estimator with budget B:
+    MSE = Bias² + (1/(2B) + 1) · Var + error
+
+This means privatization scales the variance term by 1 + 1/(2B). At tight budgets
+(small B), the optimal algorithm shifts toward lower-variance (higher-bias)
+models. E.g., stronger regularization in ridge regression.
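+
+To make the amplification concrete, the variance multiplier can be evaluated at a few budgets. The budget values are illustrative, and the `error` term is left as in the formula above.
+
+```latex
+\[
+\frac{1}{2B} + 1 =
+\begin{cases}
+2 & B = 1/2 \\
+9 & B = 1/16 \\
+65 & B = 1/128
+\end{cases}
+\]
+```
+
+At B = 1/128 the variance term counts 65 times, so accepting some extra bias to cut variance can pay off.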
+
+For databases: this suggests that queries producing high-variance outputs (due to
+outliers, small groups, etc.) are inherently harder to privatize. Clipping reduces
+variance and thus the noise needed, improving the privacy-utility tradeoff.
diff --git a/.claude/skills/explain-pac-ddl/SKILL.md b/.claude/skills/explain-pac-ddl/SKILL.md
@@ -0,0 +1,108 @@
+---
+name: explain-pac-ddl
+description: Reference for PAC DDL syntax — PAC_KEY, PAC_LINK, PROTECTED, SET PU, and the parser. Auto-loaded when discussing table setup, privacy units, or protected columns.
+---
+
+## PAC DDL Overview
+
+PAC extends SQL DDL with privacy annotations. The parser (`src/parser/pac_parser.cpp`,
+`src/parser/pac_parser_helpers.cpp`) intercepts CREATE TABLE and ALTER TABLE statements
+to extract PAC-specific clauses before forwarding to DuckDB.
+
+### Privacy Unit (PU) table
+
+The PU table is the entity being protected (e.g., customer). One row = one individual.
+
+```sql
+-- Mark a table as the privacy unit
+ALTER TABLE customer ADD PAC_KEY (c_custkey);
+ALTER TABLE customer SET PU;
+
+-- Protect specific columns from direct projection
+ALTER PU TABLE customer ADD PROTECTED (c_acctbal, c_name, c_address);
+```
+
+- `PAC_KEY (col)`: Designates the column(s) that uniquely identify a privacy unit.
+  Must be set before `SET PU`.
+- `SET PU`: Marks the table as the privacy unit. After this, aggregates on linked
+  tables get PAC noise.
+- `PROTECTED (col1, col2, ...)`: Columns that cannot be directly projected.
+  Aggregates (SUM, COUNT, AVG) on protected columns go through PAC.
+
+### Linking tables to the PU
+
+Non-PU tables reference the PU table via foreign-key-like links:
+
+```sql
+ALTER TABLE orders ADD PAC_LINK (o_custkey) REFERENCES customer (c_custkey);
+ALTER TABLE lineitem ADD PAC_LINK (l_orderkey) REFERENCES orders (o_orderkey);
+```
+
+- `PAC_LINK (local_col) REFERENCES table(ref_col)`: Declares how to join this
+  table back to the PU. The compiler uses these links to inject the PU hash
+  into the query plan.
+- Links can be chained: `lineitem → orders → customer`.
+
+### CREATE TABLE syntax (inline)
+
+PAC clauses can be inlined in CREATE TABLE:
+
+```sql
+CREATE PU TABLE employees (
+    id INTEGER,
+    department VARCHAR,
+    salary DECIMAL(10,2),
+    PAC_KEY (id),
+    PROTECTED (salary)
+);
+```
+
+The parser strips PAC_KEY, PAC_LINK, and PROTECTED clauses from the CREATE
+statement, forwards the clean SQL to DuckDB, then applies the PAC metadata
+via ALTER TABLE internally.
+
+### Common mistakes
+
+- `PAC_LINK(col, table, ref)` — wrong. Use `PAC_LINK (col) REFERENCES table(ref)`.
+- `PROTECTED salary` — wrong. Must have parentheses: `PROTECTED (salary)`.
+- Altering a table that is already a PU table requires `ALTER PU TABLE`, not `ALTER TABLE`.
+
+### Metadata files
+
+PAC metadata (PU tables, links, protected columns) is stored in JSON sidecar files
+next to the database file. The naming convention is:
+
+```
+pac_metadata_<dbname>_<schema>.json
+```
+
+For example, `tpch_sf1.db` produces `pac_metadata_tpch_sf1_main.json` in the same
+directory.
+
+**Auto-loading**: When the PAC extension loads (`LOAD pac`), it automatically looks
+for a matching metadata file next to the attached database and loads it. No manual
+`PRAGMA load_pac_metadata` is needed for persistent databases.
+
+**Saving**: After setting up PAC_KEY/PAC_LINK/PROTECTED, save with:
+```sql
+PRAGMA save_pac_metadata('/path/to/pac_metadata_mydb_main.json');
+```
+
+**Clearing**: Reset all in-memory PAC metadata:
+```sql
+PRAGMA clear_pac_metadata;
+```
+
+**Important**: If you delete or recreate a database file, also delete the
+corresponding `pac_metadata_*.json` file. Stale metadata causes confusing errors
+(references to tables/columns that no longer exist).
+
+For in-memory databases, the metadata file is named `pac_metadata_memory_main.json`
+in the current working directory.
+
+### Key source files
+
+- `src/parser/pac_parser.cpp` — main parser hook (intercepts SQL statements)
+- `src/parser/pac_parser_helpers.cpp` — extraction of PAC_KEY, PAC_LINK, PROTECTED
+- `src/core/pac_metadata.cpp` — in-memory metadata storage for PU/link/protected info
+- `src/core/pac_extension.cpp` — auto-loading of metadata on extension load (LoadInternal)
diff --git a/.claude/skills/explain-pac/SKILL.md b/.claude/skills/explain-pac/SKILL.md
@@ -0,0 +1,98 @@
+---
+name: explain-pac
+description: Reference material for PAC privacy internals. Auto-loaded when discussing PAC mechanism, noise, counters, or clipping.
+---
+
+## PAC Privacy Overview
+
+PAC (Probably Approximately Correct) privacy is a framework for privatizing
+algorithms with provable guarantees, described in [SIMD-PAC-DB](https://arxiv.org/abs/2603.15023).
+
+### Formal definition
+
+Given a data distribution D, a query Q satisfies (δ, ρ, D)-PAC Privacy if no
+adversary who knows D can, after observing Q(X) where X ~ D, produce an
+estimate X̂ such that ρ(X̂, X) = 1 with probability ≥ (1-δ).
+
+The key insight: **noise scales with the variance of the algorithm's output across
+random subsamples** of the data. Stable algorithms (low variance) need less noise.
+
+### The 4-step privatization template
+
+1. **Subsample**: Draw m random 50%-subsets X₁...Xₘ from the full dataset
+2. **Compute**: Run the query Q on each subset → outputs y₁...yₘ
+3. **Estimate noise**: Compute variance σ² across the yᵢ. Required noise: Δ = σ²/(2β)
+   where β is the MI budget
+4. **Release**: Pick a random subset Xⱼ, return Q(Xⱼ) + N(0, Δ)
+
+This is the theoretical foundation. SIMD-PAC-DB encodes this efficiently using
+64 parallel counters (one per possible subset assignment bit).
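+
+As a small worked instance of step 3 (values chosen only for illustration): if the subsample outputs have variance σ² = 4 and the budget is β = 1/4, then
+
+```latex
+\[
+\Delta = \frac{\sigma^2}{2\beta} = \frac{4}{2 \cdot \tfrac{1}{4}} = 8,
+\qquad \text{released value} = Q(X_j) + \mathcal{N}(0,\, 8)
+\]
+```
+
+Halving β to 1/8 doubles Δ to 16, while a more stable query with σ² = 2 needs only half the noise at the same budget.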
+
+### MI → posterior success rate
+
+| MI | Max posterior (prior=50%) | Max posterior (prior=25%) |
+|----|--------------------------|--------------------------|
+| 1/128 | 56.2% | 30.5% |
+| 1/64 | 58.8% | 32.9% |
+| 1/32 | 62.4% | 36.3% |
+| 1/16 | 67.5% | 41.2% |
+| 1/8 | 74.5% | 48.2% |
+| 1/4 | 83.8% | 58.4% |
+| 1/2 | 95.2% | 72.7% |
+| 1 | 100% | 91.4% |
+
+### PAC Composition
+
+For T adaptive queries with independent random sampling per query, the total
+MI is bounded by the sum: MI(total) ≤ Σᵢ MIᵢ. This is linear composition —
+each query's MI adds to the budget. The key requirement: **independent random
+sampling per query** (each query uses a fresh random subset).
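+
+For example, with a per-query budget of 1/64 (an illustrative choice), eight queries compose to
+
+```latex
+\[
+\mathrm{MI}_{\text{total}} \le \sum_{i=1}^{8} \mathrm{MI}_i
+  = 8 \cdot \frac{1}{64} = \frac{1}{8}
+\]
+```
+
+By the table above, that corresponds to a maximum posterior of 74.5% at a 50% prior, up from 58.8% for a single query.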
+
+### PAC vs DP
+
+- **DP**: input-independent guarantee. Requires white-boxing to compute sensitivity.
+  Noise ∝ sensitivity/ε. Works for worst-case neighboring datasets.
+- **PAC**: instance-dependent guarantee. No white-boxing needed. Noise ∝ Var[Q(X)]/β.
+  Stable queries get less noise automatically. But the guarantee depends on the
+  data distribution D.
+
+### Core mechanism (SIMD-PAC-DB implementation)
+
+- Each aggregate maintains **64 parallel counters** (one per bit of a hashed key)
+- Each row's value is added to ~32 counters (determined by pac_hash of the PU key)
+- At finalization, noise calibrated to a **mutual information bound** (pac_mi) is
+  added, and the result is estimated from the counters
+- PAC does NOT compute sensitivity (unlike differential privacy)
+- The 64 counters encode m=64 possible subsets in one pass (SIMD-efficient)
+
+### SWAR bitslice encoding
+
+- Counters are packed as 4 × uint16_t per uint64_t (SWAR = SIMD Within A Register)
+- This enables processing 4 counters per instruction without actual SIMD intrinsics
+- Overflow cascades to 32-bit overflow counters when 16-bit counters saturate
+
+### pac_clip_sum (contribution clipping)
+
+- **Pre-aggregation**: Query rewriter inserts `GROUP BY pu_hash` to sum each user's
+  rows into a single contribution (handles the "50K small items" case)
+- **Magnitude levels**: Values decomposed into levels (4x per level, 2-bit shift).
+  Level 0: 0-255, Level 1: 256-1023, Level 2: 1024-4095, etc.
+- **Bitmap tracking**: Each level maintains a 64-bit bitmap of distinct contributors
+  (using birthday-paradox estimation from popcount)
+- **Hard-zero**: Levels with fewer distinct contributors than `pac_clip_support`
+  contribute nothing to the result (prevents variance side-channel attacks)
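+
+The three listed bands are consistent with the level function below. It assumes the 4x pattern continues beyond Level 2; the actual implementation may differ, for example in how it treats negative values or the top level.
+
+```latex
+\[
+\text{level}(v) =
+\begin{cases}
+0 & 0 \le v < 2^{8} \\
+\left\lfloor \dfrac{\lfloor \log_2 v \rfloor - 8}{2} \right\rfloor + 1 & v \ge 2^{8}
+\end{cases}
+\]
+```
+
+For instance, a per-user total of 50,000 has ⌊log₂ v⌋ = 15, giving level ⌊7/2⌋ + 1 = 4, i.e. the band 16384-65535 under this assumption.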
+
+### Key settings
+
+- `pac_mi`: Mutual information bound (0 = deterministic/no noise)
+- `pac_seed`: RNG seed for reproducible noise
+- `pac_clip_support`: Minimum distinct contributors per magnitude level (NULL = disabled)
+- `pac_hash_repair`: Ensure pac_hash outputs exactly 32 bits set
+
+### DDL
+
+```sql
+ALTER TABLE customer ADD PAC_KEY (c_custkey);
+ALTER TABLE customer SET PU;
+ALTER TABLE orders ADD PAC_LINK (o_custkey) REFERENCES customer (c_custkey);
+```
diff --git a/.claude/skills/run-attacks/SKILL.md b/.claude/skills/run-attacks/SKILL.md
@@ -0,0 +1,23 @@
+---
+name: run-attacks
+description: Run the pac_clip_sum membership inference attack test suite and summarize results.
+---
+
+## Context
+
+PAC (Probably Approximately Correct) privacy privatizes SQL aggregates via 64 parallel
+SWAR bitslice counters with MI-bounded noise. pac_clip_sum adds per-user contribution
+clipping using magnitude-level decomposition (4x bands, 2-bit shift) with distinct-contributor
+bitmaps. Unsupported outlier levels are hard-zeroed to prevent variance side-channel attacks.
+
+## Instructions
+
+1. Build if needed: `GEN=ninja make 2>&1 | tail -5`
+2. Run the main attack suite: `bash attacks/clip_attack_test.sh 2>/dev/null`
+3. Run the multi-row attack: `bash attacks/clip_multirow_test.sh 2>/dev/null`
+4. Run stress tests if available: `bash attacks/clip_hardzero_stress.sh 2>/dev/null`
+
+Summarize results as a table:
+- Attack scenario, clip_support value, attack accuracy, std_in, std_out, std ratio
+- Flag any accuracy above 60% as a potential regression
+- Compare to baselines in `attacks/clip_attack_results.md`
diff --git a/.claude/skills/shared b/.claude/skills/shared
diff --git a/.claude/skills/test-clip/SKILL.md b/.claude/skills/test-clip/SKILL.md
@@ -0,0 +1,11 @@
+---
+name: test-clip
+description: Build and run pac_clip_sum unit tests.
+---
+
+## Instructions
+
+1. Build: `GEN=ninja make 2>&1 | tail -5`
+2. Run clip_sum tests: `build/release/test/unittest "test/sql/pac_clip_sum*" 2>&1`
+3. Report: number of assertions passed/failed
+4. If any fail, show the failing test name and expected vs actual values
diff --git a/.gitmodules b/.gitmodules
@@ -9,3 +9,6 @@
 [submodule "benchmark/sqlstorm/SQLStorm"]
 	path = benchmark/sqlstorm/SQLStorm
 	url = https://github.com/SQL-Storm/SQLStorm.git
+[submodule ".claude/skills/shared"]
+	path = .claude/skills/shared
+	url = https://github.com/ila/duckdb-claude-skills.git