Merged
29 commits
948a504
Add pac_clip_sum aggregate with clipping support
peterboncz Mar 23, 2026
96da519
instructions for genai
peterboncz Mar 24, 2026
5067f46
Hard-zero unsupported outlier levels in pac_clip_sum
ila Mar 24, 2026
a377f07
Reduce pac_clip_sum level width from 16x to 4x (shift=4 → shift=2)
ila Mar 24, 2026
f87f9f8
Fix pac_clip_sum test: adjust soft-clamp assertion for hard-zero beha…
ila Mar 24, 2026
160bff1
Extend pac_clip_sum to 62 levels for full HUGEINT support
ila Mar 24, 2026
3d60b74
Add tests for level boundaries, HUGEINT clipping, over-clipping, mult…
ila Mar 24, 2026
a1422de
Add CLAUDE.md with development rules and project guidance
ila Mar 24, 2026
864a696
Add attack scripts and evaluation results for pac_clip_sum
ila Mar 24, 2026
d5d2243
Update CLAUDE.md, add hooks, skills, and permissions
ila Mar 24, 2026
cf1cebe
Merge branch 'pac_clip' of github.com:cwida/pac into pac_clip
peterboncz Mar 24, 2026
524e461
Add pac_clip_sum aggregate with clipping support
peterboncz Mar 23, 2026
c848241
Hard-zero unsupported outlier levels in pac_clip_sum
ila Mar 24, 2026
9d6bb04
Reduce pac_clip_sum level width from 16x to 4x (shift=4 → shift=2)
ila Mar 24, 2026
4aa4e8f
Fix pac_clip_sum test: adjust soft-clamp assertion for hard-zero beha…
ila Mar 24, 2026
ce56963
Extend pac_clip_sum to 62 levels for full HUGEINT support
ila Mar 24, 2026
cd38190
Add tests for level boundaries, HUGEINT clipping, over-clipping, mult…
ila Mar 24, 2026
051dc4a
Add CLAUDE.md with development rules and project guidance
ila Mar 24, 2026
3188709
Add attack scripts and evaluation results for pac_clip_sum
ila Mar 24, 2026
8d7b944
Update CLAUDE.md, add hooks, skills, and permissions
ila Mar 24, 2026
783008c
Add metadata file documentation to explain-pac-ddl skill
ila Mar 24, 2026
f8e192b
Update PAC and DP skills with formal definitions and theory
ila Mar 25, 2026
7e7688f
Add pac_clip_min_max and float/double support for clip aggregates
ila Apr 1, 2026
ee63b59
Add shared Claude Code skills submodule
ila Apr 1, 2026
ef0da79
Merge branch 'pac_clip' of github.com:cwida/pac into pac_clip
peterboncz Apr 2, 2026
b0c3f01
(I made some git mess and some of the below changes where part of the…
peterboncz Apr 2, 2026
5135553
rename PAC2_ into CLIP_
peterboncz Apr 2, 2026
00beaa3
make format-fix
peterboncz Apr 2, 2026
bf4f404
more attacks
ila Apr 2, 2026
17 changes: 17 additions & 0 deletions .claude/settings.json
@@ -0,0 +1,17 @@
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit",
        "hooks": [
          {
            "type": "command",
            "command": "make format-fix 2>/dev/null || true",
            "timeout": 30,
            "statusMessage": "Running format-fix..."
          }
        ]
      }
    ]
  }
}
82 changes: 82 additions & 0 deletions .claude/skills/explain-dp/SKILL.md
@@ -0,0 +1,82 @@
---
name: explain-dp
description: Reference material for differential privacy concepts. Auto-loaded when discussing privacy, attacks, sensitivity, or clipping.
---

## Differential Privacy (DP)

### Definition

A randomized mechanism M satisfies (ε,δ)-differential privacy if for all
neighboring datasets D, D' (differing in one individual) and all outputs S:

P[M(D) ∈ S] ≤ e^ε · P[M(D') ∈ S] + δ

Smaller ε = stronger privacy. δ is a small additive slack: informally, it bounds the probability that the ε guarantee fails.

### Key concepts

- **Sensitivity**: Maximum change in query output when one individual is
added/removed. For SUM with values in [L,U]: sensitivity = max(|L|,|U|) under
add/remove neighbors (U-L if one individual's value is replaced instead).
- **Laplace mechanism**: Add Laplace(0, sensitivity/ε) noise. Standard for counting queries.
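- **Gaussian mechanism**: Add N(0, sensitivity²·2ln(1.25/δ)/ε²) noise (the second argument is the variance; sensitivity is measured in L2 norm and the classic analysis assumes ε < 1). Better for composition.
- **Composition**: Running k queries on the same data costs k·ε total (basic),
or O(√(k·ln(1/δ'))·ε) with advanced composition, at the price of an extra δ' of failure probability.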
- **Post-processing**: Any function of a DP output is still DP. Free to clip/transform after noise.
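
The noise scales in the list above can be written out directly. The following is a minimal sketch; the function names are illustrative and not part of the PAC extension. The Gaussian formula is the classic one quoted above.

```python
import math


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Scale b of the Laplace(0, b) noise used by the Laplace mechanism."""
    return sensitivity / epsilon


def gaussian_sigma(sensitivity: float, epsilon: float, delta: float) -> float:
    """Standard deviation of the classic Gaussian mechanism.

    This is the square root of sensitivity^2 * 2 ln(1.25/delta) / epsilon^2.
    """
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def basic_composition(epsilon: float, k: int) -> float:
    """Total epsilon spent by k queries under basic composition."""
    return k * epsilon


if __name__ == "__main__":
    print(laplace_scale(sensitivity=10.0, epsilon=0.5))
    print(gaussian_sigma(sensitivity=10.0, epsilon=0.5, delta=1e-5))
    print(basic_composition(epsilon=0.5, k=4))
```

Halving ε doubles both noise scales, which is why a large sensitivity (for example from outliers) is so costly.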

### Membership Inference Attack (MIA)

The adversary's game: given a query result, determine whether a specific individual
is in the dataset. Attack accuracy = fraction of correct guesses across trials.
50% = random (DP working). >50% = information leakage.
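
A toy simulation can make the game concrete. This sketch assumes a SUM query, Gaussian noise, and a simple threshold adversary who knows everyone else's total; all parameter values are made up for illustration and it is not the attack code under `attacks/`.

```python
import random


def mia_accuracy(noise_std, target_value=100.0, others_sum=5000.0,
                 trials=2000, seed=0):
    """Fraction of trials in which a threshold adversary guesses membership."""
    rng = random.Random(seed)
    # The adversary splits the difference between the "in" and "out" worlds.
    threshold = others_sum + target_value / 2.0
    correct = 0
    for _ in range(trials):
        is_member = rng.random() < 0.5  # prior = 50%
        true_sum = others_sum + (target_value if is_member else 0.0)
        released = true_sum + rng.gauss(0.0, noise_std)
        guess_member = released > threshold
        correct += int(guess_member == is_member)
    return correct / trials


if __name__ == "__main__":
    for std in (0.0, 50.0, 500.0):
        print(std, mia_accuracy(std))
```

With no noise the adversary is always right; as the noise grows relative to the target's value, accuracy drifts toward 50%.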

### Bounded user contribution (Wilson et al. 2019)

Standard approach for DP SQL:
1. GROUP BY user_id → compute per-user contribution
2. Clip each user's contribution to [L, U]
3. Sum clipped contributions
4. Add noise calibrated to the sensitivity of the clipped sum (determined by the bounds L and U)

This handles both single-large-value outliers and many-small-values users.
Reference: "Differentially Private SQL with Bounded User Contribution" (Google).
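
The four steps can be sketched as follows. This is an illustrative sketch, not Google's or this repository's implementation. It assumes contributions are clipped to [0, U], so that U-L and max(|L|,|U|) coincide, and it uses Laplace noise.

```python
import random
from collections import defaultdict


def clipped_user_sum(rows, upper):
    """Steps 1-3: group by user, clip each user's total to [0, upper], sum."""
    per_user = defaultdict(float)
    for user_id, value in rows:
        per_user[user_id] += value
    return sum(min(max(total, 0.0), upper) for total in per_user.values())


def laplace_noise(scale, rng):
    """The difference of two iid exponentials with mean `scale` is Laplace(0, scale)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_sum(rows, upper, epsilon, seed=None):
    """Step 4: add noise calibrated to the clipping bound."""
    rng = random.Random(seed)
    return clipped_user_sum(rows, upper) + laplace_noise(upper / epsilon, rng)


if __name__ == "__main__":
    rows = [("a", 5.0), ("a", 7.0), ("b", 3.0), ("c", 500.0)]
    print(clipped_user_sum(rows, upper=10.0))
    print(dp_sum(rows, upper=10.0, epsilon=1.0, seed=1))
```

User `a` exceeds the bound through two small rows and user `c` through one large row; both end up contributing exactly the bound.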

### How PAC differs from DP

| | DP | PAC |
|---|---|---|
| **Guarantee type** | Input-independent (worst-case) | Instance-dependent (distribution D) |
| **Noise calibration** | Sensitivity s → noise ∝ s/ε | Variance σ² → noise ∝ σ²/(2β) |
| **White-boxing** | Required (analyze algorithm) | Not needed (black-box simulation) |
| **Composition** | k queries → k·ε (basic) | k queries → Σ MIᵢ (linear, Theorem 2) |
| **Privacy metric** | ε (log-likelihood ratio) | MI (mutual information, in nats) |
| **Conversion** | MI=1/128 ≈ ε=0.25 for prior=50% | See Table 3.2 in thesis |
| **Stable algorithms** | Same noise regardless | Less noise automatically |
| **Outlier impact** | Sensitivity explodes | Variance explodes (same practical problem) |

Key insight: PAC guarantees are **loose**: the theoretical bound on MIA success
rate is conservative, and empirical attacks achieve lower success than the bound
predicts. In practice the bound is therefore hard to violate.

### Input clipping (Winsorization)

Clip individual values to [μ-tσ, μ+tσ] before aggregation. Reduces sensitivity.
Well-established in DP literature. Limitations: doesn't catch users with many
small values (need per-user contribution clipping instead).
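
A minimal sketch of value-level clipping, assuming μ and σ are the mean and population standard deviation of the values themselves (the function name and sample data are illustrative):

```python
import statistics


def winsorize(values, t):
    """Clip every value to [mu - t*sigma, mu + t*sigma]."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    lower, upper = mu - t * sigma, mu + t * sigma
    return [min(max(v, lower), upper) for v in values]


if __name__ == "__main__":
    values = [10.0, 12.0, 11.0, 9.0, 1000.0]
    print(winsorize(values, t=1.0))
```

Note that the outlier itself inflates both μ and σ here, so the clipped value stays far from the bulk of the data. Each value is also clipped independently, so a user who spreads a large total over many small rows passes through untouched.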

### Privacy-conscious design

Rather than post-hoc privatization (build algorithm, then add noise), PAC enables
**privacy-conscious design**: optimize algorithm parameters jointly with
the privacy budget.

Key result: For a privatized estimator with budget B:
MSE = Bias² + (1/(2B) + 1) · Var + error

This means privatization adds an extra Var/(2B) term, inflating the variance
term by a factor of 1 + 1/(2B). At tight budgets (small B), the optimal
algorithm shifts toward lower-variance (higher-bias) models, e.g. stronger
regularization in ridge regression.

For databases: this suggests that queries producing high-variance outputs (due to
outliers, small groups, etc.) are inherently harder to privatize. Clipping reduces
variance and thus the noise needed, improving the privacy-utility tradeoff.
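
The MSE formula above can be evaluated for a few budgets to see the shift toward low-variance estimators. This is a small sketch that drops the `error` term; the two "models" and their bias and variance values are hypothetical numbers chosen only for illustration.

```python
def variance_inflation(budget: float) -> float:
    """Factor applied to Var in the privatized MSE: 1 + 1/(2B)."""
    return 1.0 + 1.0 / (2.0 * budget)


def privatized_mse(bias: float, variance: float, budget: float) -> float:
    """Bias^2 + (1/(2B) + 1) * Var, ignoring the residual error term."""
    return bias ** 2 + variance_inflation(budget) * variance


if __name__ == "__main__":
    # Hypothetical estimators: (bias, variance).
    low_bias = (0.0, 1.0)
    low_variance = (2.0, 0.01)
    for budget in (8.0, 1 / 128):
        print(budget,
              privatized_mse(*low_bias, budget),
              privatized_mse(*low_variance, budget))
```

At a loose budget the unbiased estimator wins; at B = 1/128 the variance term is multiplied by 65 and the biased, low-variance estimator becomes the better choice.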
108 changes: 108 additions & 0 deletions .claude/skills/explain-pac-ddl/SKILL.md
@@ -0,0 +1,108 @@
---
name: explain-pac-ddl
description: Reference for PAC DDL syntax — PAC_KEY, PAC_LINK, PROTECTED, SET PU, and the parser. Auto-loaded when discussing table setup, privacy units, or protected columns.
---

## PAC DDL Overview

PAC extends SQL DDL with privacy annotations. The parser (`src/parser/pac_parser.cpp`,
`src/parser/pac_parser_helpers.cpp`) intercepts CREATE TABLE and ALTER TABLE statements
to extract PAC-specific clauses before forwarding to DuckDB.

### Privacy Unit (PU) table

The PU table is the entity being protected (e.g., customer). One row = one individual.

```sql
-- Mark a table as the privacy unit
ALTER TABLE customer ADD PAC_KEY (c_custkey);
ALTER TABLE customer SET PU;

-- Protect specific columns from direct projection
ALTER PU TABLE customer ADD PROTECTED (c_acctbal, c_name, c_address);
```

- `PAC_KEY (col)`: Designates the column(s) that uniquely identify a privacy unit.
Must be set before `SET PU`.
- `SET PU`: Marks the table as the privacy unit. After this, aggregates on linked
tables get PAC noise.
- `PROTECTED (col1, col2, ...)`: Columns that cannot be directly projected.
Aggregates (SUM, COUNT, AVG) on protected columns go through PAC.

### Linking tables to the PU

Non-PU tables reference the PU table via foreign-key-like links:

```sql
ALTER TABLE orders ADD PAC_LINK (o_custkey) REFERENCES customer (c_custkey);
ALTER TABLE lineitem ADD PAC_LINK (l_orderkey) REFERENCES orders (o_orderkey);
```

- `PAC_LINK (local_col) REFERENCES table(ref_col)`: Declares how to join this
table back to the PU. The compiler uses these links to inject the PU hash
into the query plan.
- Links can be chained: `lineitem → orders → customer`.

### CREATE TABLE syntax (inline)

PAC clauses can be inlined in CREATE TABLE:

```sql
CREATE PU TABLE employees (
    id INTEGER,
    department VARCHAR,
    salary DECIMAL(10,2),
    PAC_KEY (id),
    PROTECTED (salary)
);
```

The parser strips PAC_KEY, PAC_LINK, and PROTECTED clauses from the CREATE
statement, forwards the clean SQL to DuckDB, then applies the PAC metadata
via ALTER TABLE internally.

### Common mistakes

- `PAC_LINK(col, table, ref)` — wrong. Use `PAC_LINK (col) REFERENCES table(ref)`.
- `PROTECTED salary` — wrong. Must have parentheses: `PROTECTED (salary)`.
- Altering a table that is already a PU requires `ALTER PU TABLE`, not `ALTER TABLE`.

### Metadata files

PAC metadata (PU tables, links, protected columns) is stored in JSON sidecar files
next to the database file. The naming convention is:

```
pac_metadata_<dbname>_<schema>.json
```

For example, `tpch_sf1.db` produces `pac_metadata_tpch_sf1_main.json` in the same
directory.
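
For scripts that need to locate or clean up the sidecar file, the naming convention can be reproduced as below. This is a sketch of the convention as described here, not code from the extension; `pac_metadata_path` is a hypothetical helper name and the default schema `main` is taken from the example above.

```python
from pathlib import Path


def pac_metadata_path(db_path, schema="main"):
    """Return the sidecar path pac_metadata_<dbname>_<schema>.json next to the DB."""
    db = Path(db_path)
    return db.with_name(f"pac_metadata_{db.stem}_{schema}.json")


if __name__ == "__main__":
    sidecar = pac_metadata_path("/data/tpch_sf1.db")
    print(sidecar)
    # Useful before recreating a database: check for stale metadata.
    print(sidecar.exists())
```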

**Auto-loading**: When the PAC extension loads (`LOAD pac`), it automatically looks
for a matching metadata file next to the attached database and loads it. No manual
`PRAGMA load_pac_metadata` needed for persistent databases.

**Saving**: After setting up PAC_KEY/PAC_LINK/PROTECTED, save with:
```sql
PRAGMA save_pac_metadata('/path/to/pac_metadata_mydb_main.json');
```

**Clearing**: Reset all in-memory PAC metadata:
```sql
PRAGMA clear_pac_metadata;
```

**Important**: If you delete or recreate a database file, also delete the
corresponding `pac_metadata_*.json` file. Stale metadata causes confusing errors
(references to tables/columns that no longer exist).

For in-memory databases, the metadata file is named `pac_metadata_memory_main.json`
and is placed in the current working directory.

### Key source files

- `src/parser/pac_parser.cpp` — main parser hook (intercepts SQL statements)
- `src/parser/pac_parser_helpers.cpp` — extraction of PAC_KEY, PAC_LINK, PROTECTED
- `src/core/pac_metadata.cpp` — in-memory metadata storage for PU/link/protected info
- `src/core/pac_extension.cpp` — auto-loading of metadata on extension load (LoadInternal)
98 changes: 98 additions & 0 deletions .claude/skills/explain-pac/SKILL.md
@@ -0,0 +1,98 @@
---
name: explain-pac
description: Reference material for PAC privacy internals. Auto-loaded when discussing PAC mechanism, noise, counters, or clipping.
---

## PAC Privacy Overview

PAC (Probably Approximately Correct) privacy is a framework for privatizing
algorithms with provable guarantees, described in [SIMD-PAC-DB](https://arxiv.org/abs/2603.15023).

### Formal definition

Given a data distribution D, a query Q satisfies (δ, ρ, D)-PAC Privacy if no
adversary who knows D can, after observing Q(X) where X ~ D, produce an
estimate X̂ such that ρ(X̂, X) = 1 with probability ≥ (1-δ).

The key insight: **noise scales with the variance of the algorithm's output across
random subsamples** of the data. Stable algorithms (low variance) need less noise.

### The 4-step privatization template

1. **Subsample**: Draw m random 50%-subsets X₁...Xₘ from the full dataset
2. **Compute**: Run the query Q on each subset → outputs y₁...yₘ
3. **Estimate noise**: Compute variance σ² across the yᵢ. Required noise: Δ = σ²/(2β)
where β is the MI budget
4. **Release**: Pick a random subset Xⱼ, return Q(Xⱼ) + N(0, Δ)

This is the theoretical foundation. SIMD-PAC-DB encodes this efficiently using
64 parallel counters (one per possible subset assignment bit).
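
The template can be sketched naively, without the 64-counter encoding. This sketch assumes exact half-size subsets, the population variance across outputs, and Δ used as the noise variance; `pac_release` is an illustrative name, not a function of the extension.

```python
import math
import random
import statistics


def pac_release(data, query, beta, m=64, seed=None):
    """Naive version of the 4-step template for a scalar query."""
    rng = random.Random(seed)
    half = len(data) // 2
    # Step 1: draw m random 50%-subsets.
    subsets = [rng.sample(data, half) for _ in range(m)]
    # Step 2: run the query on each subset.
    outputs = [query(subset) for subset in subsets]
    # Step 3: noise variance from the spread of the outputs.
    sigma2 = statistics.pvariance(outputs)
    noise_variance = sigma2 / (2.0 * beta)
    # Step 4: release one subset's output plus Gaussian noise.
    j = rng.randrange(m)
    return outputs[j] + rng.gauss(0.0, math.sqrt(noise_variance))


if __name__ == "__main__":
    data = [float(v) for v in range(1, 101)]
    print(pac_release(data, statistics.fmean, beta=1 / 128, seed=42))
```

A query whose output barely changes across subsamples gets a small σ² and therefore little noise; a perfectly constant query gets none.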

### MI → posterior success rate

| MI | Max posterior (prior=50%) | Max posterior (prior=25%) |
|----|--------------------------|--------------------------|
| 1/128 | 56.2% | 30.5% |
| 1/64 | 58.8% | 32.9% |
| 1/32 | 62.4% | 36.3% |
| 1/16 | 67.5% | 41.2% |
| 1/8 | 74.5% | 48.2% |
| 1/4 | 83.8% | 58.4% |
| 1/2 | 95.2% | 72.7% |
| 1 | 100% | 91.4% |
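
The table values are numerically consistent with the largest posterior p whose Bernoulli KL divergence from the prior stays within the MI budget (in nats). The sketch below assumes that relationship, which matches the rows above but is not stated explicitly in this file, and solves it by bisection.

```python
import math


def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) in nats, with 0*log(0) treated as 0."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def max_posterior(mi, prior):
    """Largest posterior p >= prior with KL(p || prior) <= mi."""
    if bernoulli_kl(1.0, prior) <= mi:
        return 1.0
    lo, hi = prior, 1.0
    for _ in range(100):  # KL is increasing in p on [prior, 1]
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mid, prior) <= mi:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    for mi in (1 / 128, 1 / 64, 1 / 32, 1 / 16, 1 / 8, 1 / 4, 1 / 2, 1.0):
        print(mi, round(max_posterior(mi, 0.5), 3), round(max_posterior(mi, 0.25), 3))
```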

### PAC Composition

For T adaptive queries with independent random sampling per query, the total
MI is bounded by the sum: MI(total) ≤ Σᵢ MIᵢ. This is linear composition —
each query's MI adds to the budget. The key requirement: **independent random
sampling per query** (each query uses a fresh random subset).

### PAC vs DP

- **DP**: input-independent guarantee. Requires white-boxing to compute sensitivity.
Noise ∝ sensitivity/ε. Works for worst-case neighboring datasets.
- **PAC**: instance-dependent guarantee. No white-boxing needed. Noise ∝ Var[Q(X)]/β.
Stable queries get less noise automatically. But the guarantee depends on the
data distribution D.

### Core mechanism (SIMD-PAC-DB implementation)

- Each aggregate maintains **64 parallel counters** (one per bit of a hashed key)
- Each row's value is added to ~32 counters (determined by pac_hash of the PU key)
- At finalization, noise calibrated to a **mutual information bound** (pac_mi) is
added, and the result is estimated from the counters
- PAC does NOT compute sensitivity (unlike differential privacy)
- The 64 counters encode m=64 possible subsets in one pass (SIMD-efficient)
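
A toy model of the counter update may help. Here `toy_pac_hash` is a hypothetical stand-in for `pac_hash` that always sets exactly 32 of 64 bits; the real hash function, the noise step, and the final estimator are not modeled.

```python
import random

NUM_COUNTERS = 64


def toy_pac_hash(pu_key):
    """Stand-in for pac_hash: a 64-bit mask with exactly 32 bits set."""
    rng = random.Random(pu_key)
    mask = 0
    for bit in rng.sample(range(NUM_COUNTERS), NUM_COUNTERS // 2):
        mask |= 1 << bit
    return mask


def accumulate(rows):
    """Add each row's value to every counter whose bit is set for its PU key."""
    counters = [0] * NUM_COUNTERS
    for pu_key, value in rows:
        mask = toy_pac_hash(pu_key)
        for j in range(NUM_COUNTERS):
            if (mask >> j) & 1:
                counters[j] += value
    return counters


if __name__ == "__main__":
    rows = [(key, 10 * key) for key in range(1, 101)]
    counters = accumulate(rows)
    # Each counter holds the SUM over one random ~50% subset of privacy units.
    print(min(counters), max(counters))
```

Because all rows of one privacy unit share the same mask, counter j always sees either all of a unit's rows or none of them, which is what makes each counter a per-subset aggregate.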

### SWAR bitslice encoding

- Counters are packed as 4 × uint16_t per uint64_t (SWAR = SIMD Within A Register)
- This enables processing 4 counters per instruction without actual SIMD intrinsics
- Overflow cascades to 32-bit overflow counters when 16-bit counters saturate
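
The packing idea can be illustrated with plain integer arithmetic. In this sketch, Python integers stand in for `uint64_t`, the `wide` list stands in for the 32-bit overflow counters, and the flush policy (spill everything before any lane could overflow) is an assumption made for simplicity rather than the extension's actual cascade logic.

```python
LANES = 4
LANE_BITS = 16
LANE_MAX = 0xFFFF
U64_MASK = (1 << 64) - 1


def spread(lane_mask, value):
    """Build a 64-bit word holding `value` in every lane selected by lane_mask."""
    word = 0
    for lane in range(LANES):
        if (lane_mask >> lane) & 1:
            word |= value << (LANE_BITS * lane)
    return word


def unpack(word):
    """Split a 64-bit word into its four 16-bit lanes."""
    return [(word >> (LANE_BITS * lane)) & LANE_MAX for lane in range(LANES)]


class SwarCounters:
    def __init__(self):
        self.word = 0            # 4 x 16-bit counters in one word
        self.wide = [0] * LANES  # wider overflow counters

    def flush(self):
        """Spill the packed lanes into the wide counters and reset them."""
        for lane, count in enumerate(unpack(self.word)):
            self.wide[lane] += count
        self.word = 0

    def add(self, lane_mask, value):
        """Add value (0..65535) to all selected lanes with one integer add."""
        if max(unpack(self.word)) + value > LANE_MAX:
            self.flush()  # avoid a carry leaking into the neighbouring lane
        self.word = (self.word + spread(lane_mask, value)) & U64_MASK

    def totals(self):
        return [w + c for w, c in zip(self.wide, unpack(self.word))]


if __name__ == "__main__":
    counters = SwarCounters()
    counters.add(0b0101, 3)
    counters.add(0b0011, 2)
    print(counters.totals())
```

The single addition in `add` updates all four lanes at once, which is the point of the encoding; the flush guards against a carry from one lane corrupting the next.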

### pac_clip_sum (contribution clipping)

- **Pre-aggregation**: Query rewriter inserts `GROUP BY pu_hash` to sum each user's
rows into a single contribution (handles the "50K small items" case)
- **Magnitude levels**: Values decomposed into levels (4x per level, 2-bit shift).
Level 0: 0-255, Level 1: 256-1023, Level 2: 1024-4095, etc.
- **Bitmap tracking**: Each level maintains a 64-bit bitmap of distinct contributors
(using birthday-paradox estimation from popcount)
- **Hard-zero**: Levels with fewer distinct contributors than `pac_clip_support`
contribute nothing to the result (prevents variance side-channel attacks)
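
The level and hard-zero logic can be sketched as follows. Several details here are assumptions made for illustration: the bitmap bit is taken as `pu_hash % 64`, the distinct-contributor estimate uses the standard linear-counting inversion of the popcount, and the input is one already pre-aggregated contribution per privacy unit. The 64 PAC counters and noise are not modeled.

```python
import math

BITMAP_BITS = 64


def magnitude_level(value):
    """Level 0: 0-255, level 1: 256-1023, level 2: 1024-4095, ... (4x per level)."""
    magnitude = abs(value)
    if magnitude < 256:
        return 0
    return (magnitude.bit_length() - 9) // 2 + 1


def estimate_distinct(bitmap):
    """Estimate distinct contributors from the popcount of a 64-bit bitmap."""
    set_bits = bin(bitmap).count("1")
    if set_bits >= BITMAP_BITS:
        return float("inf")  # bitmap saturated, estimate is unbounded
    return math.log(1 - set_bits / BITMAP_BITS) / math.log(1 - 1 / BITMAP_BITS)


def clip_sum(contributions, clip_support):
    """Sum per-PU contributions, hard-zeroing levels with too few contributors."""
    level_sums = {}
    level_bitmaps = {}
    for pu_hash, value in contributions.items():
        level = magnitude_level(value)
        level_sums[level] = level_sums.get(level, 0) + value
        level_bitmaps[level] = level_bitmaps.get(level, 0) | (1 << (pu_hash % BITMAP_BITS))
    return sum(total for level, total in level_sums.items()
               if estimate_distinct(level_bitmaps[level]) >= clip_support)


if __name__ == "__main__":
    contributions = {1: 100, 2: 120, 3: 90, 4: 10**6}  # pu_hash -> summed value
    print(clip_sum(contributions, clip_support=2))
```

In the example the single contributor at the 10^6 magnitude has no peers at its level, so with `clip_support=2` that level is dropped entirely instead of being softly clamped.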

### Key settings

- `pac_mi`: Mutual information bound (0 = deterministic/no noise)
- `pac_seed`: RNG seed for reproducible noise
- `pac_clip_support`: Minimum distinct contributors per magnitude level (NULL = disabled)
- `pac_hash_repair`: Ensure pac_hash outputs exactly 32 bits set

### DDL

```sql
ALTER TABLE customer ADD PAC_KEY (c_custkey);
ALTER TABLE customer SET PU;
ALTER TABLE orders ADD PAC_LINK (o_custkey) REFERENCES customer (c_custkey);
```
23 changes: 23 additions & 0 deletions .claude/skills/run-attacks/SKILL.md
@@ -0,0 +1,23 @@
---
name: run-attacks
description: Run the pac_clip_sum membership inference attack test suite and summarize results.
---

## Context

PAC (Probably Approximately Correct) privacy privatizes SQL aggregates via 64 parallel
SWAR bitslice counters with MI-bounded noise. pac_clip_sum adds per-user contribution
clipping using magnitude-level decomposition (4x bands, 2-bit shift) with distinct-contributor
bitmaps. Unsupported outlier levels are hard-zeroed to prevent variance side-channel attacks.

## Instructions

1. Build if needed: `GEN=ninja make 2>&1 | tail -5`
2. Run the main attack suite: `bash attacks/clip_attack_test.sh 2>/dev/null`
3. Run the multi-row attack: `bash attacks/clip_multirow_test.sh 2>/dev/null`
4. Run stress tests if available: `bash attacks/clip_hardzero_stress.sh 2>/dev/null`

Summarize results as a table:
- Attack scenario, clip_support value, attack accuracy, std_in, std_out, std ratio
- Flag any accuracy above 60% as a potential regression
- Compare to baselines in `attacks/clip_attack_results.md`
1 change: 1 addition & 0 deletions .claude/skills/shared
Submodule shared added at 9d673a
11 changes: 11 additions & 0 deletions .claude/skills/test-clip/SKILL.md
@@ -0,0 +1,11 @@
---
name: test-clip
description: Build and run pac_clip_sum unit tests.
---

## Instructions

1. Build: `GEN=ninja make 2>&1 | tail -5`
2. Run clip_sum tests: `build/release/test/unittest "test/sql/pac_clip_sum*" 2>&1`
3. Report: number of assertions passed/failed
4. If any fail, show the failing test name and expected vs actual values
3 changes: 3 additions & 0 deletions .gitmodules
@@ -9,3 +9,6 @@
[submodule "benchmark/sqlstorm/SQLStorm"]
path = benchmark/sqlstorm/SQLStorm
url = https://github.com/SQL-Storm/SQLStorm.git
[submodule ".claude/skills/shared"]
path = .claude/skills/shared
url = https://github.com/ila/duckdb-claude-skills.git