Merged

44 commits
c985d89  fixup! fixup! Add Claude Sonnet 4.6 to benchmark — now best no-tools …  (MaxGhenis, Feb 25, 2026)
dc3069f  Add interactive dashboard with Cosilico theme  (MaxGhenis, Feb 25, 2026)
6cd9bce  Migrate dashboard to Next.js and refresh benchmark app  (MaxGhenis, Mar 28, 2026)
ef2941f  Track dollar formatting helper  (MaxGhenis, Mar 28, 2026)
7c2ae88  Refine policybench benchmark pipeline  (MaxGhenis, Mar 29, 2026)
f508135  Repair partial batch answers  (MaxGhenis, Mar 29, 2026)
d9aa9c9  Increase Gemini Pro batch output budget  (MaxGhenis, Mar 29, 2026)
fcd73aa  Adopt bounded benchmark scoring  (MaxGhenis, Mar 29, 2026)
aca14f4  Expand benchmark app, diagnostics, and paper publishing  (MaxGhenis, Apr 4, 2026)
205c01b  Refresh rendered paper assets  (MaxGhenis, Apr 5, 2026)
6e40c72  Fix paper PDF render pipeline  (MaxGhenis, Apr 5, 2026)
095063d  Remove leaderboard leader callout  (MaxGhenis, Apr 5, 2026)
3548002  Compact hero and sticky nav for better above-the-fold UX  (MaxGhenis, Apr 5, 2026)
88ad74b  Fix all ruff lint errors (122 issues)  (MaxGhenis, Apr 5, 2026)
5d3ade9  Unified collapsing header with PolicyEngine branding  (MaxGhenis, Apr 5, 2026)
05ddb58  Smooth scroll-driven header collapse, "by PolicyEngine" branding  (MaxGhenis, Apr 5, 2026)
6888557  Tighten benchmark artifacts and analysis  (MaxGhenis, Apr 13, 2026)
5367657  Fix diagnostics explanation tooltip  (daphnehanse11, Apr 16, 2026)
28de8e7  Track main as the canonical CI branch  (daphnehanse11, Apr 16, 2026)
9504a62  Preserve pre-v2 main history on main  (daphnehanse11, Apr 16, 2026)
6b4dbb2  Use saved scenario manifests and guard benchmark resumes  (daphnehanse11, Apr 16, 2026)
7657e6a  Make benchmark answer extraction strict  (daphnehanse11, Apr 17, 2026)
29497ea  Fix ruff lint and format issues  (MaxGhenis, Apr 18, 2026)
d245687  Rebuild benchmark output contract  (MaxGhenis, Apr 25, 2026)
c88599c  Expand headline scope to net income components  (MaxGhenis, Apr 25, 2026)
1693950  Add prompt mode comparison script  (MaxGhenis, Apr 25, 2026)
44e7a03  Migrate PolicyBench runtime to policyengine.py  (MaxGhenis, Apr 26, 2026)
4273c41  Support UK employment income leaf inputs  (MaxGhenis, Apr 26, 2026)
f5c3668  Filter conditional prompt inputs  (MaxGhenis, Apr 27, 2026)
cadcd60  Add provider request wall timeout  (MaxGhenis, Apr 27, 2026)
27d329c  Omit aggregate net worth from prompts  (MaxGhenis, Apr 27, 2026)
15fa09e  Clarify benchmark prompt contract  (MaxGhenis, Apr 27, 2026)
913ed76  Rename UK transfer scenario source  (MaxGhenis, Apr 27, 2026)
e2c9daa  Suppress prior self-employment sentinel input  (MaxGhenis, Apr 27, 2026)
d46daf4  Omit prior-year inputs from prompts  (MaxGhenis, Apr 27, 2026)
cf48728  Remove noisy prompt inputs  (MaxGhenis, Apr 27, 2026)
9d3a88e  Prefer current spec prompt descriptions  (MaxGhenis, Apr 27, 2026)
4b47716  Remove v1 benchmark code  (MaxGhenis, Apr 27, 2026)
a2d5287  Scale response budget for expanded outputs  (MaxGhenis, Apr 27, 2026)
9486494  Publish UK 100-household benchmark  (MaxGhenis, Apr 30, 2026)
9b1a968  Publish 100-household US and UK benchmark results  (MaxGhenis, May 1, 2026)
fef6d94  Keep DeepSeek experimental only  (MaxGhenis, May 1, 2026)
59b17c7  Tighten explanation consistency contract  (MaxGhenis, May 1, 2026)
db73c84  Clean benchmark artifacts and freeze paper snapshot  (MaxGhenis, May 2, 2026)
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -2,7 +2,7 @@ name: CI

on:
push:
branches: [main, v2]
branches: [main]
pull_request:
branches: [main]

15 changes: 15 additions & 0 deletions .gitignore
@@ -1,6 +1,21 @@
# LiteLLM disk cache
.policybench_cache/

# Local scratch benchmark outputs. Frozen snapshots and batch exports live
# elsewhere under results/ and can still be committed intentionally.
results/local/
results/analysis/
results/no_tools/
results/**/rerun_*/
results/reference_outputs.csv
results/ground_truth.csv
results/scenarios.csv

# Paper render scratch outputs. The served paper artifacts live under
# app/public/paper/ and frozen manuscript inputs live under paper/snapshot/.
paper/out/
paper/figures/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
7 changes: 3 additions & 4 deletions CLAUDE.md
@@ -9,7 +9,7 @@ ruff format . # Format
```

## Architecture
- **Two conditions**: AI alone (EDSL) vs AI with PE tools (LiteLLM)
- **One condition**: AI alone (no tools)
- **Ground truth**: policyengine-us Simulation (sketched below)
- **TDD**: Write tests first, then implement
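
As a minimal sketch of the ground-truth path, assuming the standard policyengine-us situation format (entity structure abbreviated here, and worth checking against `policybench/ground_truth.py`):

```python
from policyengine_us import Simulation

# One-person household; a sketch, not the benchmark's actual scenario builder.
situation = {
    "people": {"person": {"employment_income": {"2026": 30_000}}},
    "tax_units": {"tax_unit": {"members": ["person"]}},
    "households": {
        "household": {"members": ["person"], "state_code": {"2026": "CA"}},
    },
}
sim = Simulation(situation=situation)
income_tax = sim.calculate("income_tax", 2026)  # reference output for scoring
```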

@@ -18,12 +18,11 @@ ruff format . # Format
- `policybench/scenarios.py` — Household scenario generation
- `policybench/ground_truth.py` — PE-US calculations
- `policybench/prompts.py` — Natural language prompt templates
- `policybench/eval_no_tools.py` — EDSL-based AI-alone benchmark
- `policybench/eval_with_tools.py` — LiteLLM tool-calling benchmark
- `policybench/eval_no_tools.py` — LiteLLM-based AI-alone benchmark
- `policybench/analysis.py` — Metrics and reporting

## Testing
- All tests mock external calls (EDSL, LiteLLM, PE-US API)
- All tests mock external calls (LiteLLM, PE-US API); see the sketch below
- `pytest -m "not slow"` to skip slow tests
- Full benchmark runs are manual and expensive
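
A minimal sketch of the mocking pattern; the patch target and response shape below are illustrative assumptions, not the repo's actual test fixtures:

```python
from unittest.mock import patch

# Hypothetical patch target; point it at wherever the eval module calls LiteLLM.
@patch("policybench.eval_no_tools.litellm.completion")
def test_eval_uses_mocked_llm(mock_completion):
    mock_completion.return_value = {
        "choices": [{"message": {"content": '{"income_tax": 4500}'}}]
    }
    # ... invoke the eval path and assert on parsed predictions ...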

79 changes: 61 additions & 18 deletions README.md
@@ -1,23 +1,43 @@
# PolicyBench

Can AI models accurately calculate tax and benefit outcomes without tools?
How well can frontier models calculate tax and benefit outcomes without tools?

PolicyBench measures how well frontier AI models estimate US tax/benefit values for specific households — both **without tools** (pure reasoning) and **with PolicyEngine tools** (API access to ground truth).
PolicyBench measures how well frontier AI models estimate selected household tax
and benefit outputs without tools.

## Conditions
For benchmark scope, snapshot policy, and terminology, see the
[benchmark card](docs/benchmark_card.md).

US benchmark scenarios are sampled from Enhanced CPS households and evaluated
under tax year 2026 rules with PolicyEngine-US. The public UK path uses a
UK-calibrated transfer dataset and PolicyEngine-UK reference outputs for fiscal
year 2026-27.

## Condition

1. **AI alone**: Models estimate tax/benefit values using only their training knowledge
2. **AI with PolicyEngine**: Models use a PolicyEngine tool to compute exact answers

## Models tested
## Benchmark scope

- Claude (Opus 4.6, Sonnet 4.5)
- GPT (4o, o3)
- Gemini 2.5 Pro
Benchmark outputs are defined in `policybench/benchmark_specs.json`. New CLI
runs default to `headline`, which focuses the main ranking on person- or
household-facing outputs that contribute to household net income. PolicyEngine
variables may be native to lower-level entities, but benchmark outputs are
either expanded to people shown in the prompt or aggregated to the household
before scoring. Coverage outputs are binary flags in the headline ranking; the
separate household-equal impact score uses PolicyEngine value proxies to give
those flags a dollar-scale weight. Intermediate tax bases and payroll
subcomponents are excluded from the headline ranking. WIC is requested as
person-level eligibility, not as a dollar amount.
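
As a concrete illustration, the headline set can be read straight from the spec file. This is a sketch only: the `outputs` and `program_sets` field names are assumptions for illustration, not necessarily the actual `benchmark_specs.json` schema.

```python
import json

# Sketch: "outputs" and "program_sets" are assumed field names.
with open("policybench/benchmark_specs.json") as f:
    specs = json.load(f)

headline_outputs = [
    spec["variable"]
    for spec in specs["outputs"]
    if "headline" in spec.get("program_sets", [])
]
print(f"{len(headline_outputs)} headline outputs: {headline_outputs}")
```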

## Programs evaluated

Federal tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state income tax, net income, marginal tax rates, and more.
The current public release covers selected federal taxes, credits, benefits,
health-related support, coverage labels, and state-tax outputs in the US, plus
selected tax and transfer outputs in the UK. US federal income tax is scored as
a compact decomposition: tax after nonrefundable credits and before refundable
credits, plus refundable federal credits excluding the ACA Premium Tax Credit.
The ACA Premium Tax Credit is scored separately as a health-related output.
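
Concretely, the decomposition yields two scored dollar outputs rather than one net figure. The values and names below are hypothetical, purely to show the arithmetic:

```python
# Hypothetical household, illustrative values only.
tax_after_nonrefundable_credits = 4_500.0      # before refundable credits
refundable_federal_credits = 3_200.0           # all refundable credits
aca_premium_tax_credit = 1_100.0               # scored separately (health)

scored_outputs = {
    "income_tax_before_refundable_credits": tax_after_nonrefundable_credits,
    "refundable_credits_excluding_ptc": refundable_federal_credits
    - aca_premium_tax_credit,  # 2,100.0
}
```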

## Quick start

@@ -26,18 +26,41 @@ pip install -e ".[dev]"
pytest # Run tests (mocked, no API calls)
```

## Full benchmark
## Benchmark run

```bash
# Generate ground truth from PolicyEngine-US
policybench ground-truth
# Generate reference outputs for 100 sampled households using headline outputs
policybench reference-outputs -n 100 --seed 42

# Run AI-alone evaluations on the exported scenario manifest.
# The standard response contract includes numeric answers and explanations.
policybench eval-no-tools -n 100 --seed 42

# For larger runs, use resumable per-model chunks.
policybench eval-no-tools-chunked \
--scenario-manifest results/local/scenarios.csv \
--output-dir results/local/no_tools_chunked \
--country us \
--chunk-size 10 \
--parallel 2

# Run AI-alone evaluations
policybench eval-no-tools
# Analyze local results and export local artifacts
policybench analyze --output-dir results/local/analysis
```

## Repeated runs

# Run AI-with-tools evaluations
policybench eval-with-tools
```bash
# Optional: run the same benchmark multiple times on the saved scenario manifest
policybench eval-no-tools-repeated -n 100 --seed 42 --repeats 3 -o results/local/no_tools/runs

# Analyze results
policybench analyze
# Analyze the canonical point estimate plus across-run stability
policybench analyze --runs-dir results/local/no_tools/runs --output-dir results/local/analysis
```
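
Across-run stability can be summarized however suits the analysis; here is a minimal pandas sketch, where the run directory layout and column names are assumptions rather than the predictions CSV contract:

```python
import pandas as pd

# Sketch: spread of each model's mean absolute error across repeated runs.
# "prediction", "reference_value", and "model" are assumed column names.
runs = [
    pd.read_csv(f"results/local/no_tools/runs/run_{i}/predictions.csv")
    for i in range(3)
]
per_run_mae = [
    (run["prediction"] - run["reference_value"]).abs().groupby(run["model"]).mean()
    for run in runs
]
stability = pd.concat(per_run_mae, axis=1).std(axis=1)
print(stability.sort_values())
```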

`policybench reference-outputs` writes PolicyEngine reference outputs, not
administrative truth. It also writes `results/local/scenarios.csv`, and the eval
commands reuse that manifest by default instead of regenerating households from
the current source dataset. Prediction CSVs also get a `.meta.json` sidecar so
resumes only happen against the exact same manifest, model set, and program set.
`policybench ground-truth` remains as a compatibility alias.
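
The resume guard works roughly like the sketch below; the sidecar filename convention and field names are assumptions, not the exact `.meta.json` contract.

```python
import json
from pathlib import Path

# Sketch of the resume guard idea: only resume when the recorded
# provenance matches the requested run exactly.
def can_resume(predictions_csv: str, manifest_path: str, models: list[str]) -> bool:
    sidecar = Path(f"{predictions_csv}.meta.json")
    if not sidecar.exists():
        return False  # no provenance recorded; start fresh
    meta = json.loads(sidecar.read_text())
    return (
        meta.get("scenario_manifest") == manifest_path
        and sorted(meta.get("models", [])) == sorted(models)
    )
```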
139 changes: 73 additions & 66 deletions RESULTS.md
@@ -1,71 +1,78 @@
# PolicyBench: AI can't accurately calculate taxes and benefits — but tools fix that

> Can frontier AI models accurately calculate US tax and benefit outcomes?

**TL;DR: No — but with PolicyEngine tools, they achieve 100% accuracy.**

## Setup

- **100 household scenarios** across 12 US states, varying income ($0–$500k), filing status, and family composition
- **14 tax/benefit programs**: federal income tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state taxes, and more
- **4 frontier models**: GPT-5.2, Claude Sonnet 4.5, Claude Sonnet 4.6, Claude Opus 4.6
- **2 conditions**: AI alone (parametric knowledge only) vs. AI with PolicyEngine tools
- **Ground truth**: PolicyEngine-US microsimulation (1,400 scenario-program pairs)
- **Total predictions**: 9,800 (5,600 no-tools + 4,200 with-tools)

## Headline results

### Without tools (AI alone)

| Model | MAE | MAPE | Within 10% |
|:------|----:|-----:|----------:|
| Claude Sonnet 4.6 | $1,285 | 52% | 72.3% |
| Claude Opus 4.6 | $1,257 | 85% | 70.8% |
| GPT-5.2 | $2,578 | 78% | 62.1% |
| Claude Sonnet 4.5 | $2,276 | 125% | 61.9% |

### With PolicyEngine tools

| Model | MAE | MAPE | Within 10% |
|:------|----:|-----:|----------:|
| Claude Opus 4.6 | **$0** | **0%** | **100.0%** |
| Claude Sonnet 4.5 | **$0** | **0%** | **100.0%** |
| GPT-5.2 | **$0** | **0%** | **100.0%** |

### By program (AI alone, all models averaged)

| Program | MAE | MAPE | Within 10% |
|:--------|----:|-----:|----------:|
| Federal income tax | $4,234 | 54% | 41.0% |
| Income tax before credits | $2,683 | 39% | 62.7% |
| EITC | $727 | 298% | 75.3% |
| CTC | $1,028 | 174% | 74.3% |
| Refundable credits | $981 | 128% | 62.3% |
| SNAP | $769 | 55% | 80.7% |
| SSI | $436 | 100% | 95.7% |
| State income tax | $938 | 76% | 59.7% |
| Household net income | $10,586 | 14% | 66.0% |
| Total benefits | $5,228 | 117% | 43.7% |
| Market income | $0 | 0% | 100.0% |
| Marginal tax rate | $347 | N/A | 18.0% |

## Key takeaways

1. **Tools > models.** Every model with PolicyEngine (100% accuracy) vastly outperforms every model without it (62–72%). The choice of computational tool matters more than the choice of frontier model.

2. **AI alone is unreliable for policy calculations.** Even the best model (Claude Sonnet 4.6) averages $1,285 error per calculation and gets only 72% of answers within 10% of correct. The worst programs — income tax (41%), marginal tax rates (18%), and aggregate benefits (44%) — are precisely where accuracy matters most.

3. **With tools, accuracy is perfect.** All three tested models achieve $0 MAE and 100% within-10% accuracy across all 4,200 with-tools predictions. The tool returns ground truth, and models faithfully report it.

4. **Newer models are improving, but not enough.** Claude Sonnet 4.6 improved significantly over 4.5 (72% vs 62% within 10%), but still falls far short of the 100% achievable with tools. Model improvements can't substitute for computational tools.

5. **Marginal tax rates are nearly impossible without tools.** Only 18% of AI-alone predictions are within 10% of the correct marginal rate. This makes AI-generated policy advice about work incentives unreliable without computational backing.

6. **The benchmark validates PolicyEngine's value proposition.** Any AI system that needs to answer questions about US taxes and benefits should use PolicyEngine rather than relying on parametric knowledge.
# PolicyBench Results

PolicyBench is a no-tools benchmark. Ad hoc local outputs should live under
`results/local/` after a benchmark run. Published leaderboard claims should
instead point to dated batch directories or to a committed dashboard export
such as `app/src/data.json`.

## Run

```bash
policybench reference-outputs -n 100 --seed 42
policybench eval-no-tools -n 100 --seed 42
policybench analyze --output-dir results/local/analysis
```

The first command writes PolicyEngine reference outputs, not administrative
truth. `policybench ground-truth` remains as a compatibility alias.

## Full runbook

Use a dated batch directory and keep model outputs per country and model so
interrupted runs can resume independently.

```bash
RUN_DIR=results/full_batch_20260501

policybench reference-outputs -n 1000 --seed 42 --country us --program-set headline \
-o "$RUN_DIR/us/reference_outputs.csv" \
--scenario-manifest-output "$RUN_DIR/us/scenarios.csv"

policybench reference-outputs -n 1000 --seed 42 --country uk --program-set headline \
-o "$RUN_DIR/uk/reference_outputs.csv" \
--scenario-manifest-output "$RUN_DIR/uk/scenarios.csv"

for country in us uk; do
for model in claude-opus-4.7 claude-sonnet-4.6 claude-haiku-4.5 \
grok-4.3 grok-4.20 grok-4.1-fast gpt-5.5 gpt-5.4-mini gpt-5.4-nano \
gemini-3.1-pro-preview gemini-3-flash-preview \
gemini-3.1-flash-lite-preview; do
policybench eval-no-tools-chunked \
--scenario-manifest "$RUN_DIR/$country/scenarios.csv" \
--output-dir "$RUN_DIR/$country/no_tools_chunked" \
--country "$country" \
--model "$model" \
--program-set headline \
--chunk-size 50 \
--parallel 4
done
done

for country in us uk; do
mkdir -p "$RUN_DIR/$country/by_model"
cp "$RUN_DIR/$country/no_tools_chunked/by_model/"*.csv "$RUN_DIR/$country/by_model/"
done

python scripts/export_full_run.py --run-dir "$RUN_DIR"
```

For first-pass cost control, run the same commands with `-n 100` in a separate
scratch directory before launching the 1,000-household batch.

## Artifacts

- `results/local/reference_outputs.csv`
- `results/local/no_tools/predictions.csv`
- `results/local/analysis/metrics.csv`
- `results/local/analysis/summary_by_model.csv`
- `results/local/analysis/summary_by_variable.csv`
- `results/local/analysis/impact_summary_by_model.csv`
- `results/local/analysis/usage_summary.csv`
- `results/local/analysis/report.md`

## Methodology

See the [full paper](docs/) and [benchmark code](policybench/) for complete methodology. Ground truth is computed via [PolicyEngine-US](https://github.com/PolicyEngine/policyengine-us). All API responses are cached for reproducibility.
See the [full paper](docs/) and [benchmark code](policybench/) for complete methodology. Reference outputs are computed via [PolicyEngine-US](https://github.com/PolicyEngine/policyengine-us) and [PolicyEngine-UK](https://github.com/PolicyEngine/policyengine-uk); the public UK scenarios use PolicyBench's calibrated transfer dataset. LLM responses are cached for reproducibility.
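
The cache layer is LiteLLM's disk cache (the `.gitignore` tracks it at `.policybench_cache/`); the sketch below shows the general request-hash idea rather than the actual implementation.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".policybench_cache")

def cached_completion(model: str, prompt: str, call_api) -> str:
    # Key the response by a stable hash of the request.
    key = hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["response"]
    response = call_api(model, prompt)
    CACHE_DIR.mkdir(exist_ok=True)
    path.write_text(json.dumps({"response": response}))
    return response
```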

---
*[Cosilico](https://cosilico.ai) · [PolicyEngine](https://policyengine.org)*
*[PolicyEngine](https://policyengine.org) · [PolicyBench](https://policybench.org)*
5 changes: 5 additions & 0 deletions app/.gitignore
@@ -10,6 +10,9 @@ lerna-debug.log*
node_modules
dist
dist-ssr
.next
out
*.tsbuildinfo
*.local

# Editor directories and files
@@ -22,3 +25,5 @@
*.njsproj
*.sln
*.sw?
.vercel
.env*.local
9 changes: 9 additions & 0 deletions app/.vercelignore
@@ -0,0 +1,9 @@
.git
.gitignore
.next
dist
node_modules
output
playwright-report
coverage
*.log