Merged

44 commits
c985d89  fixup! fixup! Add Claude Sonnet 4.6 to benchmark — now best no-tools …  (MaxGhenis, Feb 25, 2026)
dc3069f  Add interactive dashboard with Cosilico theme  (MaxGhenis, Feb 25, 2026)
6cd9bce  Migrate dashboard to Next.js and refresh benchmark app  (MaxGhenis, Mar 28, 2026)
ef2941f  Track dollar formatting helper  (MaxGhenis, Mar 28, 2026)
7c2ae88  Refine policybench benchmark pipeline  (MaxGhenis, Mar 29, 2026)
f508135  Repair partial batch answers  (MaxGhenis, Mar 29, 2026)
d9aa9c9  Increase Gemini Pro batch output budget  (MaxGhenis, Mar 29, 2026)
fcd73aa  Adopt bounded benchmark scoring  (MaxGhenis, Mar 29, 2026)
aca14f4  Expand benchmark app, diagnostics, and paper publishing  (MaxGhenis, Apr 4, 2026)
205c01b  Refresh rendered paper assets  (MaxGhenis, Apr 5, 2026)
6e40c72  Fix paper PDF render pipeline  (MaxGhenis, Apr 5, 2026)
095063d  Remove leaderboard leader callout  (MaxGhenis, Apr 5, 2026)
3548002  Compact hero and sticky nav for better above-the-fold UX  (MaxGhenis, Apr 5, 2026)
88ad74b  Fix all ruff lint errors (122 issues)  (MaxGhenis, Apr 5, 2026)
5d3ade9  Unified collapsing header with PolicyEngine branding  (MaxGhenis, Apr 5, 2026)
05ddb58  Smooth scroll-driven header collapse, "by PolicyEngine" branding  (MaxGhenis, Apr 5, 2026)
6888557  Tighten benchmark artifacts and analysis  (MaxGhenis, Apr 13, 2026)
5367657  Fix diagnostics explanation tooltip  (daphnehanse11, Apr 16, 2026)
28de8e7  Track main as the canonical CI branch  (daphnehanse11, Apr 16, 2026)
9504a62  Preserve pre-v2 main history on main  (daphnehanse11, Apr 16, 2026)
6b4dbb2  Use saved scenario manifests and guard benchmark resumes  (daphnehanse11, Apr 16, 2026)
7657e6a  Make benchmark answer extraction strict  (daphnehanse11, Apr 17, 2026)
29497ea  Fix ruff lint and format issues  (MaxGhenis, Apr 18, 2026)
d245687  Rebuild benchmark output contract  (MaxGhenis, Apr 25, 2026)
c88599c  Expand headline scope to net income components  (MaxGhenis, Apr 25, 2026)
1693950  Add prompt mode comparison script  (MaxGhenis, Apr 25, 2026)
44e7a03  Migrate PolicyBench runtime to policyengine.py  (MaxGhenis, Apr 26, 2026)
4273c41  Support UK employment income leaf inputs  (MaxGhenis, Apr 26, 2026)
f5c3668  Filter conditional prompt inputs  (MaxGhenis, Apr 27, 2026)
cadcd60  Add provider request wall timeout  (MaxGhenis, Apr 27, 2026)
27d329c  Omit aggregate net worth from prompts  (MaxGhenis, Apr 27, 2026)
15fa09e  Clarify benchmark prompt contract  (MaxGhenis, Apr 27, 2026)
913ed76  Rename UK transfer scenario source  (MaxGhenis, Apr 27, 2026)
e2c9daa  Suppress prior self-employment sentinel input  (MaxGhenis, Apr 27, 2026)
d46daf4  Omit prior-year inputs from prompts  (MaxGhenis, Apr 27, 2026)
cf48728  Remove noisy prompt inputs  (MaxGhenis, Apr 27, 2026)
9d3a88e  Prefer current spec prompt descriptions  (MaxGhenis, Apr 27, 2026)
4b47716  Remove v1 benchmark code  (MaxGhenis, Apr 27, 2026)
a2d5287  Scale response budget for expanded outputs  (MaxGhenis, Apr 27, 2026)
9486494  Publish UK 100-household benchmark  (MaxGhenis, Apr 30, 2026)
9b1a968  Publish 100-household US and UK benchmark results  (MaxGhenis, May 1, 2026)
fef6d94  Keep DeepSeek experimental only  (MaxGhenis, May 1, 2026)
59b17c7  Tighten explanation consistency contract  (MaxGhenis, May 1, 2026)
db73c84  Clean benchmark artifacts and freeze paper snapshot  (MaxGhenis, May 2, 2026)
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -2,7 +2,7 @@ name: CI

on:
push:
branches: [main, v2]
branches: [main]
pull_request:
branches: [main]

15 changes: 15 additions & 0 deletions .gitignore
@@ -1,6 +1,21 @@
# LiteLLM disk cache
.policybench_cache/

# Local scratch benchmark outputs. Frozen snapshots and batch exports live
# elsewhere under results/ and can still be committed intentionally.
results/local/
results/analysis/
results/no_tools/
results/**/rerun_*/
results/reference_outputs.csv
results/ground_truth.csv
results/scenarios.csv

# Paper render scratch outputs. The served paper artifacts live under
# app/public/paper/ and frozen manuscript inputs live under paper/snapshot/.
paper/out/
paper/figures/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
7 changes: 3 additions & 4 deletions CLAUDE.md
@@ -9,7 +9,7 @@ ruff format . # Format
```

## Architecture
- **Two conditions**: AI alone (EDSL) vs AI with PE tools (LiteLLM)
- **One condition**: AI alone (no tools)
- **Ground truth**: policyengine-us Simulation (sketched below)
- **TDD**: Write tests first, then implement
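
As a minimal sketch of the ground-truth path, assuming the standard policyengine-us situation format (entity structure abbreviated here, and worth checking against `policybench/ground_truth.py`):

```python
from policyengine_us import Simulation

# One-person household; a sketch, not the benchmark's actual scenario builder.
situation = {
    "people": {"person": {"employment_income": {"2026": 30_000}}},
    "tax_units": {"tax_unit": {"members": ["person"]}},
    "households": {
        "household": {"members": ["person"], "state_code": {"2026": "CA"}},
    },
}
sim = Simulation(situation=situation)
income_tax = sim.calculate("income_tax", 2026)  # reference output for scoring
```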

@@ -18,12 +18,11 @@ ruff format . # Format
- `policybench/scenarios.py` — Household scenario generation
- `policybench/ground_truth.py` — PE-US calculations
- `policybench/prompts.py` — Natural language prompt templates
- `policybench/eval_no_tools.py` — EDSL-based AI-alone benchmark
- `policybench/eval_with_tools.py` — LiteLLM tool-calling benchmark
- `policybench/eval_no_tools.py` — LiteLLM-based AI-alone benchmark
- `policybench/analysis.py` — Metrics and reporting

## Testing
- All tests mock external calls (EDSL, LiteLLM, PE-US API)
- All tests mock external calls (LiteLLM, PE-US API); see the sketch below
- `pytest -m "not slow"` to skip slow tests
- Full benchmark runs are manual and expensive
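
A minimal sketch of the mocking pattern; the patch target and response shape below are illustrative assumptions, not the repo's actual test fixtures:

```python
from unittest.mock import patch

# Hypothetical patch target; point it at wherever the eval module calls LiteLLM.
@patch("policybench.eval_no_tools.litellm.completion")
def test_eval_uses_mocked_llm(mock_completion):
    mock_completion.return_value = {
        "choices": [{"message": {"content": '{"income_tax": 4500}'}}]
    }
    # ... invoke the eval path and assert on parsed predictions ...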

79 changes: 61 additions & 18 deletions README.md
@@ -1,23 +1,43 @@
# PolicyBench

Can AI models accurately calculate tax and benefit outcomes without tools?
How well can frontier models calculate tax and benefit outcomes without tools?

PolicyBench measures how well frontier AI models estimate US tax/benefit values for specific households — both **without tools** (pure reasoning) and **with PolicyEngine tools** (API access to ground truth).
PolicyBench measures how well frontier AI models estimate selected household tax
and benefit outputs without tools.

## Conditions
For benchmark scope, snapshot policy, and terminology, see the
[benchmark card](docs/benchmark_card.md).

US benchmark scenarios are sampled from Enhanced CPS households and evaluated
under tax year 2026 rules with PolicyEngine-US. The public UK path uses a
UK-calibrated transfer dataset and PolicyEngine-UK reference outputs for fiscal
year 2026-27.

## Condition

1. **AI alone**: Models estimate tax/benefit values using only their training knowledge
2. **AI with PolicyEngine**: Models use a PolicyEngine tool to compute exact answers

## Models tested
## Benchmark scope

- Claude (Opus 4.6, Sonnet 4.5)
- GPT (4o, o3)
- Gemini 2.5 Pro
Benchmark outputs are defined in `policybench/benchmark_specs.json`. New CLI
runs default to `headline`, which focuses the main ranking on person- or
household-facing outputs that contribute to household net income. PolicyEngine
variables may be native to lower-level entities, but benchmark outputs are
either expanded to people shown in the prompt or aggregated to the household
before scoring. Coverage outputs are binary flags in the headline ranking; the
separate household-equal impact score uses PolicyEngine value proxies to give
those flags a dollar-scale weight. Intermediate tax bases and payroll
subcomponents are excluded from the headline ranking. WIC is requested as
person-level eligibility, not as a dollar amount.
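
As a concrete illustration, the headline set can be read straight from the spec file. This is a sketch only: the `outputs` and `program_sets` field names are assumptions for illustration, not necessarily the actual `benchmark_specs.json` schema.

```python
import json

# Sketch: "outputs" and "program_sets" are assumed field names.
with open("policybench/benchmark_specs.json") as f:
    specs = json.load(f)

headline_outputs = [
    spec["variable"]
    for spec in specs["outputs"]
    if "headline" in spec.get("program_sets", [])
]
print(f"{len(headline_outputs)} headline outputs: {headline_outputs}")
```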

## Programs evaluated

Federal tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state income tax, net income, marginal tax rates, and more.
The current public release covers selected federal taxes, credits, benefits,
health-related support, coverage labels, and state-tax outputs in the US, plus
selected tax and transfer outputs in the UK. US federal income tax is scored as
a compact decomposition: tax after nonrefundable credits and before refundable
credits, plus refundable federal credits excluding the ACA Premium Tax Credit.
The ACA Premium Tax Credit is scored separately as a health-related output.
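
Concretely, the decomposition yields two scored dollar outputs rather than one net figure. The values and names below are hypothetical, purely to show the arithmetic:

```python
# Hypothetical household, illustrative values only.
tax_after_nonrefundable_credits = 4_500.0      # before refundable credits
refundable_federal_credits = 3_200.0           # all refundable credits
aca_premium_tax_credit = 1_100.0               # scored separately (health)

scored_outputs = {
    "income_tax_before_refundable_credits": tax_after_nonrefundable_credits,
    "refundable_credits_excluding_ptc": refundable_federal_credits
    - aca_premium_tax_credit,  # 2,100.0
}
```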

## Quick start

@@ -26,18 +26,41 @@ pip install -e ".[dev]"
pytest # Run tests (mocked, no API calls)
```

## Full benchmark
## Benchmark run

```bash
# Generate ground truth from PolicyEngine-US
policybench ground-truth
# Generate reference outputs for 100 sampled households using headline outputs
policybench reference-outputs -n 100 --seed 42

# Run AI-alone evaluations on the exported scenario manifest.
# The standard response contract includes numeric answers and explanations.
policybench eval-no-tools -n 100 --seed 42

# For larger runs, use resumable per-model chunks.
policybench eval-no-tools-chunked \
--scenario-manifest results/local/scenarios.csv \
--output-dir results/local/no_tools_chunked \
--country us \
--chunk-size 10 \
--parallel 2

# Run AI-alone evaluations
policybench eval-no-tools
# Analyze local results and export local artifacts
policybench analyze --output-dir results/local/analysis
```

## Repeated runs

# Run AI-with-tools evaluations
policybench eval-with-tools
```bash
# Optional: run the same benchmark multiple times on the saved scenario manifest
policybench eval-no-tools-repeated -n 100 --seed 42 --repeats 3 -o results/local/no_tools/runs

# Analyze results
policybench analyze
# Analyze the canonical point estimate plus across-run stability
policybench analyze --runs-dir results/local/no_tools/runs --output-dir results/local/analysis
```
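
Across-run stability can be summarized however suits the analysis; here is a minimal pandas sketch, where the run directory layout and column names are assumptions rather than the predictions CSV contract:

```python
import pandas as pd

# Sketch: spread of each model's mean absolute error across repeated runs.
# "prediction", "reference_value", and "model" are assumed column names.
runs = [
    pd.read_csv(f"results/local/no_tools/runs/run_{i}/predictions.csv")
    for i in range(3)
]
per_run_mae = [
    (run["prediction"] - run["reference_value"]).abs().groupby(run["model"]).mean()
    for run in runs
]
stability = pd.concat(per_run_mae, axis=1).std(axis=1)
print(stability.sort_values())
```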

`policybench reference-outputs` writes PolicyEngine reference outputs, not
administrative truth. It also writes `results/local/scenarios.csv`, and the eval
commands reuse that manifest by default instead of regenerating households from
the current source dataset. Prediction CSVs also get a `.meta.json` sidecar so
resumes only happen against the exact same manifest, model set, and program set.
`policybench ground-truth` remains as a compatibility alias.
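
The resume guard works roughly like the sketch below; the sidecar filename convention and field names are assumptions, not the exact `.meta.json` contract.

```python
import json
from pathlib import Path

# Sketch of the resume guard idea: only resume when the recorded
# provenance matches the requested run exactly.
def can_resume(predictions_csv: str, manifest_path: str, models: list[str]) -> bool:
    sidecar = Path(f"{predictions_csv}.meta.json")
    if not sidecar.exists():
        return False  # no provenance recorded; start fresh
    meta = json.loads(sidecar.read_text())
    return (
        meta.get("scenario_manifest") == manifest_path
        and sorted(meta.get("models", [])) == sorted(models)
    )
```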
139 changes: 73 additions & 66 deletions RESULTS.md
@@ -1,71 +1,78 @@
# PolicyBench: AI can't accurately calculate taxes and benefits — but tools fix that

> Can frontier AI models accurately calculate US tax and benefit outcomes?

**TL;DR: No — but with PolicyEngine tools, they achieve 100% accuracy.**

## Setup

- **100 household scenarios** across 12 US states, varying income ($0–$500k), filing status, and family composition
- **14 tax/benefit programs**: federal income tax, EITC, CTC, SNAP, SSI, Medicaid eligibility, state taxes, and more
- **4 frontier models**: GPT-5.2, Claude Sonnet 4.5, Claude Sonnet 4.6, Claude Opus 4.6
- **2 conditions**: AI alone (parametric knowledge only) vs. AI with PolicyEngine tools
- **Ground truth**: PolicyEngine-US microsimulation (1,400 scenario-program pairs)
- **Total predictions**: 9,800 (5,600 no-tools + 4,200 with-tools)

## Headline results

### Without tools (AI alone)

| Model | MAE | MAPE | Within 10% |
|:------|----:|-----:|----------:|
| Claude Sonnet 4.6 | $1,285 | 52% | 72.3% |
| Claude Opus 4.6 | $1,257 | 85% | 70.8% |
| GPT-5.2 | $2,578 | 78% | 62.1% |
| Claude Sonnet 4.5 | $2,276 | 125% | 61.9% |

### With PolicyEngine tools

| Model | MAE | MAPE | Within 10% |
|:------|----:|-----:|----------:|
| Claude Opus 4.6 | **$0** | **0%** | **100.0%** |
| Claude Sonnet 4.5 | **$0** | **0%** | **100.0%** |
| GPT-5.2 | **$0** | **0%** | **100.0%** |

### By program (AI alone, all models averaged)

| Program | MAE | MAPE | Within 10% |
|:--------|----:|-----:|----------:|
| Federal income tax | $4,234 | 54% | 41.0% |
| Income tax before credits | $2,683 | 39% | 62.7% |
| EITC | $727 | 298% | 75.3% |
| CTC | $1,028 | 174% | 74.3% |
| Refundable credits | $981 | 128% | 62.3% |
| SNAP | $769 | 55% | 80.7% |
| SSI | $436 | 100% | 95.7% |
| State income tax | $938 | 76% | 59.7% |
| Household net income | $10,586 | 14% | 66.0% |
| Total benefits | $5,228 | 117% | 43.7% |
| Market income | $0 | 0% | 100.0% |
| Marginal tax rate | $347 | N/A | 18.0% |

## Key takeaways

1. **Tools > models.** Every model with PolicyEngine (100% accuracy) vastly outperforms every model without it (62–72%). The choice of computational tool matters more than the choice of frontier model.

2. **AI alone is unreliable for policy calculations.** Even the best model (Claude Sonnet 4.6) averages $1,285 error per calculation and gets only 72% of answers within 10% of correct. The worst programs — income tax (41%), marginal tax rates (18%), and aggregate benefits (44%) — are precisely where accuracy matters most.

3. **With tools, accuracy is perfect.** All three tested models achieve $0 MAE and 100% within-10% accuracy across all 4,200 with-tools predictions. The tool returns ground truth, and models faithfully report it.

4. **Newer models are improving, but not enough.** Claude Sonnet 4.6 improved significantly over 4.5 (72% vs 62% within 10%), but still falls far short of the 100% achievable with tools. Model improvements can't substitute for computational tools.

5. **Marginal tax rates are nearly impossible without tools.** Only 18% of AI-alone predictions are within 10% of the correct marginal rate. This makes AI-generated policy advice about work incentives unreliable without computational backing.

6. **The benchmark validates PolicyEngine's value proposition.** Any AI system that needs to answer questions about US taxes and benefits should use PolicyEngine rather than relying on parametric knowledge.
# PolicyBench Results

PolicyBench is a no-tools benchmark. Ad hoc local outputs should live under
`results/local/` after a benchmark run. Published leaderboard claims should
instead point to dated batch directories or to a committed dashboard export
such as `app/src/data.json`.

## Run

```bash
policybench reference-outputs -n 100 --seed 42
policybench eval-no-tools -n 100 --seed 42
policybench analyze --output-dir results/local/analysis
```

The first command writes PolicyEngine reference outputs, not administrative
truth. `policybench ground-truth` remains as a compatibility alias.

## Full runbook

Use a dated batch directory and keep model outputs per country and model so
interrupted runs can resume independently.

```bash
RUN_DIR=results/full_batch_20260501

policybench reference-outputs -n 1000 --seed 42 --country us --program-set headline \
-o "$RUN_DIR/us/reference_outputs.csv" \
--scenario-manifest-output "$RUN_DIR/us/scenarios.csv"

policybench reference-outputs -n 1000 --seed 42 --country uk --program-set headline \
-o "$RUN_DIR/uk/reference_outputs.csv" \
--scenario-manifest-output "$RUN_DIR/uk/scenarios.csv"

for country in us uk; do
for model in claude-opus-4.7 claude-sonnet-4.6 claude-haiku-4.5 \
grok-4.3 grok-4.20 grok-4.1-fast gpt-5.5 gpt-5.4-mini gpt-5.4-nano \
gemini-3.1-pro-preview gemini-3-flash-preview \
gemini-3.1-flash-lite-preview; do
policybench eval-no-tools-chunked \
--scenario-manifest "$RUN_DIR/$country/scenarios.csv" \
--output-dir "$RUN_DIR/$country/no_tools_chunked" \
--country "$country" \
--model "$model" \
--program-set headline \
--chunk-size 50 \
--parallel 4
done
done

for country in us uk; do
mkdir -p "$RUN_DIR/$country/by_model"
cp "$RUN_DIR/$country/no_tools_chunked/by_model/"*.csv "$RUN_DIR/$country/by_model/"
done

python scripts/export_full_run.py --run-dir "$RUN_DIR"
```

For first-pass cost control, run the same commands with `-n 100` in a separate
scratch directory before launching the 1,000-household batch.

## Artifacts

- `results/local/reference_outputs.csv`
- `results/local/no_tools/predictions.csv`
- `results/local/analysis/metrics.csv`
- `results/local/analysis/summary_by_model.csv`
- `results/local/analysis/summary_by_variable.csv`
- `results/local/analysis/impact_summary_by_model.csv`
- `results/local/analysis/usage_summary.csv`
- `results/local/analysis/report.md`

## Methodology

See the [full paper](docs/) and [benchmark code](policybench/) for complete methodology. Ground truth is computed via [PolicyEngine-US](https://github.com/PolicyEngine/policyengine-us). All API responses are cached for reproducibility.
See the [full paper](docs/) and [benchmark code](policybench/) for complete methodology. Reference outputs are computed via [PolicyEngine-US](https://github.com/PolicyEngine/policyengine-us) and [PolicyEngine-UK](https://github.com/PolicyEngine/policyengine-uk); the public UK scenarios use PolicyBench's calibrated transfer dataset. LLM responses are cached for reproducibility.
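
The cache layer is LiteLLM's disk cache (the `.gitignore` tracks it at `.policybench_cache/`); the sketch below shows the general request-hash idea rather than the actual implementation.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".policybench_cache")

def cached_completion(model: str, prompt: str, call_api) -> str:
    # Key the response by a stable hash of the request.
    key = hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["response"]
    response = call_api(model, prompt)
    CACHE_DIR.mkdir(exist_ok=True)
    path.write_text(json.dumps({"response": response}))
    return response
```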

---
*[Cosilico](https://cosilico.ai) · [PolicyEngine](https://policyengine.org)*
*[PolicyEngine](https://policyengine.org) · [PolicyBench](https://policybench.org)*
5 changes: 5 additions & 0 deletions app/.gitignore
@@ -10,6 +10,9 @@ lerna-debug.log*
node_modules
dist
dist-ssr
.next
out
*.tsbuildinfo
*.local

# Editor directories and files
@@ -22,3 +25,5 @@
*.njsproj
*.sln
*.sw?
.vercel
.env*.local
9 changes: 9 additions & 0 deletions app/.vercelignore
@@ -0,0 +1,9 @@
.git
.gitignore
.next
dist
node_modules
output
playwright-report
coverage
*.log