diff --git a/app/public/paper/policybench.pdf b/app/public/paper/policybench.pdf
index d8c0c05..f03d083 100644
Binary files a/app/public/paper/policybench.pdf and b/app/public/paper/policybench.pdf differ
diff --git a/app/public/paper/web/index.html b/app/public/paper/web/index.html
index cb6a72d..9fb7475 100644
--- a/app/public/paper/web/index.html
+++ b/app/public/paper/web/index.html
@@ -159,6 +159,7 @@
Benchmark design
within 10%
For binary outputs, the score is exact accuracy. This keeps 100 as the ceiling while still giving partial credit for near misses on amount outputs. PolicyBench also tracks mean absolute error and related error metrics, but those are secondary to the bounded score. This choice preserves exact-match comparability while avoiding the failure mode that recent numeric-evaluation papers have criticized (Abbood et al. 2025).
+Because the four thresholds are nested (exact ⊆ within-1% ⊆ within-5% ⊆ within-10%), averaging their indicator functions is equivalent to assigning step credit by error band: predictions that match exactly receive 1.00, predictions inside 1% but not exact receive 0.75, predictions inside 5% but outside 1% receive 0.50, predictions inside 10% but outside 5% receive 0.25, and predictions outside 10% receive 0.00. The score is therefore a partial-credit rule that rewards tighter agreement more than looser agreement, not an unweighted aggregate of independent hit rates.
All requested US outputs are annual amounts or annual eligibility indicators for tax year 2026; UK outputs are annual amounts or annual eligibility indicators for fiscal year 2026-27. For currency amounts, the “exact” hit rate means within 1 currency unit of the reference value after numeric parsing. Percentage-threshold hit rates use relative error when the reference value is nonzero and the same 1 currency-unit absolute tolerance when the reference value is zero. Binary outputs are parsed as 0 or 1 eligibility flags and scored by exact classification accuracy. Missing or unparseable numeric answers receive zero score for that requested output.
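As a minimal sketch of the rule above (illustrative names, not the benchmark's implementation), an amount output can be scored by averaging the four nested hit indicators, with the stated fallback for zero references; binary outputs are simply exact 0/1 accuracy.

```python
def amount_output_score(reference: float, prediction: float | None) -> float:
    """Sketch of the 0-100 amount score: average of four nested hit indicators."""
    if prediction is None:
        return 0.0  # missing or unparseable numeric answers score zero
    error = abs(prediction - reference)
    hits = [error <= 1.0]  # "exact" means within 1 currency unit
    for tolerance in (0.01, 0.05, 0.10):
        if reference == 0.0:
            hits.append(error <= 1.0)  # zero references use the absolute tolerance
        else:
            hits.append(error / abs(reference) <= tolerance)
    # Nested thresholds make this a step rule: 100, 75, 50, 25, or 0.
    return 100.0 * sum(hits) / len(hits)
```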
Aggregation proceeds in four steps. First, each household-output prediction receives a 0-100 score. Second, person-level coverage rows are averaged to their program group. Third, output groups are averaged with equal weight within each country. Fourth, global scores average the US and UK country scores for models run in both countries. The global score is therefore an equal-country summary for this benchmark sample, not a universally authoritative measure of tax-benefit competence. Household-equal impact scores and other alternative views are reported as sensitivity checks.
The benchmark requires each model response to include numeric answers and one explanation per requested output. The headline score uses only the numeric answers. Explanations are retained for scenario exploration and qualitative error analysis; they should not be interpreted as faithful traces of model reasoning.
@@ -295,35 +296,45 @@ Frozen snapshot an
| 19 |
UK scenario source |
-enhanced_cps_2025.h5 public UK calibrated transfer artifact |
+enhanced_cps_2025.h5 |
| 20 |
+UK scenario-source repository |
+PolicyEngine/policyengine-uk-data |
+
+
+| 21 |
+UK scenario-source pinned commit |
+9514dfb7ec607897c9f7122a2e073b922c9fd8b6 |
+
+
+| 22 |
UK scenario-source SHA-256 |
199ebc61d29231b4799ad337a95393765b5fb5aede1834b93ff2acecceded866 |
-| 21 |
+| 23 |
Households |
100 US and 100 UK |
-| 22 |
+| 24 |
Models |
12 shared models |
-| 23 |
+| 25 |
Output groups |
19 US and 7 UK |
-| 24 |
+| 26 |
Condition |
No tools, no web access, one structured response per household |
-| 25 |
+| 27 |
Response contract |
Numeric answer and non-empty explanation for every requested output |
@@ -491,7 +502,7 @@ Frozen snapshot an
Data and scenario construction
United States
-
The US benchmark is built from Enhanced Current Population Survey (CPS)-derived households using PolicyEngine US. The sampled households are filtered to keep a single-tax-unit structure while retaining variation in filing status, household composition, and income sources. Prompts include nonzero promptable raw inputs across relevant entities rather than a hand-curated summary, so the models see many of the same facts the simulator receives.
+The US benchmark is built from Enhanced Current Population Survey (CPS)-derived households using PolicyEngine US. The sampled households are filtered to keep a single-tax-unit, single-family, single-Supplemental Poverty Measure (SPM)-unit structure with at least one adult and a supported filing status. The 2024 Enhanced CPS source contains 41,314 households; 30,173 (73.0%) pass the filter and form the eligible draw. The 27.0% excluded by the filter include multi-tax-unit households (e.g., adult roommates), multi-family households, multi-SPM-unit households, and households whose head reports a filing status outside the supported set. These excluded compositions are exactly the kind of cases where federal/state credit allocations and benefit-unit rules become hardest, so the eligible draw is a tractable subset rather than the full distribution of US households. Prompts include nonzero promptable raw inputs across relevant entities rather than a hand-curated summary, so the models see many of the same facts the simulator receives. Filing status is not stated in the prompt; the reference computation infers it from tax-unit role flags. Models therefore see the same household facts that drive the reference filing-status assignment, but they do not receive that assignment as a label.
The current US release evaluates 19 output groups spanning federal income tax, refundable credits, payroll and self-employment tax, state and local income tax, Supplemental Nutrition Assistance Program (SNAP), Supplemental Security Income (SSI), Temporary Assistance for Needy Families (TANF), Affordable Care Act (ACA) premium tax credits, school-meal eligibility, and person-level coverage eligibility for the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC), Medicaid, the Children’s Health Insurance Program (CHIP), Medicare, Head Start, and Early Head Start.
The output scope is intentionally narrower than the full PolicyEngine model. Table 3 summarizes the inclusion rule. The benchmark asks for WIC eligibility rather than a WIC dollar amount; WIC dollar values are used only as impact-weight proxies for coverage flags, not as requested model outputs.
@@ -561,10 +572,10 @@
United States
United Kingdom
-The UK benchmark is built from a calibrated public transfer dataset scored through PolicyEngine UK. The current public build starts from a public export of benchmark-compatible households from PolicyEngine US Enhanced CPS, maps those records into UK-facing inputs, and recalibrates them to selected UK targets. This creates a public UK-policy transfer benchmark without publishing restricted household microdata; it is not a representative evaluation over native UK household records and should not be treated as a substitute for Family Resources Survey (FRS)-based UK microdata. The current UK release evaluates seven outputs: Income Tax, National Insurance, Capital Gains Tax, Child Benefit, Universal Credit, Pension Credit, and Personal Independence Payment (PIP). Outputs that depend on status or award facts use prompt-visible facts rather than hidden take-up labels; for example, PIP-positive cases list the daily living and mobility award components used by PolicyEngine.
+The UK benchmark is built from a calibrated public transfer dataset scored through PolicyEngine UK. The current public build starts from a public export of benchmark-compatible households from PolicyEngine US Enhanced CPS, maps those records into UK-facing inputs, and recalibrates them to selected UK targets. The resulting enhanced_cps_2025.h5 artifact is checked in to the public PolicyEngine/policyengine-uk-data GitHub repository; the manuscript pins commit 9514dfb7ec607897c9f7122a2e073b922c9fd8b6 so that a third party can retrieve the exact file used here. The artifact contains 28,532 households; 28,502 (99.9%) pass the eligibility filter that retains households with one benefit unit and one or two adults. This creates a public UK-policy transfer benchmark without publishing restricted household microdata; it is not a representative evaluation over native UK household records and should not be treated as a substitute for Family Resources Survey (FRS)-based UK microdata. The current UK release evaluates seven outputs: Income Tax, National Insurance, Capital Gains Tax, Child Benefit, Universal Credit, Pension Credit, and Personal Independence Payment (PIP). Outputs that depend on status or award facts use prompt-visible facts rather than hidden take-up labels; for example, PIP-positive cases list the daily living and mobility award components used by PolicyEngine.
The UK data path is more synthetic than the enhanced FRS pipeline and inherits limitations from cross-country transfer, calibration choices, and the subset of variables that can be made prompt-visible. It supports the current public cross-country benchmark, but it is not equivalent to an enhanced-FRS-based benchmark and should not be used to make population-representative claims about UK households (Sutherland and Figari 2013).
Reference-output credibility
-PolicyBench treats PolicyEngine outputs as benchmark reference outputs, not as administrative records. The reference source is nevertheless stronger than an ad hoc answer key: PolicyEngine is open source, used for household calculators and reform analysis, and externally checked in specific domains. In the UK, No. 10’s data science team adapted PolicyEngine’s open-source microsimulation model for experimental policy simulation, with validation against external projections before use (Woodruff 2026; Ghenis 2026). In the US, PolicyEngine’s state tax modelling has been validated against the National Bureau of Economic Research (NBER) TAXSIM model, with reported penny-level agreement for the vast majority of 2021 test cases (PolicyEngine 2024; Feenberg and Coutts 1993). PolicyEngine has also signed a memorandum of understanding with the Federal Reserve Bank of Atlanta for future validation work against the Atlanta Fed’s Policy Rules Database (Ghenis and Makarchuk 2025; Federal Reserve Bank of Atlanta 2026). The Atlanta Fed sources are a caveat rather than evidence of completed validation for this benchmark: they document planned collaboration and the comparison source, not finished checks of the frozen PolicyBench outputs. Taken together, these sources support using PolicyEngine as a transparent reference implementation with partial external validation, but they do not validate every benchmark output.
+PolicyBench treats PolicyEngine outputs as benchmark reference outputs, not as administrative records. The reference source is nevertheless stronger than an ad hoc answer key: PolicyEngine is open source, used for household calculators and reform analysis, and externally checked in specific domains. In the UK, No. 10’s data science team adapted PolicyEngine’s open-source microsimulation model for experimental policy simulation, with validation against external projections before use (Woodruff 2026; Ghenis 2026). In the US, PolicyEngine reports matching the National Bureau of Economic Research (NBER) TAXSIM-35 model to the cent on the vast majority of cases for the 2021 tax year across hundreds of thousands of tax units per state, with state-specific differences documented in PolicyEngine’s integration tests (PolicyEngine 2024; Feenberg and Coutts 1993). We do not restate that comparison as a single percentage because the published source uses qualitative phrasing rather than a headline accuracy number. PolicyEngine has also signed a memorandum of understanding with the Federal Reserve Bank of Atlanta for future validation work against the Atlanta Fed’s Policy Rules Database (Ghenis and Makarchuk 2025; Federal Reserve Bank of Atlanta 2026). The Atlanta Fed sources are a caveat rather than evidence of completed validation for this benchmark: they document planned collaboration and the comparison source, not finished checks of the frozen PolicyBench outputs. Taken together, these sources support using PolicyEngine as a transparent reference implementation with partial external validation, but they do not validate every benchmark output.
This does not make PolicyEngine infallible. During benchmark development, we performed a manual, developer-led discrepancy review, with LLM assistance used to triage and summarize candidate cases surfaced by model explanations. Table 4 summarizes the main discrepancy classes reviewed before the frozen snapshot. In those reviewed classes, discrepancies reflected model mistakes, prompt ambiguity later corrected, or upstream data/model issues fixed before the frozen snapshot; the reviewed discrepancy classes did not identify unresolved PolicyEngine reference-output defects. This is a qualitative development audit, not an independent sampled validation study or exhaustive validation of every reference value.
@@ -1129,16 +1140,95 @@
Sensitivity to benchmark
-Sampling uncertainty
-The manuscript snapshot uses 100 households per country, so small score gaps should not be overinterpreted. Table 10 reports a household-resampling bootstrap over the frozen sample using 1,000 percentile bootstrap draws. It resamples households within each country, recomputes country scores with equal output-group weighting, then recomputes the equal-country global score. The table is a sampling-uncertainty check for this household draw, not uncertainty over future model releases or prompt variants.
+The household-equal impact score weights each output group inside a household by a blend of an equal-weight floor and the absolute reference-value share. The default uses a 0.3 floor. Table 10 reports the top three global ranks under floors of 0.0, 0.1, 0.3, 0.5, and 1.0. Floor 1.0 collapses to the equal-output baseline; floor 0.0 is pure dollar-share weighting. Rank stability across this range means the impact view is not driven by the specific floor choice.
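+Read as a convex combination, the blend can be sketched as follows (an illustration of the description above, not the canonical household_equal_impact_scores code):
+
+```python
+def blended_weights(reference_values: list[float], floor: float = 0.3) -> list[float]:
+    """Sketch: blend an equal-weight floor with absolute reference-value shares."""
+    count = len(reference_values)
+    total = sum(abs(value) for value in reference_values)
+    shares = [
+        abs(value) / total if total > 0 else 1.0 / count
+        for value in reference_values
+    ]
+    # floor = 1.0 collapses to equal weights; floor = 0.0 is pure dollar-share weighting.
+    return [floor * (1.0 / count) + (1.0 - floor) * share for share in shares]
+```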
-
+
Sampling uncertainty
+
The manuscript snapshot uses 100 households per country, so small score gaps should not be overinterpreted. Table 11 reports a household-resampling bootstrap over the frozen sample using 1,000 percentile bootstrap draws. It resamples households within each country, recomputes country scores with equal output-group weighting, then recomputes the equal-country global score. The intervals therefore quantify uncertainty from the specific 100-household draw used here. They do not capture (i) prompt-template variance, since the manuscript uses a single template per country with no paraphrase resamples; (ii) decoding stochasticity, since each model’s snapshot prediction is a single sample at the provider’s default decoding settings; (iii) provider-side drift after 2026-05-01 in alias-resolved model weights; or (iv) reference-output uncertainty inside PolicyEngine. Total uncertainty over those sources is wider than the household-bootstrap intervals.
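+For reference, a simplified percentile-bootstrap sketch over per-household scores (the manuscript version recomputes equal output-group weighting inside each resample; names here are illustrative):
+
+```python
+import numpy as np
+import pandas as pd
+
+def bootstrap_global_interval(household_scores: pd.DataFrame,
+                              draws: int = 1000, seed: int = 0) -> np.ndarray:
+    """Sketch: 95% percentile bootstrap of the equal-country global score.
+
+    Assumes one row per (country, household) with a precomputed household-level
+    contribution in a ``score`` column.
+    """
+    rng = np.random.default_rng(seed)
+    draw_values = []
+    for _ in range(draws):
+        country_means = []
+        for _, group in household_scores.groupby("country"):
+            idx = rng.integers(0, len(group), size=len(group))  # resample households
+            country_means.append(group["score"].iloc[idx].mean())
+        draw_values.append(float(np.mean(country_means)))  # equal-country global
+    return np.percentile(draw_values, [2.5, 97.5])
+```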
+
+
+
Fourth, structured-output reliability can affect rankings. The current manuscript snapshot has full parse coverage for the included models, but the benchmark still tracks coverage because failures to return parseable numeric values should count as benchmark failures (Shorten et al. 2024).
Limitations
PolicyBench is not a substitute for a production tax-and-benefit calculator. Several caveats matter:
diff --git a/docs/discussion.md b/docs/discussion.md
deleted file mode 100644
index c2b937e..0000000
--- a/docs/discussion.md
+++ /dev/null
@@ -1,101 +0,0 @@
----
-title: Discussion
----
-
-# Discussion
-
-## Where models fail
-
-The AI-alone results reveal systematic patterns in model errors that appear related to the underlying structure of tax and benefit programs.
-
-**Multi-step tax quantities are consistently difficult.** In the current
-snapshot, US federal income tax before refundable credits, state income tax
-before refundable credits, payroll tax, and UK Income Tax and National
-Insurance are among the lowest-scoring outputs. These quantities require the
-model to identify the right income concept, deduction or credit sequence,
-threshold, rate schedule, and jurisdiction-specific rule before doing the final
-arithmetic.
-
-**Positive benefit cases remain hard even when zero cases are easy.** Sparse
-programs can look high-performing because many households have a true value of
-zero. Positive Supplemental Nutrition Assistance Program (SNAP), Universal
-Credit, Pension Credit, Personal Independence Payment (PIP), and similar cases
-are more informative: models often recognize that a program exists, but miss
-the income test, asset treatment, award level, or taper when the reference value
-is positive.
-
-**Phase-outs and cliffs create discontinuities.** The Earned Income Tax Credit
-(EITC), Child Tax Credit (CTC), Affordable Care Act (ACA) Premium Tax Credit,
-Universal Credit, and many state tax provisions have thresholds, phase-ins,
-phase-outs, and eligibility cliffs. Models often produce smooth approximations
-where the reference function is piecewise and sometimes discontinuous.
-
-**Jurisdiction-specific rules add complexity.** State and local income tax in
-the US and UK fiscal-year rules require jurisdiction-specific thresholds,
-rates, and credit interactions. The benchmark therefore tests more than generic
-tax bracket recall; it tests whether models can apply the right rules to a
-concrete household.
-
-## What this benchmark is meant to measure
-
-PolicyBench is now scoped to the no-tools condition because that is the capability question of interest here: whether frontier models can estimate household-level tax-benefit outcomes from a description alone. Tool use may still matter in production systems, but it answers a different question. A benchmark centered on tool access tends to measure interface compliance and delegation quality more than unaided policy-calculation ability.
-
-That distinction matters because the failure modes in the AI-alone condition are substantive in this snapshot. Some errors are not just formatting failures or one-step schema misses; they are large quantitative mistakes on thresholds, phase-outs, nonlinear interactions, and state-specific rules. Those are the computational limits the benchmark is intended to measure.
-
-## Implications for AI-assisted policy analysis
-
-These results suggest design considerations for AI systems that provide policy analysis:
-
-1. **Computation may need maintained tools.** These results do not support relying on unaided large language model (LLM) memory for precise tax and benefit calculations. Microsimulation engines like PolicyEngine are designed to handle this complexity and encode statutory rules in auditable software.
-
-2. **Models may add value as interfaces.** In policy analysis systems, an LLM can translate natural language questions into structured API calls, interpret results for non-technical users, and synthesize findings across multiple scenarios. PolicyBench does not evaluate those tool-using workflows directly.
-
-3. **Benchmarks can distinguish capability from system design.** A no-tools benchmark is useful for measuring unaided model capability. A separate system benchmark can measure how well models use tools in practice. Conflating the two makes the headline harder to interpret.
-
-4. **Household realism affects interpretation.** Sampling benchmark cases from realistic microdata can capture interactions between filing status, age structure, income mix, and household composition. Benchmarks built from independently sampled attributes may test unrealistic combinations as well as policy difficulty.
-
-## Limitations
-
-Several limitations qualify these findings:
-
-**Scope of outputs.** PolicyBench evaluates selected tax and benefit outputs, not the full tax-benefit system. The headline scope focuses on person- or household-facing net-income components and selected coverage flags. Coverage flags are binary in the main ranking; PolicyEngine value proxies are used only in the secondary household-equal impact score. PolicyEngine variables may be native to lower-level entities, but headline outputs are either expanded to the people shown in the prompt or aggregated to the household before scoring. The US federal income-tax output is represented by tax before refundable credits and refundable credits; net federal income tax excluding the Affordable Care Act (ACA) Premium Tax Credit (PTC) can be derived from those two outputs. Intermediate tax bases are excluded. Model performance may differ on outputs not included in the benchmark.
-
-**Household complexity.** Sampling from the Enhanced Current Population Survey (CPS) preserves observed household structure, but the benchmark still uses a filtered subset of households so that cases remain promptable and interpretable. More complex multi-tax-unit households, itemized-deduction-heavy filers, and unusual household structures remain underrepresented.
-
-**Single policy period.** Evaluations use US tax year 2026 and UK fiscal year
-2026-27. Model performance may differ for historical years, where training data
-is more abundant, or future years, where models must extrapolate from known
-rules.
-
-**Open public set.** The site exposes the current household prompts, model
-outputs, explanations, and PolicyEngine reference outputs for transparency.
-That makes the public leaderboard an open-set benchmark, not a protected
-held-out test set. Later models or prompts could benefit from released cases,
-so open-set leakage is a prominent limitation of public rankings.
-
-**Global score weighting.** The global score gives equal weight to the US and
-UK country scores for models run in both countries. It is an equal-country
-summary for this benchmark sample, not a universal ranking of model quality.
-Country-only and alternative-weighting views should be read alongside it.
-
-**UK data provenance.** The current public UK path uses a calibrated transfer
-dataset rather than restricted native UK survey microdata. It supports a
-transparent public UK track, but it is not equivalent to enhanced Family
-Resources Survey (FRS)-based microdata and is not designed for
-population-representative UK household claims.
-
-**Prompt sensitivity.** We use a single prompt template per condition. Model performance may be sensitive to prompt phrasing, particularly in the AI-alone condition where chain-of-thought prompting or structured reasoning might improve accuracy.
-
-**Model versions.** AI model capabilities change rapidly. Results for specific model versions may not generalize to future releases.
-
-## Future work
-
-Several extensions of PolicyBench would improve interpretation:
-
-**Additional country tracks.** PolicyEngine supports Canadian and other tax-benefit systems. Extending PolicyBench beyond the current US and UK tracks would test whether models' computational limitations are specific to these systems or are more general.
-
-**Specialized policy models.** PolicyBench provides a natural evaluation framework for testing whether future domain-adapted policy models improve unaided performance on tax and benefit calculation, not just general-language reasoning.
-
-**Dynamic scenarios.** Current scenarios are static household snapshots. Future versions could test models on reform scenarios (e.g., "What would this household's SNAP benefits be if the maximum allotment increased by 10%?"), which require understanding both baseline rules and the proposed change.
-
-**Multi-turn evaluation.** Real-world policy analysis often involves iterative questioning: a user asks about one variable, then follows up about related variables or alternative scenarios. Evaluating models in multi-turn settings would better reflect actual use cases.
diff --git a/docs/index.md b/docs/index.md
index 4dd12f3..f81322b 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,18 +1,29 @@
---
-title: "PolicyBench: Can AI models calculate tax and benefit outcomes?"
+title: "PolicyBench"
---
-# PolicyBench: Can AI models calculate tax and benefit outcomes?
+# PolicyBench
**Max Ghenis** (PolicyEngine)
-## Abstract
+PolicyBench evaluates whether frontier language models can estimate household
+tax and benefit outputs from household facts without tools. The canonical
+manuscript is maintained in [`paper/index.qmd`](https://github.com/PolicyEngine/policybench/blob/main/paper/index.qmd)
+and rendered to [`app/public/paper/policybench.pdf`](https://policybench.org/paper/policybench.pdf)
+and [`app/public/paper/web/`](https://policybench.org/paper/).
-Large language models have absorbed information about tax codes, benefit programs, and policy rules, yet translating this knowledge into quantitative household outputs remains difficult. PolicyBench is a public no-tool benchmark for selected person- and household-facing tax and benefit outputs in the US and UK. We test frontier models on sampled household scenarios evaluated under US tax year 2026 and UK fiscal year 2026-27 rules and scored against PolicyEngine reference outputs.
+This documentation site is a thin index for operational material that does
+not belong in the manuscript:
-The benchmark focuses on a single condition: AI alone, where models must rely on their parametric knowledge to estimate policy outcomes from a household description. US scenarios are sampled from Enhanced Current Population Survey (CPS) households. The public UK path uses a calibrated transfer dataset, not restricted native UK survey microdata. PolicyEngine generates the benchmark reference outputs for each described case.
+- [Benchmark card](benchmark_card.md): scope, response contract, snapshot policy,
+ naming discipline, and minimum reporting standard.
+- [Results runbook](results.md): CLI invocations, output artefacts, and
+ interpretation notes for local runs.
+- [Reading the paper](paper.md): how the rendered manuscript, snapshot
+ artefacts, and reproducibility manifests fit together.
-These findings are intended to measure model capability, not tool compliance. The public scenario explorer exposes prompts and reference outputs, so the public leaderboard is open-set rather than a protected held-out evaluation. The central question is therefore specific: how much household-level tax-benefit calculation ability frontier models show in this public no-tool benchmark when they do not have access to external computation.
+The live leaderboard, scenario explorer, and per-model failure-mode views are
+hosted at [policybench.org](https://policybench.org).
```{tableofcontents}
```
diff --git a/docs/introduction.md b/docs/introduction.md
deleted file mode 100644
index f5e0538..0000000
--- a/docs/introduction.md
+++ /dev/null
@@ -1,39 +0,0 @@
----
-title: Introduction
----
-
-# Introduction
-
-## The promise and peril of AI for policy analysis
-
-Artificial intelligence is increasingly invoked as a tool for public policy analysis. Large language models (LLMs) can summarize legislation, explain eligibility rules, and draft policy memos. Policymakers, journalists, and researchers have begun using these models to answer questions about how tax and benefit systems affect specific households --- questions like "How much would this family receive in Supplemental Nutrition Assistance Program (SNAP) benefits?" or "What is the marginal tax rate for a single parent earning $40,000 in California?"
-
-These questions have reference outputs under a specified tax-benefit model. The US tax code and benefit programs define formulas, phase-out schedules, income thresholds, and interaction effects that together determine a household's tax liability, credit amounts, and benefit eligibility. Estimating these outputs requires knowledge of individual program rules and the ability to execute multi-step calculations that account for interactions across programs, state-specific provisions, and household-specific circumstances.
-
-LLMs are trained on tax law, IRS publications, benefit program documentation, and policy analyses. They can often describe the rules governing a program in considerable detail. But describing rules and computing outcomes from those rules are fundamentally different tasks. The question motivating this paper is whether frontier AI models can bridge that gap --- whether their parametric knowledge of policy rules translates into accurate quantitative outputs for specific household scenarios.
-
-## Why precision matters
-
-Policy analysis is a domain where approximate answers can be worse than no answer at all. Consider a family evaluating whether to accept a raise that might push them above a benefit cliff, a tax preparer estimating a client's refundable credits, or a researcher modeling the distributional effects of a proposed reform. In each case, errors of even a few hundred dollars can lead to materially wrong conclusions.
-
-The stakes are compounded by the complexity of the US tax-and-benefit system. Federal income tax alone involves multiple filing statuses, bracket structures, deductions, exemptions, and credits --- each with its own phase-in and phase-out schedules. Layered on top are state income taxes (with their own brackets and rules), means-tested benefits like SNAP and Supplemental Security Income (SSI) (with asset tests, income disregards, and categorical eligibility rules), and tax credits like the Earned Income Tax Credit (EITC) and Child Tax Credit (CTC) (with earned income requirements, child age limits, and investment income thresholds). These programs interact in ways that create effective marginal tax rates that are discontinuous, non-monotonic, and difficult to compute even for domain experts.
-
-Microsimulation models exist precisely to handle this complexity. PolicyEngine encodes program logic and generates reference outputs for specified household configurations. The question is whether AI models, armed only with their training data, can approximate these computations well enough to be useful.
-
-## Prior work
-
-Benchmarking AI models on quantitative reasoning tasks is a well-established area. Mathematical reasoning benchmarks like GSM8K {cite}`cobbe2021gsm8k` and MATH {cite}`hendrycks2021math` evaluate models on multi-step arithmetic and algebraic problems. Domain-specific benchmarks exist for medical reasoning, legal analysis, and financial calculations.
-
-Benchmarks for tax and benefit computation remain limited. TaxBench evaluated LLMs on tax preparation questions but focused on qualitative understanding of tax rules rather than numerical computation for specific households. PolicyBench focuses on household-level tax liabilities, credit amounts, and benefit levels for selected scenarios across multiple programs.
-
-PolicyBench is a public no-tool benchmark for selected person- and household-facing tax and benefit outputs. It measures a combined task: given a fully specified household and a specific policy variable, how close can a model get to PolicyEngine reference outputs without external computation, while following a structured response contract?
-
-## This paper's contributions
-
-This paper makes three contributions:
-
-1. **A public benchmark for AI-assisted policy analysis.** PolicyBench defines sampled household scenarios, encodes them for PolicyEngine, and scores selected tax-and-benefit outputs against PolicyEngine reference values. The benchmark is open-source and extensible to additional countries and programs.
-
-2. **An empirical evaluation of frontier model capabilities.** We test a dated snapshot of frontier and budget models in a no-tools setting across the US and UK. Our results quantify how much household-level policy calculation these models can do from parametric knowledge alone.
-
-3. **Evidence about the limits of unaided policy calculation.** We show where frontier models systematically fail on thresholds, phase-outs, state variation, and program interactions, and provide a benchmark that can track whether future model generations close that gap.
diff --git a/docs/methodology.md b/docs/methodology.md
deleted file mode 100644
index fb0a6a8..0000000
--- a/docs/methodology.md
+++ /dev/null
@@ -1,183 +0,0 @@
----
-title: Methodology
----
-
-# Methodology
-
-## Experimental design
-
-PolicyBench evaluates frontier AI models on a no-tools task: given a fully
-specified household and a named set of policy variables, produce the requested
-outputs without tools.
-
-- **AI alone.** The model receives a natural language description of the
- household and must estimate the requested outputs using only its parametric
- knowledge. No tools, APIs, or reference materials are provided.
-
-The benchmark is intentionally scoped to direct no-tools capability. It is
-designed to measure whether models can estimate household-level PolicyEngine
-reference outputs from the information in the prompt, not whether they can
-comply with an external calculator interface.
-
-## Models tested
-
-The default benchmark model registry tracks the currently published no-tools
-leaderboard. Published paper claims should identify the exact dated result
-export, model set, household sample, output set, and policy period used for the
-manuscript.
-
-Models are prompted to return numeric outputs plus one short non-empty
-explanation per output under a structured response contract. Scores use the
-numeric outputs; explanations are retained for qualitative review and error
-analysis.
-The same country-level prompt template is used across models. Models receive
-the same household facts and requested outputs, no tools or web access, and the
-prompt states that unlisted numeric inputs are zero while unlisted boolean or
-status facts are false.
-
-## Programs evaluated
-
-Benchmark outputs are specified in `policybench/benchmark_specs.json`. The
-active benchmark defaults to `headline`, which focuses the ranking on
-person- or household-facing outputs that contribute to household net income.
-PolicyEngine variables may be native to lower-level entities; the benchmark
-contract either expands them to the people shown in the prompt or aggregates
-them to the household before scoring. Intermediate tax bases and payroll
-subcomponents are excluded from the public ranking. Coverage eligibility
-outputs are booleans and are weighted by PolicyEngine dollar-value proxies in
-the household-equal impact score.
-
-The US headline scope evaluates direct net-income components, health-related
-support, and coverage flags:
-
-| Variable | Description | Category |
-|:---------|:-----------|:---------|
-| `federal_income_tax_before_refundable_credits` | Federal income tax after nonrefundable credits and before refundable credits | Federal tax |
-| `federal_refundable_credits` | Federal refundable income tax credits | Federal tax |
-| `payroll_tax` | Payroll tax on wages | Payroll tax |
-| `self_employment_tax` | Self-employment tax | Payroll tax |
-| `state_income_tax_before_refundable_credits` | State income tax before refundable credits | State tax |
-| `state_refundable_credits` | State refundable income tax credits | State tax |
-| `local_income_tax` | Local income tax liability | Local tax |
-| `snap` | Supplemental Nutrition Assistance Program (SNAP) annual benefit | Benefits |
-| `ssi` | Supplemental Security Income (SSI) | Benefits |
-| `tanf` | Temporary Assistance for Needy Families (TANF) benefit amount | Benefits |
-| `premium_tax_credit` | Affordable Care Act (ACA) Marketplace premium assistance | Health |
-| `person_wic_eligible` | Expanded to one Special Supplemental Nutrition Program for Women, Infants, and Children (WIC) eligibility flag per person in the household | Coverage |
-| `person_medicaid_eligible` | Expanded to one Medicaid eligibility flag per person in the household | Coverage |
-| `person_chip_eligible` | Expanded to one Children's Health Insurance Program (CHIP) eligibility flag per person in the household | Coverage |
-| `person_medicare_eligible` | Expanded to one Medicare eligibility flag per person in the household | Coverage |
-| `person_head_start_eligible` | Expanded to one Head Start eligibility flag per person in the household | Coverage |
-| `person_early_head_start_eligible` | Expanded to one Early Head Start eligibility flag per person in the household | Coverage |
-| `free_school_meals_eligible` | Household qualifies for free school meals | Coverage |
-| `reduced_price_school_meals_eligible` | Household qualifies for reduced-price school meals | Coverage |
-
-The UK headline scope evaluates:
-
-| Variable | Description | Category |
-|:---------|:-----------|:---------|
-| `income_tax` | Income Tax liability | Tax |
-| `national_insurance` | National Insurance contributions | Tax |
-| `capital_gains_tax` | Capital Gains Tax liability | Tax |
-| `child_benefit` | Child Benefit amount before the High Income Child Benefit Charge | Benefits |
-| `universal_credit` | Universal Credit amount | Benefits |
-| `pension_credit` | Pension Credit amount | Benefits |
-| `pip` | Personal Independence Payment (PIP) amount | Benefits |
-
-The US federal tax decomposition is intentionally compact: final federal income
-tax excluding the Affordable Care Act (ACA) Premium Tax Credit (PTC) should
-equal tax before refundable credits minus federal refundable credits. ACA PTC
-is kept as a separate health-related output because it depends on Marketplace
-premium assistance rather than only income-tax credit sequencing. Binary
-coverage outputs are scored with classification accuracy rather than dollar
-error metrics.
-
-The benchmark excludes intermediate tax bases, payroll subcomponents, and
-outputs that mainly require unavailable household history, restricted local
-market data, or program take-up assignment rather than rule calculation. ACA
-Premium Tax Credit is retained as a deliberate health-support output; when
-local benchmark premiums are not listed, the model must estimate them from the
-household facts. WIC is requested as person-level eligibility, not as a dollar
-amount; WIC dollar values are used only as impact-weight proxies for coverage
-flags.
-
-## Household scenarios
-
-US scenarios are sampled from Enhanced Current Population Survey (CPS)
-households with a fixed random seed for reproducibility. To keep household
-descriptions faithful and tractable, we restrict sampled cases to households
-with a single federal tax unit, a single family, and a single
-benefit-calculation unit. Adult dependents remain in scope when they satisfy
-those restrictions. We carry through ages, household roles, employment
-patterns, and selected non-wage income sources, but do not provide filing
-status in the prompt.
-
-The public UK path samples from the UK-calibrated transfer dataset rather than
-restricted native UK survey microdata. The current UK benchmark keeps
-households with one benefit unit and one or two adults. The prompt states that
-all listed people are in the same UK benefit unit; if two adults are listed,
-they are the couple in that benefit unit. This keeps Universal Credit, Pension
-Credit, Child Benefit, Income Tax, and National Insurance prompts aligned with
-the household structure used by PolicyEngine-UK. The public transfer path is
-intended for benchmark transparency and reproducibility, not for
-population-representative claims about UK households.
-
-US scenarios are converted to PolicyEngine-US household JSON objects specifying
-people, tax units, Supplemental Poverty Measure (SPM) units, families, and
-households. Tax-unit role flags are included in the PolicyEngine input so the
-reference calculation can infer filing status from the same household structure
-described to the model. UK scenarios are converted to PolicyEngine-UK inputs
-using the public transfer dataset's person and benefit-unit structure, with
-prompts limited to one household group for tax and benefit calculations.
-
-## Reference-output computation
-
-Reference output values are computed using PolicyEngine-US and PolicyEngine-UK.
-For each sampled scenario and selected output, we run a PolicyEngine simulation
-for US tax year 2026 or UK fiscal year 2026-27 and record the computed value.
-
-PolicyEngine is the benchmark reference source. Its calculations implement
-policy rules and are maintained as open-source microsimulation models. Any
-discrepancy between a model's output and the PolicyEngine value is treated as a
-model error relative to the benchmark reference output, not as a claim about
-administrative records.
-
-## Evaluation metrics
-
-The public leaderboard uses a bounded `0-100` score. For amount outputs, each
-prediction receives credit for exact match and for falling within 1%, 5%, and
-10% of the PolicyEngine reference value; the output score averages those hit
-rates. For binary outputs, the score is classification accuracy after numeric
-predictions are rounded to `0` or `1`. Person-level coverage rows are averaged
-within their program group before country-level aggregation. Missing or
-unparseable answers count as misses through the coverage multiplier. The
-country score gives each output group equal weight, and the global score gives
-the US and UK equal weight for models run in both countries. The global score
-is an equal-country summary for this benchmark design, not a universally
-authoritative model ranking. Alternative country-only, amount-only,
-positive-reference-case, zero-reference-case, and household-equal impact views
-should be reported when making public claims.
-
-We also report diagnostic error metrics. **Mean absolute error (MAE)** measures
-the average magnitude of errors in currency terms. For a set of $n$ predictions
-$\hat{y}_i$ against reference values $y_i$:
-
-$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|\hat{y}_i - y_i|$$
-
-**Mean absolute percentage error (MAPE)** measures relative error, excluding
-cases where the reference value is zero:
-
-$$\text{MAPE} = \frac{1}{|S|}\sum_{i \in S}\left|\frac{\hat{y}_i - y_i}{y_i}\right|, \quad S = \{i : y_i \neq 0\}$$
-
-**Within-10% accuracy** measures the fraction of predictions that fall within
-10% of the reference value. For zero reference values, we instead check whether
-the prediction is within `$1` of zero:
-
-$$\text{Acc}_{10\%} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\left[(y_i = 0 \land |\hat{y}_i| \leq 1) \lor (y_i \neq 0 \land \frac{|\hat{y}_i - y_i|}{|y_i|} \leq 0.10)\right]$$
-
-For binary variables, we report classification accuracy.
-
-Sensitivity checks should be reported alongside paper claims, including
-amount-only, binary-only, positive-reference-case, zero-reference-case,
-country-only, and household-equal impact-score rankings when the relevant
-artifacts are available.
diff --git a/docs/myst.yml b/docs/myst.yml
index f9bb193..f4a3763 100644
--- a/docs/myst.yml
+++ b/docs/myst.yml
@@ -1,6 +1,6 @@
version: 1
project:
- title: "PolicyBench: Can AI models calculate tax and benefit outcomes?"
+ title: "PolicyBench operational documentation"
authors:
- name: Max Ghenis
affiliations:
@@ -14,15 +14,9 @@ project:
license: MIT
toc:
- file: index
- - file: introduction
- - file: methodology
+ - file: paper
+ - file: benchmark_card
- file: results
- - file: discussion
- - file: references
- exports:
- - format: pdf
- template: plain_latex_book
- output: exports/policybench.pdf
site:
- title: "PolicyBench: Can AI models calculate tax and benefit outcomes?"
+ title: "PolicyBench operational documentation"
diff --git a/docs/paper.md b/docs/paper.md
new file mode 100644
index 0000000..cce3542
--- /dev/null
+++ b/docs/paper.md
@@ -0,0 +1,50 @@
+---
+title: Reading the paper
+---
+
+# Reading the paper
+
+The canonical PolicyBench manuscript is the Quarto source at
+[`paper/index.qmd`](https://github.com/PolicyEngine/policybench/blob/main/paper/index.qmd).
+It builds against:
+
+- `app/src/data.json` — the frozen dashboard export with model summaries,
+ program summaries, scenario predictions, prompts, and PolicyEngine runtime
+ bundle metadata.
+- `paper/snapshot/20260501/` — the dated snapshot directory with
+ scenarios, reference outputs, impact summaries, run-level artefacts under
+ `runs/`, the rendered PDF/web manuscript hashes, and the
+ `manifest.json` provenance index.
+
+## Rendered outputs
+
+- PDF: [`app/public/paper/policybench.pdf`](https://policybench.org/paper/policybench.pdf)
+- Web: [`app/public/paper/web/`](https://policybench.org/paper/)
+- Both rendered artefacts are sha256-pinned in
+ `paper/snapshot/20260501/manifest.json` under `rendered_paper_artifacts`.
+
+## What to cite
+
+For methodology, scope, response contract, scoring rule, and limitations, cite
+[`paper/index.qmd`](https://github.com/PolicyEngine/policybench/blob/main/paper/index.qmd)
+at the snapshot date. The `docs/` site does not duplicate the manuscript
+prose; it only carries the operational runbook ([`results.md`](results.md))
+and the normative benchmark card
+([`benchmark_card.md`](benchmark_card.md)).
+
+## Reproducibility checklist
+
+The manifest at `paper/snapshot/20260501/manifest.json` lists:
+
+- the dashboard export and snapshot CSV hashes
+- the per-run compact artefacts under `runs/`, including
+ `predictions.csv.gz` with raw provider responses
+- the rendered PDF and web bundle hashes
+- the UK calibrated transfer dataset's pinned commit, public URL, and sha256
+- reproducibility notes covering model-alias instability and what is not
+ retained locally (LiteLLM cache, since it is a generated request cache)
+
+A third party can verify the leaderboard numbers against the committed
+`ground_truth.csv` files without rerunning the benchmark, and can rerun the
+benchmark by pointing `policybench eval-no-tools-chunked` at the same
+scenarios.
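+
+As an illustration (the exact manifest key layout may differ from this sketch), a
+rendered artifact's hash can be recomputed locally and compared against the pinned
+values in `manifest.json`:
+
+```python
+import hashlib
+import json
+from pathlib import Path
+
+def sha256_of(path: Path) -> str:
+    """Stream a file and return its SHA-256 hex digest."""
+    digest = hashlib.sha256()
+    with path.open("rb") as handle:
+        for chunk in iter(lambda: handle.read(1 << 20), b""):
+            digest.update(chunk)
+    return digest.hexdigest()
+
+manifest = json.loads(Path("paper/snapshot/20260501/manifest.json").read_text())
+# Compare against the pinned hash recorded under rendered_paper_artifacts.
+print(sha256_of(Path("app/public/paper/policybench.pdf")))
+```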
diff --git a/docs/references.md b/docs/references.md
deleted file mode 100644
index abc1f6f..0000000
--- a/docs/references.md
+++ /dev/null
@@ -1,12 +0,0 @@
----
-title: References
----
-
-# References
-
-```{bibliography}
-```
-
-The paper bibliography is maintained in
-[`paper/references.bib`](../paper/references.bib). Public claims should cite the
-rendered paper or that BibTeX file rather than this documentation stub.
diff --git a/docs/site_improvements_scope.md b/docs/site_improvements_scope.md
new file mode 100644
index 0000000..9afbfb6
--- /dev/null
+++ b/docs/site_improvements_scope.md
@@ -0,0 +1,216 @@
+---
+title: PolicyBench.org improvements scope
+---
+
+# PolicyBench.org improvements scope
+
+A scoping document for improvements to the PolicyBench dashboard at
+[policybench.org](https://policybench.org). The current site is a Next.js App
+Router application under `app/` rendering a single dashboard
+(`app/src/App.tsx`) with five sections — model leaderboard, scenario
+explorer, failure modes, program heatmap, methodology — plus a `/paper`
+landing page that embeds the rendered manuscript. This document does not
+change any code; it lists what would meaningfully improve the experience
+for the audiences that visit the site, ranked by impact.
+
+## Audiences
+
+1. **Researchers** comparing model performance over time and across views.
+2. **Practitioners** deciding whether to trust an LLM with a household
+ tax/benefit question.
+3. **Provider teams and benchmarking groups** auditing results for their
+ own model.
+4. **Press and policy commentators** referencing the headline number.
+
+The site already serves audience 1 well at the leaderboard level.
+Audiences 2–4 are underserved: there is no per-model deep dive, no
+sensitivity-view selector, no per-scenario explanation browsing, and no
+honest open-set warning beyond the paper.
+
+## Tier 1 — high impact, ship soon
+
+### 1.1 Open-set leakage banner and snapshot freshness indicator
+**Why.** The paper now states clearly that the public leaderboard is
+open-set. The site does not. A casual visitor reads the leaderboard as a
+held-out evaluation. A small banner near the leaderboard header (`Sources
+released publicly. Open-set leaderboard — see methodology.`) plus a
+snapshot date pill (`Snapshot 2026-05-01`) close that gap.
+
+**Where.** `app/src/components/Hero.tsx` — add a banner element next
+to the title or under the subtitle. `app/src/components/ModelLeaderboard.tsx`
+— add a one-line caveat above the table.
+
+**Effort.** Half a day.
+
+### 1.2 Sensitivity-view selector on the leaderboard
+**Why.** The paper reports eight sensitivity views (amount-only,
+binary-only, positive-only, zero-only, country-only, impact-weighted, US
+binary coverage). The dashboard only shows the main equal-output view.
+Surfacing the same selector on the site lets visitors verify rank
+stability themselves rather than taking the headline as final.
+
+**Where.** New control in `ModelLeaderboard.tsx` (segmented control above
+the table); the underlying score variants are either pre-computed and
+stored in `data.json` or computed client-side from the `scenarioPredictions`
+already present.
+
+**Effort.** Two days. Pre-computation in `analysis.build_dashboard_payload`
+is the cleaner path because it keeps the client small and the math
+canonical.
+
+### 1.3 Bootstrap-rank intervals next to model scores
+**Why.** The paper computes household-resampling 95% intervals and
+rank-ranges. The site shows a single point estimate. Rendering the rank
+range (`Rank 2 (CI: 1–4)`) tempers overinterpretation of small gaps.
+
+**Where.** `ModelLeaderboard.tsx`. Pre-compute in
+`analysis.build_dashboard_payload` so the client just renders.
+
+**Effort.** One to two days.
+
+### 1.4 Per-model deep-dive page
+**Why.** A provider team wanting to audit `gpt-5.4-mini` has no entry
+point. A per-model page (`/model/[id]`) showing the model's score, top
+errors, hardest variables, parse coverage, raw response examples, and a
+comparison to the next-best model on each output addresses the audit use
+case directly.
+
+**Where.** New App Router route `app/src/app/model/[model]/page.tsx`
+consuming the existing `scenarioPredictions` and `failureModes` data.
+
+**Effort.** Three to four days.
+
+## Tier 2 — meaningful impact, ship next
+
+### 2.1 Per-output deep-dive page
+**Why.** Symmetric to per-model: a researcher interested in SNAP wants
+"who gets it right, who gets it wrong, on which households". Currently the
+program heatmap collapses this to one cell per (variable, model).
+
+**Where.** `app/src/app/output/[variable]/page.tsx` consuming
+`programStats`, `scenarioPredictions`, and `failureModes`.
+
+**Effort.** Two days.
+
+### 2.2 Cross-country compare view
+**Why.** When the global view is selected, models that exist in both
+countries should be compared side-by-side per output where the output
+exists in both (e.g., income tax). Currently the user has to switch tabs
+and remember numbers.
+
+**Where.** New section in the global view of `App.tsx`, or extend
+`ModelLeaderboard` with a "show country split" toggle.
+
+**Effort.** One to two days.
+
+### 2.3 Cost and token usage on the leaderboard
+**Why.** `predictions.csv.gz` carries `prompt_tokens`, `completion_tokens`,
+`reasoning_tokens`, and `provider_reported_cost_usd`. The dashboard does
+not surface any of this. Practitioners deciding whether to use a model
+care about cost-per-correct-answer as much as raw accuracy.
+
+**Where.** Aggregate per-model usage in
+`analysis.build_dashboard_payload` and add a `usageStats` block to
+`modelStats`. Render as a cost column in `ModelLeaderboard.tsx` with a
+`$/100 outputs` framing and an opt-in toggle to switch the score axis to
+score-per-dollar.
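+
+A possible aggregation sketch (column names follow `predictions.csv.gz`; the exact
+payload wiring in `analysis.build_dashboard_payload` may differ):
+
+```python
+import pandas as pd
+
+predictions = pd.read_csv("predictions.csv.gz")
+usage = (
+    predictions.groupby("model")
+    .agg(
+        outputs=("prediction", "size"),
+        cost_usd=("provider_reported_cost_usd", "sum"),
+        completion_tokens=("completion_tokens", "sum"),
+    )
+    .assign(cost_per_100_outputs=lambda frame: 100 * frame["cost_usd"] / frame["outputs"])
+    .reset_index()
+)
+```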
+
+**Effort.** Two days.
+
+### 2.4 Scenario filtering and search
+**Why.** Today's scenario explorer picks a random scenario or accepts a
+URL hash. There is no way to filter by state, age, income range, or
+filing status. Audiences 2 and 3 want to find "households like mine" or
+"the failures that matter to my product".
+
+**Where.** Add a search/filter bar above the scenario explorer; index
+scenarios on a small set of facets.
+
+**Effort.** Two days.
+
+### 2.5 Federal+state joint accuracy view (US)
+**Why.** The paper now reports federal/state refundable credit joint
+within-10% accuracy as a failure mode. The site should mirror it as a
+small explainer next to the failure-mode panel: it is the cleanest
+demonstration of the "marginal accuracy hides joint errors" pattern.
+
+**Where.** `FailureModes.tsx` — add a third sub-section.
+
+**Effort.** Half a day.
+
+## Tier 3 — useful, ship when bandwidth allows
+
+### 3.1 Explanation browsing
+The dashboard already shows one model's explanation in a tooltip per
+(scenario, variable). A "show all explanations" mode that lays out
+alongside the numeric answers turns the site into a qualitative analysis
+tool.
+
+### 3.2 Citation snippet widget
+A small `Cite` button in the hero that copies a BibTeX entry for the
+current snapshot.
+
+### 3.3 Programmatic data download buttons
+`Download CSV / JSON` next to the leaderboard, scenario explorer, and
+heatmap. The data is in `data.json` already; the buttons just expose it.
+
+### 3.4 Snapshot history and changelog
+A `/changes` page listing prior snapshots, ranking diffs, and any
+methodology changes between them. Even a minimal version (table of dates
++ commit links) would help track movement over time.
+
+### 3.5 RSS/Atom feed of leaderboard updates
+For audience 1: a feed item per snapshot freeze.
+
+### 3.6 Reasoning vs non-reasoning slice
+Two of the listed Grok and Gemini variants are "reasoning" or "fast/non-
+reasoning". A toggle that groups by reasoning configuration would make
+the on-vs-off effect visible.
+
+### 3.7 Mobile leaderboard polish
+The leaderboard uses a wide table. On phones the eight columns get
+horizontal scroll. A condensed mobile rendering (model name, score,
+within-10%) plus expand-on-tap would help.
+
+## Tier 4 — speculative or larger investments
+
+### 4.1 Held-out protected leaderboard
+The biggest credibility limitation is open-set status. A held-out
+evaluation set (rotating monthly, prompts not in training corpora) would
+allow a separate "Protected" rank tab. Requires policy decisions on
+release cadence and a separate ingestion flow; not just a frontend
+change.
+
+### 4.2 Live evaluation against new models
+A "Run my model" submission flow with a sandboxed evaluation pipeline.
+Operationally complex (sandboxed inference, cost accounting, abuse
+controls) but the most common ask from audience 3.
+
+### 4.3 Country expansion previews
+PolicyEngine supports Canada and Israel. A "preview" tab listing the
+intended next country tracks signals roadmap to audience 1.
+
+### 4.4 Embedded reform-scenario explorer
+Today the benchmark scores baseline households. A reform-aware variant
+("What does each model think a 10% SNAP boost does to this household?")
+tests the marginal-effect ability the discussion section flags as future
+work.
+
+## Out-of-scope or reject
+
+- **Branded scoreboards per provider.** The site is policy-neutral; per-
+ provider marketing pages drift from that.
+- **Anything that changes the canonical numbers post-snapshot.** Live
+ re-runs against the same models would invalidate the manuscript
+ reference. Live evaluation must target a separate held-out set.
+- **Login or accounts.** The site's value is open access; auth is a
+ cost without a clear win.
+
+## Suggested first PR
+
+Combine **1.1 (open-set banner)**, **1.2 (sensitivity selector)**, and
+**1.3 (bootstrap rank intervals)** into a single PR. They share a small
+amount of new pre-computed payload in `analysis.build_dashboard_payload`,
+they do not require new routes, and together they shift the leaderboard
+from "headline" to "honest defended ranking" — which is the change with
+the largest credibility win for the smallest engineering cost.
diff --git a/paper/index.qmd b/paper/index.qmd
index 6118ef9..c305680 100644
--- a/paper/index.qmd
+++ b/paper/index.qmd
@@ -38,8 +38,15 @@ SNAPSHOT_EXPORT = "app/src/data.json"
US_RUN_LABEL = "us_100_stable_models_20260501_062249"
UK_RUN_LABEL = "uk_100_hicbc_fixed_20260501_083402"
IMPACT_SNAPSHOT_DIR = "paper/snapshot/20260501"
-UK_TRANSFER_ARTIFACT = "enhanced_cps_2025.h5 public UK calibrated transfer artifact"
+UK_TRANSFER_ARTIFACT = "enhanced_cps_2025.h5"
UK_TRANSFER_ARTIFACT_SHA256 = "199ebc61d29231b4799ad337a95393765b5fb5aede1834b93ff2acecceded866"
+UK_TRANSFER_ARTIFACT_REPO = "PolicyEngine/policyengine-uk-data"
+UK_TRANSFER_ARTIFACT_PINNED_COMMIT = "9514dfb7ec607897c9f7122a2e073b922c9fd8b6"
+UK_TRANSFER_ARTIFACT_PINNED_URL = (
+ "https://raw.githubusercontent.com/PolicyEngine/policyengine-uk-data/"
+ f"{UK_TRANSFER_ARTIFACT_PINNED_COMMIT}/policyengine_uk_data/storage/"
+ f"{UK_TRANSFER_ARTIFACT}"
+)
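+
+
+def verify_uk_transfer_artifact() -> None:
+    """Sketch only (not called during the build): re-check the pinned UK artifact.
+
+    Downloads the artifact from the pinned raw.githubusercontent.com URL above and
+    compares its SHA-256 with UK_TRANSFER_ARTIFACT_SHA256; assumes network access.
+    """
+    import hashlib
+    import urllib.request
+
+    with urllib.request.urlopen(UK_TRANSFER_ARTIFACT_PINNED_URL) as response:
+        digest = hashlib.sha256(response.read()).hexdigest()
+    if digest != UK_TRANSFER_ARTIFACT_SHA256:
+        raise ValueError(f"UK transfer artifact hash mismatch: {digest}")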
def file_hash_prefix(path: str) -> str:
@@ -183,6 +190,138 @@ def impact_scores() -> pd.DataFrame:
return pd.concat(rows, ignore_index=True)
+def federal_state_joint_accuracy() -> pd.DataFrame:
+ """Within-10% accuracy for federal vs state refundable credits and the joint."""
+ predictions = country_payload("us")["scenarioPredictions"]
+ rows = []
+ for scenario_id, variables in predictions.items():
+ if (
+ "federal_refundable_credits" not in variables
+ or "state_refundable_credits" not in variables
+ ):
+ continue
+ federal = variables["federal_refundable_credits"]
+ state = variables["state_refundable_credits"]
+ for model in federal:
+ if model not in state:
+ continue
+ rows.append(
+ {
+ "scenario_id": scenario_id,
+ "model": model,
+ "fed_truth": federal[model]["groundTruth"],
+ "fed_pred": federal[model].get("prediction"),
+ "state_truth": state[model]["groundTruth"],
+ "state_pred": state[model].get("prediction"),
+ }
+ )
+ frame = pd.DataFrame(rows)
+
+ def hit_within_10(truth: float, pred: float | None) -> bool:
+ if pred is None or pd.isna(pred):
+ return False
+ if truth == 0:
+ return abs(pred) <= 1.0
+ return abs(pred - truth) / abs(truth) <= 0.10
+
+ frame["fed_hit"] = [
+ hit_within_10(truth, pred)
+ for truth, pred in zip(frame["fed_truth"], frame["fed_pred"], strict=True)
+ ]
+ frame["state_hit"] = [
+ hit_within_10(truth, pred)
+ for truth, pred in zip(
+ frame["state_truth"], frame["state_pred"], strict=True
+ )
+ ]
+ frame["both_hit"] = frame["fed_hit"] & frame["state_hit"]
+ summary = (
+ frame.groupby("model")[["fed_hit", "state_hit", "both_hit"]]
+ .mean()
+ .reset_index()
+ .sort_values("both_hit", ascending=False)
+ )
+ summary["fed_hit"] = (summary["fed_hit"] * 100).round(1)
+ summary["state_hit"] = (summary["state_hit"] * 100).round(1)
+ summary["both_hit"] = (summary["both_hit"] * 100).round(1)
+ summary.columns = [
+ "Model",
+ "Federal within 10%",
+ "State within 10%",
+ "Joint within 10%",
+ ]
+ return summary
+
+
+def impact_floor_sensitivity() -> pd.DataFrame:
+ """Top US/UK models under varying household-equal impact-score floors.
+
+ Uses the canonical ``policybench.analysis.household_equal_impact_scores``
+ over the committed snapshot ground-truth and per-run prediction CSVs so
+ the table matches the same impact statistic used in
+ ``us_impact_summary_by_model.csv`` and ``uk_impact_summary_by_model.csv``.
+ """
+ from policybench.analysis import household_equal_impact_scores
+
+ run_paths = {
+ "us": ROOT / IMPACT_SNAPSHOT_DIR / "runs" / US_RUN_LABEL,
+ "uk": ROOT / IMPACT_SNAPSHOT_DIR / "runs" / UK_RUN_LABEL,
+ }
+ ground_truths: dict[str, pd.DataFrame] = {}
+ predictions: dict[str, pd.DataFrame] = {}
+ for country, run_dir in run_paths.items():
+ ground_truths[country] = pd.read_csv(run_dir / "ground_truth.csv")
+ predictions[country] = pd.read_csv(
+ run_dir / "predictions.csv.gz",
+ usecols=["model", "scenario_id", "variable", "prediction"],
+ )
+
+ def country_table(floor_share: float) -> pd.DataFrame:
+ country_rows = []
+ for country in ["us", "uk"]:
+ scenarios = household_equal_impact_scores(
+ ground_truths[country],
+ predictions[country],
+ floor_share=floor_share,
+ )
+ if scenarios.empty:
+ continue
+ country_summary = (
+ scenarios.groupby("model")["impact_score"].mean().reset_index()
+ )
+ country_summary["score"] = country_summary["impact_score"] * 100
+ country_summary["country"] = country
+ country_rows.append(country_summary[["country", "model", "score"]])
+ if not country_rows:
+ return pd.DataFrame(columns=["country", "model", "score"])
+ return pd.concat(country_rows, ignore_index=True)
+
+ rows = []
+ for floor in [0.0, 0.1, 0.3, 0.5, 1.0]:
+ country_summary = country_table(floor)
+ # Skip floors where any required country is missing or fewer than three
+ # global rows are available, rather than relying on a single-country
+ # fallback inside global_scores or crashing on iloc out-of-bounds.
+ countries_present = set(country_summary["country"].unique())
+ if not {"us", "uk"} <= countries_present:
+ continue
+ global_summary = global_scores(country_summary)
+ if len(global_summary) < 3:
+ continue
+ rank1 = global_summary.iloc[0]
+ rank2 = global_summary.iloc[1]
+ rank3 = global_summary.iloc[2]
+ rows.append(
+ {
+ "Floor": f"{floor:.1f}",
+ "Rank 1": f"{rank1.model} ({rank1.score:.1f})",
+ "Rank 2": f"{rank2.model} ({rank2.score:.1f})",
+ "Rank 3": f"{rank3.model} ({rank3.score:.1f})",
+ }
+ )
+ return pd.DataFrame(rows)
+
+
def simple_baselines() -> pd.DataFrame:
prediction_rows = flatten_predictions()
baselines = []
@@ -453,6 +592,8 @@ sensitivity = sensitivity_rows()
model_runs = model_run_table()
bootstrap_uncertainty = bootstrap_global_uncertainty()
baseline_summary = simple_baselines()
+federal_state_joint = federal_state_joint_accuracy()
+impact_floor = impact_floor_sensitivity()
snapshot_provenance = pd.DataFrame(
[
["Snapshot date", SNAPSHOT_DATE],
@@ -475,6 +616,8 @@ snapshot_provenance = pd.DataFrame(
["US data bundle", f"{bundle_version('us', 'data_package')} {bundle_version('us', 'data_version')}"],
["UK runtime bundle", f"{bundle_version('uk', 'data_package')} {bundle_version('uk', 'data_version')}"],
["UK scenario source", UK_TRANSFER_ARTIFACT],
+ ["UK scenario-source repository", UK_TRANSFER_ARTIFACT_REPO],
+ ["UK scenario-source pinned commit", UK_TRANSFER_ARTIFACT_PINNED_COMMIT],
["UK scenario-source SHA-256", UK_TRANSFER_ARTIFACT_SHA256],
["Households", "100 US and 100 UK"],
["Models", f"{len(global_model)} shared models"],
@@ -573,6 +716,8 @@ The benchmark uses a bounded `0-100` score. For amount variables, the score aver
For binary outputs, the score is exact accuracy. This keeps `100` as the ceiling while still giving partial credit for near misses on amount outputs. PolicyBench also tracks mean absolute error and related error metrics, but those are secondary to the bounded score. This choice preserves exact-match comparability while avoiding the failure mode that recent numeric-evaluation papers have criticized [@zhou2025tempanswerqa].
+Because the four thresholds are nested (exact $\subseteq$ within-1% $\subseteq$ within-5% $\subseteq$ within-10%), averaging their indicator functions is equivalent to assigning step credit by error band: predictions that match exactly receive 1.00, predictions inside 1% but not exact receive 0.75, predictions inside 5% but outside 1% receive 0.50, predictions inside 10% but outside 5% receive 0.25, and predictions outside 10% receive 0.00. The score is therefore a partial-credit rule that rewards tighter agreement more than looser agreement, not an unweighted aggregate of independent hit rates.
+
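+Stated as code, that nested-threshold average is the following step-credit rule (an illustrative restatement of the definition above, not the benchmark's scoring implementation):
+
+```python
+# Illustrative restatement of the nested-threshold average as step credit;
+# numeric parsing and the zero-reference tolerance are handled upstream.
+def step_credit(relative_error: float, exact: bool) -> float:
+    """Partial credit implied by averaging the four nested hit indicators."""
+    if exact:
+        return 1.00
+    if relative_error <= 0.01:
+        return 0.75
+    if relative_error <= 0.05:
+        return 0.50
+    if relative_error <= 0.10:
+        return 0.25
+    return 0.00
+```
+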
All requested US outputs are annual amounts or annual eligibility indicators for tax year 2026; UK outputs are annual amounts or annual eligibility indicators for fiscal year 2026-27. For currency amounts, the "exact" hit rate means within `1` currency unit of the reference value after numeric parsing. Percentage-threshold hit rates use relative error when the reference value is nonzero and the same `1` currency-unit absolute tolerance when the reference value is zero. Binary outputs are parsed as `0` or `1` eligibility flags and scored by exact classification accuracy. Missing or unparseable numeric answers receive zero score for that requested output.
Aggregation proceeds in four steps. First, each household-output prediction receives a `0-100` score. Second, person-level coverage rows are averaged to their program group. Third, output groups are averaged with equal weight within each country. Fourth, global scores average the US and UK country scores for models run in both countries. The global score is therefore an equal-country summary for this benchmark sample, not a universally authoritative measure of tax-benefit competence. Household-equal impact scores and other alternative views are reported as sensitivity checks.
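+
+In code, the country and global aggregation (steps three and four) reduce to a grouped-mean chain over already-scored rows; this sketch assumes a long-format table with `country`, `model`, `output_group`, and per-household `score` columns (column names are illustrative, not the committed analysis schema):
+
+```python
+# Illustrative: assumes per-prediction scoring (step one) and the
+# person-to-program-group averaging (step two) already produced these rows.
+import pandas as pd
+
+
+def global_leaderboard(scores: pd.DataFrame) -> pd.DataFrame:
+    """Equal-output-group country scores, then an equal-country global score."""
+    country_scores = (
+        scores.groupby(["country", "model", "output_group"])["score"]
+        .mean()                                # household average per group
+        .groupby(["country", "model"])
+        .mean()                                # equal weight across groups
+    )
+    wide = country_scores.unstack("country").dropna()  # models run in both
+    wide["global"] = wide.mean(axis=1)         # equal-country average
+    return wide.sort_values("global", ascending=False).reset_index()
+```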
@@ -601,7 +746,7 @@ model_runs
### United States
-The US benchmark is built from Enhanced Current Population Survey (CPS)-derived households using PolicyEngine US. The sampled households are filtered to keep a single-tax-unit structure while retaining variation in filing status, household composition, and income sources. Prompts include nonzero promptable raw inputs across relevant entities rather than a hand-curated summary, so the models see many of the same facts the simulator receives.
+The US benchmark is built from Enhanced Current Population Survey (CPS)-derived households using PolicyEngine US. The sampled households are filtered to keep a single-tax-unit, single-family, single-Supplemental Poverty Measure (SPM)-unit structure with at least one adult and a supported filing status. The 2024 Enhanced CPS source contains 41,314 households; 30,173 (73.0%) pass the filter and form the eligible draw. The 27.0% excluded by the filter include multi-tax-unit households (e.g., adult roommates), multi-family households, multi-SPM-unit households, and households whose head reports a filing status outside the supported set. These excluded compositions are exactly the kind of cases where federal/state credit allocations and benefit-unit rules become hardest, so the eligible draw is a tractable subset rather than the full distribution of US households. Prompts include nonzero promptable raw inputs across relevant entities rather than a hand-curated summary, so the models see many of the same facts the simulator receives. Filing status is not stated in the prompt; the reference computation infers it from tax-unit role flags. Models therefore see the same household facts that drive the reference filing-status assignment, but they do not receive that assignment as a label.
The current US release evaluates 19 output groups spanning federal income tax, refundable credits, payroll and self-employment tax, state and local income tax, Supplemental Nutrition Assistance Program (SNAP), Supplemental Security Income (SSI), Temporary Assistance for Needy Families (TANF), Affordable Care Act (ACA) premium tax credits, school-meal eligibility, and person-level coverage eligibility for the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC), Medicaid, the Children's Health Insurance Program (CHIP), Medicare, Head Start, and Early Head Start.
@@ -615,13 +760,13 @@ scope_rationale
### United Kingdom
-The UK benchmark is built from a calibrated public transfer dataset scored through PolicyEngine UK. The current public build starts from a public export of benchmark-compatible households from PolicyEngine US Enhanced CPS, maps those records into UK-facing inputs, and recalibrates them to selected UK targets. This creates a public UK-policy transfer benchmark without publishing restricted household microdata; it is not a representative evaluation over native UK household records and should not be treated as a substitute for Family Resources Survey (FRS)-based UK microdata. The current UK release evaluates seven outputs: Income Tax, National Insurance, Capital Gains Tax, Child Benefit, Universal Credit, Pension Credit, and Personal Independence Payment (PIP). Outputs that depend on status or award facts use prompt-visible facts rather than hidden take-up labels; for example, PIP-positive cases list the daily living and mobility award components used by PolicyEngine.
+The UK benchmark is built from a calibrated public transfer dataset scored through PolicyEngine UK. The current public build starts from a public export of benchmark-compatible households from PolicyEngine US Enhanced CPS, maps those records into UK-facing inputs, and recalibrates them to selected UK targets. The resulting `enhanced_cps_2025.h5` artifact is checked in to the public PolicyEngine/policyengine-uk-data GitHub repository; the manuscript pins commit `9514dfb7ec607897c9f7122a2e073b922c9fd8b6` so that a third party can retrieve the exact file used here. The artifact contains 28,532 households; 28,502 (99.9%) pass the eligibility filter that retains households with one benefit unit and one or two adults. This creates a public UK-policy transfer benchmark without publishing restricted household microdata; it is not a representative evaluation over native UK household records and should not be treated as a substitute for Family Resources Survey (FRS)-based UK microdata. The current UK release evaluates seven outputs: Income Tax, National Insurance, Capital Gains Tax, Child Benefit, Universal Credit, Pension Credit, and Personal Independence Payment (PIP). Outputs that depend on status or award facts use prompt-visible facts rather than hidden take-up labels; for example, PIP-positive cases list the daily living and mobility award components used by PolicyEngine.
The UK data path is more synthetic than the enhanced FRS pipeline and inherits limitations from cross-country transfer, calibration choices, and the subset of variables that can be made prompt-visible. It supports the current public cross-country benchmark, but it is not equivalent to an enhanced-FRS-based benchmark and should not be used to make population-representative claims about UK households [@sutherland2023euromod].
### Reference-output credibility
-PolicyBench treats PolicyEngine outputs as benchmark reference outputs, not as administrative records. The reference source is nevertheless stronger than an ad hoc answer key: PolicyEngine is open source, used for household calculators and reform analysis, and externally checked in specific domains. In the UK, No. 10's data science team adapted PolicyEngine's open-source microsimulation model for experimental policy simulation, with validation against external projections before use [@woodruff2026no10; @policyengine2026downing]. In the US, PolicyEngine's state tax modelling has been validated against the National Bureau of Economic Research (NBER) TAXSIM model, with reported penny-level agreement for the vast majority of 2021 test cases [@policyengine2024statetax; @feenberg1993taxsim]. PolicyEngine has also signed a memorandum of understanding with the Federal Reserve Bank of Atlanta for future validation work against the Atlanta Fed's Policy Rules Database [@policyengine2025atlantafed; @atlantafed2026prd]. The Atlanta Fed sources are a caveat rather than evidence of completed validation for this benchmark: they document planned collaboration and the comparison source, not finished checks of the frozen PolicyBench outputs. Taken together, these sources support using PolicyEngine as a transparent reference implementation with partial external validation, but they do not validate every benchmark output.
+PolicyBench treats PolicyEngine outputs as benchmark reference outputs, not as administrative records. The reference source is nevertheless stronger than an ad hoc answer key: PolicyEngine is open source, used for household calculators and reform analysis, and externally checked in specific domains. In the UK, No. 10's data science team adapted PolicyEngine's open-source microsimulation model for experimental policy simulation, with validation against external projections before use [@woodruff2026no10; @policyengine2026downing]. In the US, PolicyEngine reports matching the National Bureau of Economic Research (NBER) TAXSIM-35 model to the cent on the vast majority of cases for the 2021 tax year across hundreds of thousands of tax units per state, with state-specific differences documented in PolicyEngine's integration tests [@policyengine2024statetax; @feenberg1993taxsim]. We do not restate that comparison as a single percentage because the published source uses qualitative phrasing rather than a headline accuracy number. PolicyEngine has also signed a memorandum of understanding with the Federal Reserve Bank of Atlanta for future validation work against the Atlanta Fed's Policy Rules Database [@policyengine2025atlantafed; @atlantafed2026prd]. The Atlanta Fed sources are a caveat rather than evidence of completed validation for this benchmark: they document planned collaboration and the comparison source, not finished checks of the frozen PolicyBench outputs. Taken together, these sources support using PolicyEngine as a transparent reference implementation with partial external validation, but they do not validate every benchmark output.
This does not make PolicyEngine infallible. During benchmark development, we performed a manual, developer-led discrepancy review, with LLM assistance used to triage and summarize candidate cases surfaced by model explanations. @tbl-reference-review summarizes the main discrepancy classes reviewed before the frozen snapshot. In those reviewed classes, discrepancies reflected model mistakes, prompt ambiguity later corrected, or upstream data/model issues fixed before the frozen snapshot; the reviewed discrepancy classes did not identify unresolved PolicyEngine reference-output defects. This is a qualitative development audit, not an independent sampled validation study or exhaustive validation of every reference value.
@@ -685,9 +830,17 @@ The primary leaderboard gives each country equal weight globally and each output
sensitivity
```
+The household-equal impact score weights each output group inside a household by a blend of an equal-weight floor and the absolute reference-value share. The default uses a `0.3` floor. @tbl-impact-floor reports the top three global ranks under floors of `0.0`, `0.1`, `0.3`, `0.5`, and `1.0`. Floor `1.0` collapses to the equal-output baseline; floor `0.0` is pure dollar-share weighting. Rank stability across this range means the impact view is not driven by the specific floor choice.
+
+```{python}
+#| label: tbl-impact-floor
+#| tbl-cap: "Top global ranks under varying household-equal impact-score floors."
+impact_floor
+```
+
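+One blend consistent with those endpoints, stated here as an illustrative formalization rather than a restatement of the committed implementation (`policybench.analysis.household_equal_impact_scores` is canonical), is
+
+$$
+w_g = f \cdot \frac{1}{G} + (1 - f) \cdot \frac{\lvert y_g \rvert}{\sum_{h=1}^{G} \lvert y_h \rvert},
+$$
+
+where $f$ is the floor share, $G$ is the number of output groups in the household, and $y_g$ is the reference value of group $g$. Setting $f = 1$ recovers equal output weights and $f = 0$ recovers pure reference-value shares, matching the endpoints in @tbl-impact-floor.
+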
### Sampling uncertainty
-The manuscript snapshot uses 100 households per country, so small score gaps should not be overinterpreted. @tbl-bootstrap reports a household-resampling bootstrap over the frozen sample using 1,000 percentile bootstrap draws. It resamples households within each country, recomputes country scores with equal output-group weighting, then recomputes the equal-country global score. The table is a sampling-uncertainty check for this household draw, not uncertainty over future model releases or prompt variants.
+The manuscript snapshot uses 100 households per country, so small score gaps should not be overinterpreted. @tbl-bootstrap reports a household-resampling bootstrap over the frozen sample using 1,000 percentile bootstrap draws. It resamples households within each country, recomputes country scores with equal output-group weighting, then recomputes the equal-country global score. The intervals therefore quantify uncertainty from the specific 100-household draw used here. They do not capture (i) prompt-template variance, since the manuscript uses a single template per country with no paraphrase resamples; (ii) decoding stochasticity, since each model's snapshot prediction is a single sample at the provider's default decoding settings; (iii) provider-side drift after 2026-05-01 in alias-resolved model weights; or (iv) reference-output uncertainty inside PolicyEngine. Total uncertainty over those sources is wider than the household-bootstrap intervals.
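+
+A minimal sketch of the resampling loop follows; it is illustrative rather than the manuscript's `bootstrap_global_uncertainty` implementation, and the score-frame layout (one row per `country`, `model`, `scenario_id`, `output_group` with a 0-100 `score`) is an assumption:
+
+```python
+# Illustrative household-resampling percentile bootstrap over a long-format
+# score table; the column layout is an assumption for this sketch.
+import numpy as np
+import pandas as pd
+
+
+def bootstrap_global_intervals(
+    scores: pd.DataFrame, draws: int = 1000, seed: int = 0
+) -> pd.DataFrame:
+    rng = np.random.default_rng(seed)
+    samples = []
+    for _ in range(draws):
+        parts = []
+        for _, frame in scores.groupby("country"):
+            households = frame["scenario_id"].unique()
+            drawn = rng.choice(households, size=len(households), replace=True)
+            # keep duplicates so resampled households count multiple times
+            parts.extend(frame[frame["scenario_id"] == h] for h in drawn)
+        resampled = pd.concat(parts, ignore_index=True)
+        global_score = (
+            resampled.groupby(["country", "model", "output_group"])["score"]
+            .mean()                               # household average per group
+            .groupby(["country", "model"])
+            .mean()                               # equal output-group weight
+            .groupby("model")
+            .mean()                               # equal-country global score
+        )
+        samples.append(global_score)
+    draws_wide = pd.concat(samples, axis=1)       # models x bootstrap draws
+    return draws_wide.quantile([0.025, 0.975], axis=1).T
+```
+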
```{python}
#| label: tbl-bootstrap
@@ -739,7 +892,15 @@ First, models miss positive tax and benefit quantities more often than zero case
Second, the UK benchmark shows the same split between tax calculations and many benefit outputs. Income tax and National Insurance score below the benefit outputs in the frozen manuscript snapshot. Positive Universal Credit and Pension Credit cases remain difficult, so the result should not be read as a general claim that benefits are easy.
-Third, structured-output reliability can affect rankings. The current manuscript snapshot has full parse coverage for the included models, but the benchmark still tracks coverage because failures to return parseable numeric values should count as benchmark failures [@shorten2024structuredrag].
+Third, joint accuracy across interacting components is lower than marginal accuracy on either component. @tbl-fed-state shows within-10% accuracy for `federal_refundable_credits`, `state_refundable_credits`, and the conjunction of both within the same household. The joint hit rate is consistently lower than either marginal hit rate, so leaderboard scores that average across outputs understate how often a model gets a single household's federal/state credit allocation jointly correct.
+
+```{python}
+#| label: tbl-fed-state
+#| tbl-cap: "US within-10% accuracy on federal vs state refundable credits and the household-level joint."
+federal_state_joint
+```
+
+Fourth, structured-output reliability can affect rankings. The current manuscript snapshot has full parse coverage for the included models, but the benchmark still tracks coverage because failures to return parseable numeric values should count as benchmark failures [@shorten2024structuredrag].
## Limitations
diff --git a/paper/snapshot/20260501/manifest.json b/paper/snapshot/20260501/manifest.json
index b1b5444..c5b2d11 100644
--- a/paper/snapshot/20260501/manifest.json
+++ b/paper/snapshot/20260501/manifest.json
@@ -60,14 +60,14 @@
"rendered_paper_artifacts": {
"pdf": {
"path": "app/public/paper/policybench.pdf",
- "sha256": "d48b665ecdce99472a4e28ef4d0e3a2a058cb4e415408ec0a99c1fbbbb500805"
+ "sha256": "5773a79c1077fed38916049591293347743bfa9cb4fc270f496076ce297b7c09"
},
"web": {
"path": "app/public/paper/web",
"files": {
"figures/global_leaderboard.png": "1211ab4f6d5a806243f08b35b394152d169eb1e64b5636a67873fa43810471a2",
"figures/positive_zero_scatter.png": "784d60297b324cb0fa7fefc859b29c5cb449e9b1690d187a24cdd2a362962dbc",
- "index.html": "33505e60896cf254b578c38fb0cf40cf72c1e9bd87681ff95d8853acf9de7580",
+ "index.html": "e9fe3433eef9b9b78798c25d3c1aa32f5ab7b5a8370ca4296ac299482194d667",
"pe-tokens.css": "8f24d8da26f583c8ffddffcdcd172b6d52cbecfec20eda55bd39d7aa829f41d8",
"policybench-theme.css": "0e12c5fd615558259e5bce0167a38424e54f9ceb280666c4afd660d759cd1cb9",
"site_libs/clipboard/clipboard.min.js": "e17a1d816e13c0826e0ed7febfabc3277f45571234bde0bf9120829a7169edc9",
@@ -89,12 +89,16 @@
"reproducibility_notes": [
"The top-level frozen scenario, reference output, and impact summary CSVs are byte-identical to the corresponding compact source run artifacts copied under paper/snapshot/20260501/runs/.",
"Raw provider responses are retained in compressed source-run predictions.csv.gz files. The separate LiteLLM cache remains local-only because it is a generated request cache, not the canonical snapshot artifact.",
- "Model APIs and upstream model aliases may change after 2026-05-01, so exact reruns can diverge even with the committed household inputs, reference outputs, parsed dashboard export, and analysis summaries."
+ "Model APIs and upstream model aliases may change after 2026-05-01, so exact reruns can diverge even with the committed household inputs, reference outputs, parsed dashboard export, and analysis summaries.",
+ "Most model identifiers in policybench/config.py are provider aliases (claude-opus-4.7, gpt-5.5, gemini-3.1-pro-preview, etc.) rather than dated revisions. This snapshot's predictions.csv.gz files therefore do not include a provider system_fingerprint column. Subsequent runs collected with the post-snapshot policybench/eval_no_tools.py capture provider_response_id, provider_system_fingerprint, and provider_resolved_model so future snapshots can pin alias resolutions explicitly."
],
"uk_transfer_dataset": {
- "artifact": "enhanced_cps_2025.h5 public UK calibrated transfer artifact",
+ "artifact": "enhanced_cps_2025.h5",
"sha256": "199ebc61d29231b4799ad337a95393765b5fb5aede1834b93ff2acecceded866",
- "note": "The public paper describes this as a UK calibrated transfer dataset, not native UK survey microdata or enhanced FRS."
+ "public_repository": "https://github.com/PolicyEngine/policyengine-uk-data",
+ "pinned_commit": "9514dfb7ec607897c9f7122a2e073b922c9fd8b6",
+ "pinned_url": "https://raw.githubusercontent.com/PolicyEngine/policyengine-uk-data/9514dfb7ec607897c9f7122a2e073b922c9fd8b6/policyengine_uk_data/storage/enhanced_cps_2025.h5",
+ "note": "UK calibrated transfer dataset derived from benchmark-compatible PolicyEngine US Enhanced CPS households. The artifact is checked in to the public PolicyEngine/policyengine-uk-data GitHub repository at the pinned commit; later commits in that repository may rebuild the file. It is not native UK survey microdata, enhanced FRS, or population-representative."
},
"scope": {
"households": {
diff --git a/policybench/config.py b/policybench/config.py
index ac557ed..3195361 100644
--- a/policybench/config.py
+++ b/policybench/config.py
@@ -19,6 +19,13 @@
# Canonical default benchmark models. This list should track the published
# no-tools leaderboard rather than every model ever probed in the repo.
+#
+# Most identifiers below are provider aliases (e.g. ``claude-opus-4.7`` or
+# ``gpt-5.5``), not dated revisions. Provider responses can be routed to
+# different underlying weights over time. Runs after 2026-05-01 capture
+# ``provider_response_id``, ``provider_system_fingerprint``, and
+# ``provider_resolved_model`` in ``predictions.csv.gz``; older snapshots
+# only have the alias and the raw response payload.
MODELS = {
"claude-opus-4.7": "claude-opus-4-7",
"claude-sonnet-4.6": "claude-sonnet-4-6",
diff --git a/policybench/eval_no_tools.py b/policybench/eval_no_tools.py
index 2a18a07..2173c55 100644
--- a/policybench/eval_no_tools.py
+++ b/policybench/eval_no_tools.py
@@ -120,6 +120,21 @@ def _get_usage_value(obj, key: str):
return getattr(obj, key, None)
+def _extract_provider_fingerprint(response) -> dict:
+ """Capture the provider response id, system fingerprint, and resolved model.
+
+ These fields make it possible to audit which underlying model build a
+ provider routed an alias to (e.g. ``claude-opus-4-7`` resolving to a
+ dated weights revision). Providers that do not report a particular field
+ return ``None`` for that field.
+ """
+ return {
+ "provider_response_id": _get_usage_value(response, "id"),
+ "provider_system_fingerprint": _get_usage_value(response, "system_fingerprint"),
+ "provider_resolved_model": _get_usage_value(response, "model"),
+ }
+
+
def _extract_usage_metadata(
response, model_id: str, messages: list[dict], content: str
) -> dict:
@@ -172,6 +187,7 @@ def _extract_usage_metadata(
provider_reported_cost_usd is None and total_cost_usd is not None
),
"estimated_cost_usd": total_cost_usd,
+ **_extract_provider_fingerprint(response),
}
@@ -566,6 +582,13 @@ def _sum_optional_numbers(values: Iterable[float | int | None]) -> float | None:
return sum(present)
+def _first_non_null(values: Iterable):
+ for value in values:
+ if value is not None and value != "":
+ return value
+ return None
+
+
def _combine_raw_responses(raw_responses: list[str | None]) -> str | None:
present = [raw_response for raw_response in raw_responses if raw_response]
if not present:
@@ -605,6 +628,9 @@ def _aggregate_request_results(results: list[dict]) -> dict:
"total_cost_usd": None,
"cost_is_estimated": None,
"estimated_cost_usd": None,
+ "provider_response_id": None,
+ "provider_system_fingerprint": None,
+ "provider_resolved_model": None,
}
cost_flags = [result.get("cost_is_estimated") for result in results]
@@ -650,6 +676,15 @@ def _aggregate_request_results(results: list[dict]) -> dict:
"estimated_cost_usd": _sum_optional_numbers(
result.get("estimated_cost_usd") for result in results
),
+ "provider_response_id": _first_non_null(
+ result.get("provider_response_id") for result in results
+ ),
+ "provider_system_fingerprint": _first_non_null(
+ result.get("provider_system_fingerprint") for result in results
+ ),
+ "provider_resolved_model": _first_non_null(
+ result.get("provider_resolved_model") for result in results
+ ),
}
@@ -1231,6 +1266,15 @@ def run_single_no_tools(
"estimated_cost_usd": _sum_optional_field(
chunk_results, "estimated_cost_usd"
),
+ "provider_response_id": _first_non_null(
+ result.get("provider_response_id") for result in chunk_results
+ ),
+ "provider_system_fingerprint": _first_non_null(
+ result.get("provider_system_fingerprint") for result in chunk_results
+ ),
+ "provider_resolved_model": _first_non_null(
+ result.get("provider_resolved_model") for result in chunk_results
+ ),
}
request_results = []
@@ -1695,6 +1739,13 @@ def run_no_tools_eval(
if estimated_cost_usd is not None
else None
),
+ "provider_response_id": result.get("provider_response_id"),
+ "provider_system_fingerprint": result.get(
+ "provider_system_fingerprint"
+ ),
+ "provider_resolved_model": result.get(
+ "provider_resolved_model"
+ ),
}
)
@@ -1794,6 +1845,9 @@ def run_no_tools_single_output_eval(
"total_cost_usd": None,
"cost_is_estimated": None,
"estimated_cost_usd": None,
+ "provider_response_id": None,
+ "provider_system_fingerprint": None,
+ "provider_resolved_model": None,
}
call_id = ":".join(
@@ -1827,6 +1881,13 @@ def run_no_tools_single_output_eval(
"total_cost_usd": result.get("total_cost_usd"),
"cost_is_estimated": result.get("cost_is_estimated"),
"estimated_cost_usd": result.get("estimated_cost_usd"),
+ "provider_response_id": result.get("provider_response_id"),
+ "provider_system_fingerprint": result.get(
+ "provider_system_fingerprint"
+ ),
+ "provider_resolved_model": result.get(
+ "provider_resolved_model"
+ ),
}
)
diff --git a/policybench/policyengine_runtime.py b/policybench/policyengine_runtime.py
index e4e91f9..bc165c0 100644
--- a/policybench/policyengine_runtime.py
+++ b/policybench/policyengine_runtime.py
@@ -46,17 +46,29 @@
UK_TRANSFER_DATASET = {
"runtime_dataset": "enhanced_cps_2025",
+ "runtime_dataset_filename": "enhanced_cps_2025.h5",
+ "runtime_dataset_repo": "PolicyEngine/policyengine-uk-data",
+ "runtime_dataset_pinned_commit": ("9514dfb7ec607897c9f7122a2e073b922c9fd8b6"),
+ "runtime_dataset_pinned_url": (
+ "https://raw.githubusercontent.com/PolicyEngine/"
+ "policyengine-uk-data/9514dfb7ec607897c9f7122a2e073b922c9fd8b6/"
+ "policyengine_uk_data/storage/enhanced_cps_2025.h5"
+ ),
"runtime_dataset_uri": (
- "policyengine_uk_data/storage/enhanced_cps_2025.h5 "
- "from the public UK calibrated transfer artifact"
+ "policyengine_uk_data/storage/enhanced_cps_2025.h5 from the public "
+ "PolicyEngine/policyengine-uk-data repository, pinned to commit "
+ "9514dfb7ec607897c9f7122a2e073b922c9fd8b6"
),
"runtime_dataset_sha256": (
"199ebc61d29231b4799ad337a95393765b5fb5aede1834b93ff2acecceded866"
),
"runtime_dataset_note": (
"UK calibrated transfer dataset derived from benchmark-compatible "
- "PolicyEngine US Enhanced CPS households; not native UK survey microdata "
- "or enhanced FRS."
+ "PolicyEngine US Enhanced CPS households. The artifact is checked "
+ "into the public PolicyEngine/policyengine-uk-data GitHub "
+ "repository at the pinned commit; subsequent commits in that "
+ "repository may rebuild the file. It is not native UK survey "
+ "microdata, enhanced FRS, or population-representative."
),
}
diff --git a/policybench/scenarios.py b/policybench/scenarios.py
index f15fb59..a3087c4 100644
--- a/policybench/scenarios.py
+++ b/policybench/scenarios.py
@@ -1,7 +1,10 @@
"""Household scenario generation for PolicyBench."""
+import hashlib
import json
import os
+import urllib.error
+import urllib.request
from dataclasses import dataclass, field
from functools import lru_cache
from pathlib import Path
@@ -300,14 +303,26 @@
"employment_income_before_lsr",
)
-UK_DATASET_CANDIDATES = (
- Path(__file__).resolve().parents[1] / "data" / "enhanced_cps_2025.h5",
+UK_TRANSFER_DATASET_FILENAME = "enhanced_cps_2025.h5"
+UK_TRANSFER_DATASET_SHA256 = (
+ "199ebc61d29231b4799ad337a95393765b5fb5aede1834b93ff2acecceded866"
+)
+UK_TRANSFER_DATASET_PINNED_COMMIT = "9514dfb7ec607897c9f7122a2e073b922c9fd8b6"
+UK_TRANSFER_DATASET_PINNED_URL = (
+ "https://raw.githubusercontent.com/PolicyEngine/policyengine-uk-data/"
+ f"{UK_TRANSFER_DATASET_PINNED_COMMIT}/policyengine_uk_data/storage/"
+ f"{UK_TRANSFER_DATASET_FILENAME}"
+)
+UK_TRANSFER_DATASET_PUBLIC_REPO = "https://github.com/PolicyEngine/policyengine-uk-data"
+UK_TRANSFER_DATASET_LOCAL_CANDIDATES = (
+ Path(__file__).resolve().parents[1] / "data" / UK_TRANSFER_DATASET_FILENAME,
Path(__file__).resolve().parents[2]
/ "policyengine-uk-data"
/ "policyengine_uk_data"
/ "storage"
- / "enhanced_cps_2025.h5",
+ / UK_TRANSFER_DATASET_FILENAME,
)
+UK_DATASET_CANDIDATES = UK_TRANSFER_DATASET_LOCAL_CANDIDATES
@dataclass(frozen=True)
@@ -602,23 +617,141 @@ def load_enhanced_cps_person_frame() -> tuple[pd.DataFrame, int]:
return pd.DataFrame(values), dataset_year
+def _hash_file(path: Path) -> str:
+ sha = hashlib.sha256()
+ with path.open("rb") as handle:
+ for chunk in iter(lambda: handle.read(1024 * 1024), b""):
+ sha.update(chunk)
+ return sha.hexdigest()
+
+
+def _verify_uk_transfer_artifact(path: Path) -> Path:
+ """Confirm the artifact's sha256 matches the pinned snapshot value."""
+ actual = _hash_file(path)
+ if actual != UK_TRANSFER_DATASET_SHA256:
+ raise ValueError(
+ f"UK transfer dataset at {path} has sha256 {actual}, expected "
+ f"{UK_TRANSFER_DATASET_SHA256}. Update the artifact or the pinned "
+ "expected hash."
+ )
+ return path
+
+
+UK_TRANSFER_DOWNLOAD_TIMEOUT_SECONDS = 60
+
+
+def _download_uk_transfer_artifact(destination: Path) -> Path:
+ """Download the snapshot's UK transfer artifact from the pinned commit.
+
+ Streams the download to a per-process temp file alongside the destination,
+ sha256-verifies the temp file, and only atomically ``replace()``s the
+ destination on success. The per-process temp suffix avoids collisions
+ when two policybench processes share a cache directory. Raises
+ ``RuntimeError`` on network errors or HTTP failures and ``ValueError``
+ (via verification) on hash mismatch.
+ """
+ import tempfile
+
+ destination.parent.mkdir(parents=True, exist_ok=True)
+ request = urllib.request.Request(
+ UK_TRANSFER_DATASET_PINNED_URL,
+ headers={"User-Agent": "policybench"},
+ )
+ fd, tmp_name = tempfile.mkstemp(
+ prefix=destination.name + ".",
+ suffix=".part",
+ dir=str(destination.parent),
+ )
+ tmp_path = Path(tmp_name)
+ try:
+ with os.fdopen(fd, "wb") as handle:
+ try:
+ with urllib.request.urlopen(
+ request, timeout=UK_TRANSFER_DOWNLOAD_TIMEOUT_SECONDS
+ ) as response:
+ if getattr(response, "status", 200) >= 400:
+ raise RuntimeError(
+ f"Download of {UK_TRANSFER_DATASET_PINNED_URL} "
+ f"returned HTTP {response.status}."
+ )
+ for chunk in iter(lambda: response.read(1024 * 1024), b""):
+ if not chunk:
+ break
+ handle.write(chunk)
+ except (urllib.error.URLError, TimeoutError, OSError) as exc:
+ raise RuntimeError(
+ "Failed to download UK transfer dataset from "
+ f"{UK_TRANSFER_DATASET_PINNED_URL}: {exc}. Set "
+ "POLICYBENCH_UK_DATASET_PATH to a local copy or "
+ "POLICYBENCH_UK_DATASET_DOWNLOAD=0 to disable the "
+ "download step."
+ ) from exc
+ _verify_uk_transfer_artifact(tmp_path)
+ tmp_path.replace(destination)
+ return destination
+ except BaseException:
+ tmp_path.unlink(missing_ok=True)
+ raise
+
+
def get_uk_dataset_path() -> Path:
- """Locate the local public UK calibrated transfer dataset artifact."""
+ """Locate the UK calibrated transfer dataset published in policyengine-uk-data.
+
+ Resolution order: ``POLICYBENCH_UK_DATASET_PATH`` env var, then the
+ sibling ``policyengine-uk-data`` checkout or a local ``data/`` copy, then
+ a pinned-commit download from the public ``policyengine-uk-data`` GitHub
+ repo. The returned path is sha256-verified against the snapshot value.
+ Set ``POLICYBENCH_UK_DATASET_DOWNLOAD=0`` to disable the download step.
+ Set ``POLICYBENCH_UK_DATASET_CACHE`` to override the download cache root
+ (default ``~/.cache/policybench``).
+
+ If ``POLICYBENCH_UK_DATASET_PATH`` is set to a path that does not exist or
+ does not match the pinned sha256, this raises ``FileNotFoundError`` or
+ ``ValueError`` rather than silently falling through to the download path.
+ """
configured = os.environ.get("POLICYBENCH_UK_DATASET_PATH")
if configured:
path = Path(configured).expanduser()
- if path.exists():
- return path
+ if not path.exists():
+ raise FileNotFoundError(
+ f"POLICYBENCH_UK_DATASET_PATH={configured} does not exist. "
+ "Unset the variable or point it at a local copy of "
+ f"{UK_TRANSFER_DATASET_FILENAME} (sha256 "
+ f"{UK_TRANSFER_DATASET_SHA256})."
+ )
+ return _verify_uk_transfer_artifact(path)
for candidate in UK_DATASET_CANDIDATES:
if candidate.exists():
- return candidate
+ try:
+ return _verify_uk_transfer_artifact(candidate)
+ except ValueError:
+ continue
+
+ if os.environ.get("POLICYBENCH_UK_DATASET_DOWNLOAD", "1") != "0":
+ cache_root = Path(
+ os.environ.get(
+ "POLICYBENCH_UK_DATASET_CACHE",
+ Path.home() / ".cache" / "policybench",
+ )
+ ).expanduser()
+ cached = cache_root / UK_TRANSFER_DATASET_FILENAME
+ if cached.exists():
+ try:
+ return _verify_uk_transfer_artifact(cached)
+ except ValueError:
+ cached.unlink(missing_ok=True)
+ return _download_uk_transfer_artifact(cached)
searched = "\n".join(f"- {candidate}" for candidate in UK_DATASET_CANDIDATES)
raise FileNotFoundError(
- "Could not find a local UK calibrated transfer dataset. Set "
- "POLICYBENCH_UK_DATASET_PATH or place the artifact in one of:\n"
- f"{searched}"
+ "Could not find the UK calibrated transfer dataset. Either set "
+ "POLICYBENCH_UK_DATASET_PATH to a local copy of "
+ f"{UK_TRANSFER_DATASET_FILENAME} (sha256 "
+ f"{UK_TRANSFER_DATASET_SHA256}), enable downloads, or place the "
+ "artifact in one of:\n"
+ f"{searched}\n"
+ f"The artifact is published at {UK_TRANSFER_DATASET_PINNED_URL}."
)