diff --git a/app/public/paper/policybench.pdf b/app/public/paper/policybench.pdf
index d8c0c05..f03d083 100644
Binary files a/app/public/paper/policybench.pdf and b/app/public/paper/policybench.pdf differ
diff --git a/app/public/paper/web/index.html b/app/public/paper/web/index.html
index cb6a72d..9fb7475 100644
--- a/app/public/paper/web/index.html
+++ b/app/public/paper/web/index.html
@@ -159,6 +159,7 @@

Benchmark design

  • within 10%
  • For binary outputs, the score is exact accuracy. This keeps 100 as the ceiling while still giving partial credit for near misses on amount outputs. PolicyBench also tracks mean absolute error and related error metrics, but those are secondary to the bounded score. This choice preserves exact-match comparability while avoiding the failure mode that recent numeric-evaluation papers have criticized (Abbood et al. 2025).

+Because the four thresholds are nested (exact, within 1%, within 5%, within 10%), averaging their indicator functions is equivalent to assigning step credit by error band: predictions that match exactly receive 1.00, predictions inside 1% but not exact receive 0.75, predictions inside 5% but outside 1% receive 0.50, predictions inside 10% but outside 5% receive 0.25, and predictions outside 10% receive 0.00. The score is therefore a partial-credit rule that rewards tighter agreement more than looser agreement, not an unweighted aggregate of independent hit rates.

    All requested US outputs are annual amounts or annual eligibility indicators for tax year 2026; UK outputs are annual amounts or annual eligibility indicators for fiscal year 2026-27. For currency amounts, the “exact” hit rate means within 1 currency unit of the reference value after numeric parsing. Percentage-threshold hit rates use relative error when the reference value is nonzero and the same 1 currency-unit absolute tolerance when the reference value is zero. Binary outputs are parsed as 0 or 1 eligibility flags and scored by exact classification accuracy. Missing or unparseable numeric answers receive zero score for that requested output.
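As a minimal sketch of this per-output rule (the names and structure are illustrative, not the benchmark's implementation), the band logic can be written as:

```python
def score_output(pred, ref, is_binary, tol=1.0):
    """Score one requested output on a 0-100 scale (illustrative sketch)."""
    if pred is None:
        # Missing or unparseable numeric answers receive zero score.
        return 0.0
    if is_binary:
        # Eligibility flags: exact classification accuracy.
        return 100.0 if int(pred) == int(ref) else 0.0

    abs_err = abs(pred - ref)
    exact = abs_err <= tol  # "exact" means within 1 currency unit
    if ref != 0:
        rel_err = abs_err / abs(ref)
        # Bands are nested by construction, matching the step-credit reading above.
        bands = [exact, exact or rel_err <= 0.01, exact or rel_err <= 0.05, exact or rel_err <= 0.10]
    else:
        # Zero reference: percentage thresholds fall back to the 1-unit tolerance.
        bands = [exact] * 4
    # Averaging the four nested indicators yields credit of 1.00/0.75/0.50/0.25/0.00.
    return 100.0 * sum(bands) / 4
```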

    Aggregation proceeds in four steps. First, each household-output prediction receives a 0-100 score. Second, person-level coverage rows are averaged to their program group. Third, output groups are averaged with equal weight within each country. Fourth, global scores average the US and UK country scores for models run in both countries. The global score is therefore an equal-country summary for this benchmark sample, not a universally authoritative measure of tax-benefit competence. Household-equal impact scores and other alternative views are reported as sensitivity checks.
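A short sketch of the four-step aggregation, assuming a flat record layout with one row per household-output prediction (field names are illustrative, not the benchmark's actual schema):

```python
from collections import defaultdict
from statistics import mean

def aggregate(rows):
    """rows: dicts with 'country', 'group', and a 0-100 'score' per household-output
    prediction (step 1). Illustrative layout only."""
    # Step 2: average predictions (including person-level coverage rows) to their group.
    by_group = defaultdict(list)
    for r in rows:
        by_group[(r["country"], r["group"])].append(r["score"])
    group_scores = {k: mean(v) for k, v in by_group.items()}

    # Step 3: equal-weight average of output groups within each country.
    by_country = defaultdict(list)
    for (country, _), s in group_scores.items():
        by_country[country].append(s)
    country_scores = {c: mean(v) for c, v in by_country.items()}

    # Step 4: global score is the equal-country mean for models run in both countries.
    return country_scores, mean(country_scores.values())
```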

    The benchmark requires each model response to include numeric answers and one explanation per requested output. The headline score uses only the numeric answers. Explanations are retained for scenario exploration and qualitative error analysis; they should not be interpreted as faithful traces of model reasoning.

    @@ -295,35 +296,45 @@

Frozen snapshot an
-19 | UK scenario source | enhanced_cps_2025.h5 public UK calibrated transfer artifact
+19 | UK scenario source | enhanced_cps_2025.h5
+20 | UK scenario-source repository | PolicyEngine/policyengine-uk-data
+21 | UK scenario-source pinned commit | 9514dfb7ec607897c9f7122a2e073b922c9fd8b6
22 | UK scenario-source SHA-256 | 199ebc61d29231b4799ad337a95393765b5fb5aede1834b93ff2acecceded866
23 | Households | 100 US and 100 UK
24 | Models | 12 shared models
25 | Output groups | 19 US and 7 UK
26 | Condition | No tools, no web access, one structured response per household
27 | Response contract | Numeric answer and non-empty explanation for every requested output

@@ -491,7 +502,7 @@

    Frozen snapshot an

    Data and scenario construction

    United States

-The US benchmark is built from Enhanced Current Population Survey (CPS)-derived households using PolicyEngine US. The sampled households are filtered to keep a single-tax-unit structure while retaining variation in filing status, household composition, and income sources. Prompts include nonzero promptable raw inputs across relevant entities rather than a hand-curated summary, so the models see many of the same facts the simulator receives.

+The US benchmark is built from Enhanced Current Population Survey (CPS)-derived households using PolicyEngine US. The sampled households are filtered to keep a single-tax-unit, single-family, single-Supplemental Poverty Measure (SPM)-unit structure with at least one adult and a supported filing status. The 2024 Enhanced CPS source contains 41,314 households; 30,173 (73.0%) pass the filter and form the eligible draw. The 27.0% excluded by the filter include multi-tax-unit households (e.g., adult roommates), multi-family households, multi-SPM-unit households, and households whose head reports a filing status outside the supported set. These excluded compositions are exactly the kind of cases where federal/state credit allocations and benefit-unit rules become hardest, so the eligible draw is a tractable subset rather than the full distribution of US households. Prompts include nonzero promptable raw inputs across relevant entities rather than a hand-curated summary, so the models see many of the same facts the simulator receives. Filing status is not stated in the prompt; the reference computation infers it from tax-unit role flags. Models therefore see the same household facts that drive the reference filing-status assignment, but they do not receive that assignment as a label.
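A hedged sketch of the eligibility filter described above; the field names are hypothetical stand-ins for the PolicyEngine entity counts, and the supported filing statuses are passed in rather than restated here:

```python
def eligible(household, supported_statuses):
    """Illustrative filter predicate for the US eligible draw."""
    return (
        household["n_tax_units"] == 1        # single tax unit
        and household["n_families"] == 1     # single family
        and household["n_spm_units"] == 1    # single SPM unit
        and household["n_adults"] >= 1       # at least one adult
        and household["head_filing_status"] in supported_statuses
    )
```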

    The current US release evaluates 19 output groups spanning federal income tax, refundable credits, payroll and self-employment tax, state and local income tax, Supplemental Nutrition Assistance Program (SNAP), Supplemental Security Income (SSI), Temporary Assistance for Needy Families (TANF), Affordable Care Act (ACA) premium tax credits, school-meal eligibility, and person-level coverage eligibility for the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC), Medicaid, the Children’s Health Insurance Program (CHIP), Medicare, Head Start, and Early Head Start.

    The output scope is intentionally narrower than the full PolicyEngine model. Table 3 summarizes the inclusion rule. The benchmark asks for WIC eligibility rather than a WIC dollar amount; WIC dollar values are used only as impact-weight proxies for coverage flags, not as requested model outputs.

    @@ -561,10 +572,10 @@

    United States

    United Kingdom

-The UK benchmark is built from a calibrated public transfer dataset scored through PolicyEngine UK. The current public build starts from a public export of benchmark-compatible households from PolicyEngine US Enhanced CPS, maps those records into UK-facing inputs, and recalibrates them to selected UK targets. This creates a public UK-policy transfer benchmark without publishing restricted household microdata; it is not a representative evaluation over native UK household records and should not be treated as a substitute for Family Resources Survey (FRS)-based UK microdata. The current UK release evaluates seven outputs: Income Tax, National Insurance, Capital Gains Tax, Child Benefit, Universal Credit, Pension Credit, and Personal Independence Payment (PIP). Outputs that depend on status or award facts use prompt-visible facts rather than hidden take-up labels; for example, PIP-positive cases list the daily living and mobility award components used by PolicyEngine.

+The UK benchmark is built from a calibrated public transfer dataset scored through PolicyEngine UK. The current public build starts from a public export of benchmark-compatible households from PolicyEngine US Enhanced CPS, maps those records into UK-facing inputs, and recalibrates them to selected UK targets. The resulting enhanced_cps_2025.h5 artifact is checked in to the public PolicyEngine/policyengine-uk-data GitHub repository; the manuscript pins commit 9514dfb7ec607897c9f7122a2e073b922c9fd8b6 so that a third party can retrieve the exact file used here. The artifact contains 28,532 households; 28,502 (99.9%) pass the eligibility filter that retains households with one benefit unit and one or two adults. This creates a public UK-policy transfer benchmark without publishing restricted household microdata; it is not a representative evaluation over native UK household records and should not be treated as a substitute for Family Resources Survey (FRS)-based UK microdata. The current UK release evaluates seven outputs: Income Tax, National Insurance, Capital Gains Tax, Child Benefit, Universal Credit, Pension Credit, and Personal Independence Payment (PIP). Outputs that depend on status or award facts use prompt-visible facts rather than hidden take-up labels; for example, PIP-positive cases list the daily living and mobility award components used by PolicyEngine.
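For a third party re-checking the artifact, a minimal verification sketch is to hash the retrieved file and compare it against the published digest from the frozen-snapshot table; the local filename below assumes the file has already been downloaded from the pinned commit:

```python
import hashlib

EXPECTED_SHA256 = "199ebc61d29231b4799ad337a95393765b5fb5aede1834b93ff2acecceded866"

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file so large HDF5 artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Assumes enhanced_cps_2025.h5 was fetched from PolicyEngine/policyengine-uk-data
# at commit 9514dfb7ec607897c9f7122a2e073b922c9fd8b6.
assert sha256_of("enhanced_cps_2025.h5") == EXPECTED_SHA256
```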

    The UK data path is more synthetic than the enhanced FRS pipeline and inherits limitations from cross-country transfer, calibration choices, and the subset of variables that can be made prompt-visible. It supports the current public cross-country benchmark, but it is not equivalent to an enhanced-FRS-based benchmark and should not be used to make population-representative claims about UK households (Sutherland and Figari 2013).

    Reference-output credibility

-PolicyBench treats PolicyEngine outputs as benchmark reference outputs, not as administrative records. The reference source is nevertheless stronger than an ad hoc answer key: PolicyEngine is open source, used for household calculators and reform analysis, and externally checked in specific domains. In the UK, No. 10’s data science team adapted PolicyEngine’s open-source microsimulation model for experimental policy simulation, with validation against external projections before use (Woodruff 2026; Ghenis 2026). In the US, PolicyEngine’s state tax modelling has been validated against the National Bureau of Economic Research (NBER) TAXSIM model, with reported penny-level agreement for the vast majority of 2021 test cases (PolicyEngine 2024; Feenberg and Coutts 1993). PolicyEngine has also signed a memorandum of understanding with the Federal Reserve Bank of Atlanta for future validation work against the Atlanta Fed’s Policy Rules Database (Ghenis and Makarchuk 2025; Federal Reserve Bank of Atlanta 2026). The Atlanta Fed sources are a caveat rather than evidence of completed validation for this benchmark: they document planned collaboration and the comparison source, not finished checks of the frozen PolicyBench outputs. Taken together, these sources support using PolicyEngine as a transparent reference implementation with partial external validation, but they do not validate every benchmark output.

+PolicyBench treats PolicyEngine outputs as benchmark reference outputs, not as administrative records. The reference source is nevertheless stronger than an ad hoc answer key: PolicyEngine is open source, used for household calculators and reform analysis, and externally checked in specific domains. In the UK, No. 10’s data science team adapted PolicyEngine’s open-source microsimulation model for experimental policy simulation, with validation against external projections before use (Woodruff 2026; Ghenis 2026). In the US, PolicyEngine reports matching the National Bureau of Economic Research (NBER) TAXSIM-35 model to the cent on the vast majority of cases for the 2021 tax year across hundreds of thousands of tax units per state, with state-specific differences documented in PolicyEngine’s integration tests (PolicyEngine 2024; Feenberg and Coutts 1993). We do not restate that comparison as a single percentage because the published source uses qualitative phrasing rather than a headline accuracy number. PolicyEngine has also signed a memorandum of understanding with the Federal Reserve Bank of Atlanta for future validation work against the Atlanta Fed’s Policy Rules Database (Ghenis and Makarchuk 2025; Federal Reserve Bank of Atlanta 2026). The Atlanta Fed sources are a caveat rather than evidence of completed validation for this benchmark: they document planned collaboration and the comparison source, not finished checks of the frozen PolicyBench outputs. Taken together, these sources support using PolicyEngine as a transparent reference implementation with partial external validation, but they do not validate every benchmark output.

    This does not make PolicyEngine infallible. During benchmark development, we performed a manual, developer-led discrepancy review, with LLM assistance used to triage and summarize candidate cases surfaced by model explanations. Table 4 summarizes the main discrepancy classes reviewed before the frozen snapshot. In those reviewed classes, discrepancies reflected model mistakes, prompt ambiguity later corrected, or upstream data/model issues fixed before the frozen snapshot; the reviewed discrepancy classes did not identify unresolved PolicyEngine reference-output defects. This is a qualitative development audit, not an independent sampled validation study or exhaustive validation of every reference value.

    @@ -1129,16 +1140,95 @@

    Sensitivity to benchmark

-Sampling uncertainty

-The manuscript snapshot uses 100 households per country, so small score gaps should not be overinterpreted. Table 10 reports a household-resampling bootstrap over the frozen sample using 1,000 percentile bootstrap draws. It resamples households within each country, recomputes country scores with equal output-group weighting, then recomputes the equal-country global score. The table is a sampling-uncertainty check for this household draw, not uncertainty over future model releases or prompt variants.

+The household-equal impact score weights each output group inside a household by a blend of an equal-weight floor and the absolute reference-value share. The default uses a 0.3 floor. Table 10 reports the top three global ranks under floors of 0.0, 0.1, 0.3, 0.5, and 1.0. Floor 1.0 collapses to the equal-output baseline; floor 0.0 is pure dollar-share weighting. Rank stability across this range means the impact view is not driven by the specific floor choice.

+Table 10: Top global ranks under varying household-equal impact-score floors.
+
+Floor | Rank 1 | Rank 2 | Rank 3
+0.0 | gpt-5.5 (43.6) | gemini-3.1-pro-preview (37.5) | grok-4.20 (37.0)
+0.1 | gpt-5.5 (47.6) | gemini-3.1-pro-preview (42.0) | grok-4.20 (41.6)
+0.3 | gpt-5.5 (55.7) | gemini-3.1-pro-preview (51.1) | grok-4.20 (50.7)
+0.5 | gpt-5.5 (63.8) | gemini-3.1-pro-preview (60.1) | grok-4.20 (59.8)
+1.0 | gpt-5.5 (83.9) | gemini-3.1-pro-preview (82.7) | grok-4.20 (82.5)
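The blending formula itself is not written out in the text, so the following sketch is one natural reading of "a blend of an equal-weight floor and the absolute reference-value share", not the benchmark's verbatim implementation:

```python
def impact_weights(ref_values, floor=0.3):
    """Candidate blended weights for the output groups inside one household.
    ref_values: reference amounts for that household's output groups."""
    n = len(ref_values)
    total = sum(abs(v) for v in ref_values)
    if total == 0:
        return [1 / n] * n  # all-zero household: fall back to equal weights
    return [floor * (1 / n) + (1 - floor) * abs(v) / total for v in ref_values]

# floor=1.0 collapses to equal-output weights; floor=0.0 is pure dollar-share weighting,
# matching the endpoints described in the paragraph above.
```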

+Sampling uncertainty

+The manuscript snapshot uses 100 households per country, so small score gaps should not be overinterpreted. Table 11 reports a household-resampling bootstrap over the frozen sample using 1,000 percentile bootstrap draws. It resamples households within each country, recomputes country scores with equal output-group weighting, then recomputes the equal-country global score. The intervals therefore quantify uncertainty from the specific 100-household draw used here. They do not capture (i) prompt-template variance, since the manuscript uses a single template per country with no paraphrase resamples; (ii) decoding stochasticity, since each model’s snapshot prediction is a single sample at the provider’s default decoding settings; (iii) provider-side drift after 2026-05-01 in alias-resolved model weights; or (iv) reference-output uncertainty inside PolicyEngine. Total uncertainty over those sources is wider than the household-bootstrap intervals.
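A sketch of the household-resampling bootstrap under an assumed data layout (one per-group score per household, per country, for a single model); the 95% interval level is illustrative, since the text reports 1,000 percentile draws without restating the interval level here:

```python
import random
from collections import defaultdict
from statistics import mean

def bootstrap_global(rows_by_country, draws=1000, seed=0):
    """rows_by_country: {"US": {hh_id: {group: score, ...}, ...}, "UK": {...}}.
    Illustrative layout, not the benchmark's actual data structure."""
    rng = random.Random(seed)
    globals_ = []
    for _ in range(draws):
        country_scores = []
        for rows in rows_by_country.values():
            ids = list(rows)
            sample = [rng.choice(ids) for _ in ids]  # resample households with replacement
            by_group = defaultdict(list)
            for hh in sample:
                for group, score in rows[hh].items():
                    by_group[group].append(score)
            # Equal output-group weighting within the country.
            country_scores.append(mean(mean(v) for v in by_group.values()))
        globals_.append(mean(country_scores))  # equal-country global score
    globals_.sort()
    return globals_[int(0.025 * draws)], globals_[int(0.975 * draws)]  # 95% percentile interval
```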

-Table 10: Household-resampling uncertainty for the global leaderboard.
+Table 11: Household-resampling uncertainty for the global leaderboard.

+Model | Federal within 10% | State within 10% | Joint within 10%
+grok-4.20 | 95.0 | 89.0 | 84.0
+gpt-5.5 | 96.0 | 86.0 | 82.0
+grok-4.3 | 90.0 | 88.0 | 81.0
+gpt-5.4-nano | 87.0 | 86.0 | 79.0
+gpt-5.4-mini | 87.0 | 86.0 | 78.0
+gemini-3.1-flash-lite-preview | 87.0 | 86.0 | 77.0
+gemini-3.1-pro-preview | 86.0 | 85.0 | 76.0
+gemini-3-flash-preview | 84.0 | 85.0 | 75.0
+claude-sonnet-4.6 | 81.0 | 86.0 | 72.0
+claude-opus-4.7 | 77.0 | 78.0 | 65.0
+grok-4.1-fast | 72.0 | 87.0 | 64.0
+claude-haiku-4.5 | 69.0 | 83.0 | 63.0

    Fourth, structured-output reliability can affect rankings. The current manuscript snapshot has full parse coverage for the included models, but the benchmark still tracks coverage because failures to return parseable numeric values should count as benchmark failures (Shorten et al. 2024).

    Limitations

    PolicyBench is not a substitute for a production tax-and-benefit calculator. Several caveats matter: