Fix residual paper issues and scope site improvements by MaxGhenis · Pull Request #8 · PolicyEngine/policybench

MaxGhenis · 2026-05-02T12:49:03Z

Summary

Resolves the residual paper issues flagged in the post-merge review of the 2026-05-01 snapshot and adds a scoping document for next-round site improvements.

Paper / data residuals

UK transfer dataset. Replace the "public UK calibrated transfer artifact" wording with concrete provenance: the artifact is checked in to PolicyEngine/policyengine-uk-data at pinned commit 9514dfb7, sha256 199ebc61…. The manifest, paper text, and runtime metadata all reference the same pinned commit URL. policybench/scenarios.py now sha256-verifies the local file and falls back to a download from the pinned raw.githubusercontent.com URL when no local copy is available. POLICYBENCH_UK_DATASET_DOWNLOAD=0 disables the download fallback.
Validation framing. Replace "penny-level agreement for the vast majority of 2021 test cases" with the source's actual qualitative phrasing and an explicit note that we do not restate it as a single percentage.
Model-alias instability. Capture provider_response_id, provider_system_fingerprint, and provider_resolved_model in eval_no_tools predictions for runs after this snapshot, so future snapshots can pin alias resolutions. The manifest reproducibility note documents that the 2026-05-01 snapshot predates fingerprint capture and that most config.py model IDs are provider aliases.
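The alias-pinning idea above can be sketched as a small mapping from provider response fields to prediction columns. This is only an illustration: the three column names come from this PR, but the source keys (`id`, `system_fingerprint`, `model`) are an assumption modeled on OpenAI-style chat completion payloads, and `provider_metadata` is a hypothetical helper, not the function in eval_no_tools.

```python
from typing import Any, Dict, Mapping, Optional

# Prediction column -> response key. The keys on the right are assumed,
# not taken from the policybench source.
PROVIDER_FIELDS = {
    "provider_response_id": "id",
    "provider_system_fingerprint": "system_fingerprint",
    "provider_resolved_model": "model",
}


def provider_metadata(response: Mapping[str, Any]) -> Dict[str, Optional[str]]:
    """Extract provenance columns; fields the provider omits become None."""
    return {column: response.get(key) for column, key in PROVIDER_FIELDS.items()}
```

Recording `None` rather than dropping the column keeps the schema identical across providers that do not return a fingerprint.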

Manuscript and docs cleanup

docs/ consolidation. Delete duplicate prose in docs/introduction.md, docs/methodology.md, docs/discussion.md, docs/references.md. paper/index.qmd is now the canonical manuscript; docs/ keeps the operational runbook (results.md) and benchmark card and points to the rendered manuscript via a thin docs/paper.md.
Methodology refinements. Reframe the bounded score as step-credit by error band (the mean of four nested thresholds is mathematically equivalent to step partial credit). Expand the bootstrap caveat to enumerate the uncertainty sources the household-resampling intervals do not capture (prompt variance, decoding stochasticity, provider drift, reference-output uncertainty).
Exclusion-fraction reporting. Report the Enhanced CPS exclusion fraction (27.0% of 41,314 source households fail the single-tax-unit / single-family / single-SPM-unit filter) and the UK exclusion fraction (0.1%). State explicitly that filing status is not in the prompt and is inferred by the reference computation from tax-unit role flags.
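The step-credit equivalence noted under methodology refinements can be checked numerically. The sketch below uses hypothetical thresholds (1%, 5%, 10%, 25% relative error); the paper's actual bands are not restated in this PR, and both function names are illustrative.

```python
from bisect import bisect_left

# Hypothetical nested relative-error thresholds, sorted ascending.
THRESHOLDS = (0.01, 0.05, 0.10, 0.25)


def mean_of_thresholds(rel_error: float) -> float:
    """Average of the four within-threshold indicators."""
    return sum(rel_error <= t for t in THRESHOLDS) / len(THRESHOLDS)


def step_credit(rel_error: float) -> float:
    """Credit by error band: lose 1/4 for each threshold the error exceeds."""
    # bisect_left counts thresholds strictly below the error, i.e. those missed.
    bands_missed = bisect_left(THRESHOLDS, rel_error)
    return 1 - bands_missed / len(THRESHOLDS)
```

Because the thresholds are nested, passing a tighter one implies passing every looser one, so the indicator average can only take the values 1, 0.75, 0.5, 0.25, and 0.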

New paper sections

@tbl-fed-state. US within-10% accuracy on federal refundable credits, state refundable credits, and the household-level joint, surfacing how marginal accuracy hides joint federal/state credit errors. Joint accuracy is consistently 5–15 points below either marginal hit rate.
@tbl-impact-floor. Top three global ranks under household-equal impact-score floors of 0.0, 0.1, 0.3, 0.5, and 1.0 — so readers can see whether the 0.3 default is load-bearing.
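To see why a joint hit rate can sit below both marginals, as @tbl-fed-state reports, consider a minimal sketch on synthetic data. The inputs are made-up per-household booleans (within 10% or not), and `hit_rates` is an illustrative helper rather than the paper's code.

```python
from typing import Dict, Sequence


def hit_rates(federal_hits: Sequence[bool], state_hits: Sequence[bool]) -> Dict[str, float]:
    """Marginal and household-level joint hit rates for paired outcomes."""
    n = len(federal_hits)
    # A household counts toward the joint only if both credits are hits.
    joint = [f and s for f, s in zip(federal_hits, state_hits)]
    return {
        "federal": sum(federal_hits) / n,
        "state": sum(state_hits) / n,
        "joint": sum(joint) / n,
    }


# Synthetic example: each credit misses on a different household.
rates = hit_rates(
    [True, True, True, False, True],
    [True, False, True, True, True],
)
```

Here both marginals are 0.8 but the joint is 0.6, because the misses fall on different households. The joint can never exceed the smaller marginal.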

Site scoping

New docs/site_improvements_scope.md ranks improvements to policybench.org from tier 1 (open-set leakage banner, sensitivity selector, bootstrap rank intervals, per-model deep-dive page) through tier 2–3 (cross-country compare, cost surfacing, scenario filtering) up to tier 4 (held-out protected leaderboard, live evaluation, country expansion). Recommended first PR combines the three tier-1 leaderboard items.

Verification

uv run pytest -q — 186 passed
uv run ruff check . — clean
uv run ruff format --check . — clean
bun run lint — clean
bun run build — clean
uv run python paper/render_paper.py — regenerated PDF + web
Manifest hashes refreshed for dashboard_export, rendered PDF, and web bundle to match the regenerated artifacts.

Test plan

CI passes on the branch
Spot-check the rendered PDF for the new federal+state and impact-floor tables
Verify manifest.json hashes still match committed files

🤖 Generated with Claude Code

vercel · 2026-05-02T12:49:08Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
policybench	Ready	Preview, Comment	May 6, 2026 2:17pm

UK transfer dataset
- Replace "public UK calibrated transfer artifact" wording with concrete provenance: the artifact is checked in to PolicyEngine/policyengine-uk-data at pinned commit 9514dfb7, sha256 199ebc61. Manifest, paper, and runtime metadata all reference the same pinned commit URL.
- scenarios.py now sha256-verifies the local file and falls back to a download from the pinned raw.githubusercontent.com URL when no local copy is available. POLICYBENCH_UK_DATASET_DOWNLOAD=0 disables the download fallback.

Validation framing
- Replace "penny-level agreement for the vast majority of 2021 test cases" with the source's actual qualitative phrasing and an explicit note that we do not restate it as a single percentage.

Model-alias instability
- Capture provider_response_id, provider_system_fingerprint, and provider_resolved_model in eval_no_tools predictions for runs after this snapshot, so future snapshots can pin alias resolutions explicitly.
- Manifest reproducibility note documents that the 2026-05-01 snapshot predates fingerprint capture and that most config.py model IDs are provider aliases.

docs/ consolidation
- Delete duplicate prose in docs/introduction.md, docs/methodology.md, docs/discussion.md, docs/references.md. paper/index.qmd is now the canonical manuscript; docs/ keeps the operational runbook and benchmark card and points to the rendered manuscript.
- Add docs/paper.md as a thin reading guide and update myst.yml.

Methodology and scope refinements
- Reframe the bounded score as step-credit by error band (paper methodology section). The mean-of-four-thresholds is mathematically equivalent to step partial credit because the thresholds are nested.
- Expand the bootstrap caveat to enumerate which uncertainty sources (prompt variance, decoding stochasticity, provider drift, reference-output uncertainty) the household-resampling intervals do not cover.
- Report the Enhanced CPS exclusion fraction (27.0% of 41,314 source households fail the single-tax-unit/single-family/single-SPM-unit filter) and the UK exclusion fraction (0.1%).
- State explicitly that filing status is not in the prompt and is inferred by the reference computation from tax-unit role flags.

New paper sections
- @tbl-fed-state: US within-10% accuracy on federal vs state refundable credits and the household-level joint, surfacing how marginal accuracy hides joint federal/state credit errors.
- @tbl-impact-floor: top three global ranks under household-equal impact-score floors of 0.0, 0.1, 0.3, 0.5, and 1.0, so readers can see whether the 0.3 default is load-bearing.

Site scoping
- New docs/site_improvements_scope.md ranks improvements to policybench.org from open-set leakage banner / sensitivity selector / bootstrap rank intervals (tier 1) through cross-country compare, per-model deep-dive pages, cost surfacing, scenario filtering, and protected leaderboard (tiers 2-4).

Verification
- uv run pytest -q (186 passed)
- uv run ruff check . (clean)
- uv run ruff format --check . (clean)
- bun run lint (clean)
- bun run build (clean)
- uv run python paper/render_paper.py (regenerated PDF + web)
- Manifest hashes refreshed for dashboard_export, rendered PDF, and web bundle to match the regenerated artifacts.

- eval_no_tools.run_no_tools_single_output_eval and its exception fallback now write provider_response_id, provider_system_fingerprint, and provider_resolved_model so subsequent runs through the per-output path match the chunked path.
- _download_uk_transfer_artifact downloads with a 60s urlopen timeout, streams to a .part sidecar, sha256-verifies the temp file, and only atomically replace()s the destination on success. Network errors and hash mismatches surface explicit messages pointing at the POLICYBENCH_UK_DATASET_PATH and POLICYBENCH_UK_DATASET_DOWNLOAD escape valves.
- get_uk_dataset_path raises FileNotFoundError when POLICYBENCH_UK_DATASET_PATH is set to a missing path rather than silently falling through to the download.
- impact_floor_sensitivity in paper/index.qmd now calls the canonical policybench.analysis.household_equal_impact_scores against the committed snapshot ground_truth + predictions.csv.gz, so @tbl-impact-floor uses the same household-equal impact statistic as us/uk_impact_summary_by_model.csv. The shadow output-group magnitude proxy is gone.
- Re-rendered PDF and web bundle; refreshed manifest hashes for the rendered paper artifacts and dashboard export.
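The path-resolution behavior described above might look roughly like the following. It is a sketch under assumptions: the two environment variable names come from this PR, while `resolve_dataset`, its signature, the `None` return that signals "download needed", and raising `ValueError` on a local hash mismatch are all hypothetical choices rather than the behavior of get_uk_dataset_path.

```python
import hashlib
import os
from pathlib import Path
from typing import Mapping, Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts are not read into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resolve_dataset(
    default_path: Path,
    expected_sha256: str,
    env: Mapping[str, str] = os.environ,
) -> Optional[Path]:
    """Return a verified local path, or None when a download should be tried."""
    override = env.get("POLICYBENCH_UK_DATASET_PATH")
    if override:
        path = Path(override)
        # An explicit override that is missing is an error, not a fallthrough.
        if not path.is_file():
            raise FileNotFoundError(f"POLICYBENCH_UK_DATASET_PATH is missing: {path}")
    else:
        path = default_path
    if path.is_file():
        if sha256_of(path) != expected_sha256:
            raise ValueError(f"sha256 mismatch for {path}")  # assumed behavior
        return path
    if env.get("POLICYBENCH_UK_DATASET_DOWNLOAD", "1") == "0":
        raise FileNotFoundError("no local copy and download fallback disabled")
    return None
```

Failing fast on a missing override avoids the surprising case where a typo in the path silently triggers a network download.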

- impact_floor_sensitivity guards an empty country_table result before pd.concat so a zero-coverage floor cannot crash the paper render.
- _download_uk_transfer_artifact uses tempfile.mkstemp for a unique per-process .part suffix in the cache directory, so simultaneous policybench processes can't clobber each other's mid-stream bytes. The except path unconditionally cleans up the temp file on any exception (including BaseException) so a failed verify or interrupt doesn't leave junk in the cache.
- Re-rendered PDF + web bundle and refreshed manifest hashes.
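The download hardening described in these commits follows a common write-to-temp, verify, then rename pattern. A minimal sketch is below; `fetch_chunks` and `write_verified` are hypothetical names, and the hash is computed while streaming rather than by re-reading the temp file, which is a simplification of what the commit message describes.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Iterable, Iterator
from urllib.request import urlopen


def fetch_chunks(url: str, timeout: float = 60.0, chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Stream a URL in chunks with a socket timeout."""
    with urlopen(url, timeout=timeout) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                return
            yield chunk


def write_verified(chunks: Iterable[bytes], dest: Path, expected_sha256: str) -> Path:
    """Write chunks to a unique temp file, verify the hash, then replace dest."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # mkstemp gives each process its own file, so concurrent runs cannot collide.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, prefix=dest.name + ".", suffix=".part")
    tmp_path = Path(tmp_name)
    try:
        digest = hashlib.sha256()
        with os.fdopen(fd, "wb") as handle:
            for chunk in chunks:
                digest.update(chunk)
                handle.write(chunk)
        if digest.hexdigest() != expected_sha256:
            raise ValueError(f"sha256 mismatch for {dest.name}")
        os.replace(tmp_path, dest)  # atomic within one filesystem
    except BaseException:
        # Also covers KeyboardInterrupt, so no partial file is left behind.
        tmp_path.unlink(missing_ok=True)
        raise
    return dest
```

Creating the temp file in the destination directory keeps the final `os.replace` on one filesystem, which is what makes the swap atomic.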

impact_floor_sensitivity now skips floors that miss either required country or yield fewer than three global rows, rather than relying on global_scores' single-country fallback or letting iloc[0..2] raise IndexError when the table is empty.
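The skip logic can be illustrated without pandas. In this sketch the nested-dict input shape, the `top_ranks_by_floor` name, and the use of a simple cross-country mean as the global score are all assumptions made for illustration; only the two skip conditions come from the commit message.

```python
from typing import Dict, List

REQUIRED_COUNTRIES = ("uk", "us")
TOP_N = 3

Scores = Dict[float, Dict[str, Dict[str, float]]]  # floor -> country -> model -> score


def top_ranks_by_floor(scores_by_floor: Scores) -> Dict[float, List[str]]:
    """Top-N models per floor, skipping floors with incomplete coverage."""
    result: Dict[float, List[str]] = {}
    for floor, by_country in scores_by_floor.items():
        # Skip floors where a required country has no scored rows.
        if any(not by_country.get(c) for c in REQUIRED_COUNTRIES):
            continue
        models = set.intersection(*(set(by_country[c]) for c in REQUIRED_COUNTRIES))
        global_scores = {
            m: sum(by_country[c][m] for c in REQUIRED_COUNTRIES) / len(REQUIRED_COUNTRIES)
            for m in models
        }
        # Skip floors that cannot fill all top-N ranks.
        if len(global_scores) < TOP_N:
            continue
        ranked = sorted(global_scores, key=lambda m: (-global_scores[m], m))
        result[floor] = ranked[:TOP_N]
    return result
```

Skipping a floor outright keeps a single-country result from appearing in the table as if it were a global rank.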

MaxGhenis force-pushed the fix-paper-residuals branch from ec531d5 to 6dc2107 on May 6, 2026 10:35

vercel Bot deployed to Preview on May 6, 2026 10:38

vercel Bot deployed to Preview on May 6, 2026 12:44

vercel Bot deployed to Preview on May 6, 2026 13:47

Round-3 review followups (PR #8)

8f3376c


vercel Bot deployed to Preview on May 6, 2026 14:17

MaxGhenis wants to merge 4 commits into main from fix-paper-residuals
