Fix residual paper issues and scope site improvements #8

Open
MaxGhenis wants to merge 4 commits into main from fix-paper-residuals
@MaxGhenis
Contributor

Summary

Resolves the residual paper issues flagged in the post-merge review of the 2026-05-01 snapshot and adds a scoping document for next-round site improvements.

Paper / data residuals

  • UK transfer dataset. Replace the "public UK calibrated transfer artifact" wording with concrete provenance: the artifact is checked in to PolicyEngine/policyengine-uk-data at pinned commit 9514dfb7, sha256 199ebc61…. The manifest, paper text, and runtime metadata all reference the same pinned commit URL. policybench/scenarios.py now sha256-verifies the local file and falls back to a download from the pinned raw.githubusercontent.com URL when no local copy is available. POLICYBENCH_UK_DATASET_DOWNLOAD=0 disables the download fallback.
  • Validation framing. Replace "penny-level agreement for the vast majority of 2021 test cases" with the source's actual qualitative phrasing and an explicit note that we do not restate it as a single percentage.
  • Model-alias instability. Capture provider_response_id, provider_system_fingerprint, and provider_resolved_model in eval_no_tools predictions for runs after this snapshot, so future snapshots can pin alias resolutions. The manifest reproducibility note documents that the 2026-05-01 snapshot predates fingerprint capture and that most config.py model IDs are provider aliases.
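
The verify-then-fallback behavior described for the UK transfer dataset could look roughly like the sketch below. This is not the code in policybench/scenarios.py: `sha256_of`, `resolve_dataset`, and the injected `fetch` callable are hypothetical names, and treating an unset variable as "download allowed" is an assumption. Only the POLICYBENCH_UK_DATASET_DOWNLOAD variable name comes from this PR.

```python
import hashlib
import os
from pathlib import Path
from typing import Callable


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large artifacts are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resolve_dataset(
    local: Path,
    expected_sha256: str,
    fetch: Callable[[Path], None],  # hypothetical downloader that writes to `local`
) -> Path:
    """Return a hash-verified local copy, downloading only when allowed."""
    if not local.exists():
        # Assumption: an unset variable means the fallback is enabled.
        if os.environ.get("POLICYBENCH_UK_DATASET_DOWNLOAD", "1") == "0":
            raise FileNotFoundError(f"{local} is missing and download is disabled")
        fetch(local)
    actual = sha256_of(local)
    if actual != expected_sha256:
        raise ValueError(f"sha256 mismatch for {local}: got {actual}")
    return local
```

Passing the downloader in as a callable keeps the verification path testable without network access.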

Manuscript and docs cleanup

  • docs/ consolidation. Delete duplicate prose in docs/introduction.md, docs/methodology.md, docs/discussion.md, and docs/references.md. paper/index.qmd is now the canonical manuscript; docs/ keeps the operational runbook (results.md) and the benchmark card, and points to the rendered manuscript via a thin docs/paper.md.
  • Methodology refinements. Reframe the bounded score as step-credit by error band: because the four thresholds are nested, the mean of the four within-threshold indicators is mathematically equivalent to step partial credit. Expand the bootstrap caveat to enumerate the uncertainty sources that the household-resampling intervals do not capture (prompt variance, decoding stochasticity, provider drift, reference-output uncertainty).
  • Exclusion-fraction reporting. Report the Enhanced CPS exclusion fraction (27.0% of 41,314 source households fail the single-tax-unit / single-family / single-SPM-unit filter) and the UK exclusion fraction (0.1%). State explicitly that filing status is not in the prompt and is inferred by the reference computation from tax-unit role flags.
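
The step-credit equivalence can be checked numerically. The sketch below is illustrative only: the four threshold values are placeholders (this PR does not list the actual ones), and both function names are hypothetical.

```python
from bisect import bisect_left

# Placeholder relative-error thresholds, sorted ascending; not the paper's values.
THRESHOLDS = (0.01, 0.05, 0.10, 0.25)


def mean_of_thresholds(rel_error: float) -> float:
    """Average of the four 'error is within threshold' indicators."""
    return sum(rel_error <= t for t in THRESHOLDS) / len(THRESHOLDS)


def step_credit(rel_error: float) -> float:
    """Credit by error band: 1.0 in the tightest band, minus 0.25 per band missed."""
    # bisect_left counts the thresholds strictly below the error, i.e. those missed.
    bands_missed = bisect_left(THRESHOLDS, rel_error)
    return 1.0 - bands_missed / len(THRESHOLDS)


# The two formulations agree at band edges and inside every band.
for err in (0.0, 0.01, 0.03, 0.05, 0.07, 0.10, 0.20, 0.25, 0.30):
    assert mean_of_thresholds(err) == step_credit(err)
```

Nesting is what makes this work: an error that passes a tighter threshold passes every looser one, so the indicator sum only ever counts down band by band.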

New paper sections

  • @tbl-fed-state. US within-10% accuracy on federal refundable credits, state refundable credits, and the household-level joint, surfacing how marginal accuracy hides joint federal/state credit errors. Joint accuracy is consistently 5–15 percentage points below either marginal hit rate.
  • @tbl-impact-floor. Top three global ranks under household-equal impact-score floors of 0.0, 0.1, 0.3, 0.5, and 1.0, so readers can see whether the 0.3 default is load-bearing.
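
To see why a joint hit rate sits below both marginals, consider the toy calculation below. The data is made up and `hit_rates` is a hypothetical helper, not a function from policybench.

```python
def hit_rates(rows):
    """rows: (federal_hit, state_hit) booleans, one pair per household."""
    rows = list(rows)
    n = len(rows)
    federal = sum(f for f, _ in rows) / n
    state = sum(s for _, s in rows) / n
    # A household counts toward the joint only if both credits are within tolerance.
    joint = sum(f and s for f, s in rows) / n
    return federal, state, joint


# Made-up example: each credit is right for 3 of 4 households, but not the same 3.
households = [(True, True), (True, False), (False, True), (True, True)]
federal, state, joint = hit_rates(households)
# federal = 0.75, state = 0.75, joint = 0.5
```

The joint rate can never exceed the smaller marginal, and the gap grows when federal and state misses fall on different households.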

Site scoping

  • New docs/site_improvements_scope.md ranks improvements to policybench.org from tier 1 (open-set leakage banner, sensitivity selector, bootstrap rank intervals, per-model deep-dive page) through tiers 2–3 (cross-country compare, cost surfacing, scenario filtering) up to tier 4 (held-out protected leaderboard, live evaluation, country expansion). The recommended first PR combines the three tier-1 leaderboard items.

Verification

  • uv run pytest -q — 186 passed
  • uv run ruff check . — clean
  • uv run ruff format --check . — clean
  • bun run lint — clean
  • bun run build — clean
  • uv run python paper/render_paper.py — regenerated PDF + web
  • Manifest hashes refreshed for dashboard_export, rendered PDF, and web bundle to match the regenerated artifacts.

Test plan

  • CI passes on the branch
  • Spot-check the rendered PDF for the new federal+state and impact-floor tables
  • Verify manifest.json hashes still match committed files

🤖 Generated with Claude Code

@vercel

vercel Bot commented May 2, 2026

Project: policybench
Deployment: Ready
Actions: Preview, Comment
Updated (UTC): May 6, 2026 2:17pm

UK transfer dataset
- Replace "public UK calibrated transfer artifact" wording with concrete
  provenance: the artifact is checked in to PolicyEngine/policyengine-uk-data
  at pinned commit 9514dfb7, sha256 199ebc61. Manifest, paper, and
  runtime metadata all reference the same pinned commit URL.
- scenarios.py now sha256-verifies the local file and falls back to a
  download from the pinned raw.githubusercontent.com URL when no local
  copy is available. POLICYBENCH_UK_DATASET_DOWNLOAD=0 disables the
  download fallback.

Validation framing
- Replace "penny-level agreement for the vast majority of 2021 test
  cases" with the source's actual qualitative phrasing and an explicit
  note that we do not restate it as a single percentage.

Model-alias instability
- Capture provider_response_id, provider_system_fingerprint, and
  provider_resolved_model in eval_no_tools predictions for runs after
  this snapshot, so future snapshots can pin alias resolutions
  explicitly.
- Manifest reproducibility note documents that the 2026-05-01 snapshot
  predates fingerprint capture and that most config.py model IDs are
  provider aliases.
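
A minimal sketch of the capture step, assuming the provider response is available as a plain mapping. The three column names come from this commit; the response keys (`id`, `system_fingerprint`, `model`) are an assumption modeled on OpenAI-style chat completion payloads, and other providers may name or omit these fields differently. `provider_metadata` is a hypothetical helper.

```python
from typing import Any, Dict, Mapping, Optional

# Prediction column -> assumed key in the provider response payload.
PROVIDER_FIELDS = {
    "provider_response_id": "id",
    "provider_system_fingerprint": "system_fingerprint",
    "provider_resolved_model": "model",
}


def provider_metadata(response: Mapping[str, Any]) -> Dict[str, Optional[str]]:
    """Extract the three provenance columns, using None for absent fields."""
    return {column: response.get(key) for column, key in PROVIDER_FIELDS.items()}


# Hypothetical payload: the requested alias resolved to a dated model ID.
row = provider_metadata({"id": "resp-123", "model": "example-model-2026-04-15"})
# row["provider_system_fingerprint"] is None because the payload lacks that key.
```

Recording None rather than raising keeps rows comparable across providers that do not expose a fingerprint.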

docs/ consolidation
- Delete duplicate prose in docs/introduction.md, docs/methodology.md,
  docs/discussion.md, docs/references.md. paper/index.qmd is now the
  canonical manuscript; docs/ keeps the operational runbook and
  benchmark card and points to the rendered manuscript.
- Add docs/paper.md as a thin reading guide and update myst.yml.

Methodology and scope refinements
- Reframe the bounded score as step-credit by error band (paper
  methodology section). The mean-of-four-thresholds is mathematically
  equivalent to step partial credit because the thresholds are nested.
- Expand the bootstrap caveat to enumerate which uncertainty sources
  (prompt variance, decoding stochasticity, provider drift,
  reference-output uncertainty) the household-resampling intervals do
  not cover.
- Report the Enhanced CPS exclusion fraction (27.0% of 41,314 source
  households fail the single-tax-unit/single-family/single-SPM-unit
  filter) and the UK exclusion fraction (0.1%).
- State explicitly that filing status is not in the prompt and is
  inferred by the reference computation from tax-unit role flags.
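
For context on the bootstrap caveat, a household-resampling percentile interval can be sketched as follows. This is a generic illustration, not the project's implementation: `bootstrap_interval` and its defaults are assumptions.

```python
import random
from statistics import mean


def bootstrap_interval(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile interval for the mean score under household resampling."""
    rng = random.Random(seed)
    n = len(scores)
    # Each resample draws n households with replacement and re-averages.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Only the set of households varies between resamples. The prompts, the sampled model outputs, the provider's model version, and the reference values are all held fixed, which is why the interval says nothing about those four sources of uncertainty.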

New paper sections
- @tbl-fed-state: US within-10% accuracy on federal vs state refundable
  credits and the household-level joint, surfacing how marginal
  accuracy hides joint federal/state credit errors.
- @tbl-impact-floor: top three global ranks under household-equal
  impact-score floors of 0.0, 0.1, 0.3, 0.5, and 1.0, so readers can
  see whether the 0.3 default is load-bearing.

Site scoping
- New docs/site_improvements_scope.md ranks improvements to
  policybench.org from open-set leakage banner / sensitivity selector /
  bootstrap rank intervals (tier 1) through cross-country compare,
  per-model deep-dive pages, cost surfacing, scenario filtering, and
  protected leaderboard (tiers 2-4).

Verification
- uv run pytest -q  (186 passed)
- uv run ruff check .  (clean)
- uv run ruff format --check .  (clean)
- bun run lint  (clean)
- bun run build  (clean)
- uv run python paper/render_paper.py  (regenerated PDF + web)
- Manifest hashes refreshed for dashboard_export, rendered PDF, and web
  bundle to match the regenerated artifacts.
- eval_no_tools.run_no_tools_single_output_eval and its exception
  fallback now write provider_response_id, provider_system_fingerprint,
  and provider_resolved_model so subsequent runs through the per-output
  path match the chunked path.
- _download_uk_transfer_artifact downloads with a 60s urlopen timeout,
  streams to a .part sidecar, sha256-verifies the temp file, and only
  atomically replace()s the destination on success. Network errors and
  hash mismatches surface explicit messages pointing at the
  POLICYBENCH_UK_DATASET_PATH and POLICYBENCH_UK_DATASET_DOWNLOAD escape
  valves.
- get_uk_dataset_path raises FileNotFoundError when
  POLICYBENCH_UK_DATASET_PATH is set to a missing path rather than
  silently falling through to the download.
- impact_floor_sensitivity in paper/index.qmd now calls the canonical
  policybench.analysis.household_equal_impact_scores against the
  committed snapshot ground_truth + predictions.csv.gz, so
  @tbl-impact-floor uses the same household-equal impact statistic as
  us/uk_impact_summary_by_model.csv. The shadow output-group magnitude
  proxy is gone.
- Re-rendered PDF and web bundle; refreshed manifest hashes for the
  rendered paper artifacts and dashboard export.
- impact_floor_sensitivity guards an empty country_table result before
  pd.concat so a zero-coverage floor cannot crash the paper render.
- _download_uk_transfer_artifact uses tempfile.mkstemp for a unique
  per-process .part suffix in the cache directory, so simultaneous
  policybench processes can't clobber each other's mid-stream bytes.
  The except path unconditionally cleans up the temp file on any
  exception (including BaseException) so a failed verify or interrupt
  doesn't leave junk in the cache.
- Re-rendered PDF + web bundle and refreshed manifest hashes.
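
The temp-file pattern from the download bullets above can be sketched as below. This is an illustration, not _download_uk_transfer_artifact itself: `write_verified` is a hypothetical name, the network read (urlopen with a timeout) is left out and replaced by an iterable of byte chunks, and the hash is computed while streaming rather than by re-reading the temp file.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Iterable


def write_verified(chunks: Iterable[bytes], dest: Path, expected_sha256: str) -> None:
    """Stream bytes to a unique temp file, verify the hash, then publish atomically."""
    # mkstemp gives every process its own .part file in the destination directory.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".part")
    try:
        digest = hashlib.sha256()
        with os.fdopen(fd, "wb") as handle:
            for chunk in chunks:
                digest.update(chunk)
                handle.write(chunk)
        if digest.hexdigest() != expected_sha256:
            raise ValueError(f"sha256 mismatch for {dest.name}")
        # Atomic because the temp file lives on the same filesystem as dest.
        os.replace(tmp_name, dest)
    except BaseException:
        # Clean up on any failure, including KeyboardInterrupt, then re-raise.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Readers of `dest` therefore see either the previous file or a fully verified new one, never a partial download.
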
impact_floor_sensitivity now skips floors that miss either required
country or yield fewer than three global rows, rather than relying on
global_scores' single-country fallback or letting iloc[0..2] raise
IndexError when the table is empty.
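
The skip rule described above might look like the following in plain Python. This is a sketch under stated assumptions, not the code in paper/index.qmd: the nested-dict input shape, the `top_three_by_floor` name, and the unweighted mean used as the global score are all hypothetical stand-ins.

```python
from statistics import mean

REQUIRED_COUNTRIES = ("us", "uk")


def top_three_by_floor(scores_by_floor):
    """Map each usable floor to its top three models.

    scores_by_floor: {floor: {country: {model: score}}}
    """
    tables = {}
    for floor, by_country in scores_by_floor.items():
        # Skip floors that lose a whole country instead of ranking on one alone.
        if any(not by_country.get(country) for country in REQUIRED_COUNTRIES):
            continue
        shared = set.intersection(*(set(by_country[c]) for c in REQUIRED_COUNTRIES))
        # Skip floors that cannot fill three ranks rather than indexing past the end.
        if len(shared) < 3:
            continue
        # Assumption: global score is the unweighted mean of the country scores.
        global_score = {m: mean(by_country[c][m] for c in REQUIRED_COUNTRIES) for m in shared}
        ranked = sorted(shared, key=lambda m: (-global_score[m], m))
        tables[floor] = ranked[:3]
    return tables
```

A floor that is skipped simply has no entry in the result, so the caller can render the remaining floors without special-casing empty tables.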