Fix residual paper issues and scope site improvements#8
Open
UK transfer dataset
- Replace "public UK calibrated transfer artifact" wording with concrete provenance: the artifact is checked in to PolicyEngine/policyengine-uk-data at pinned commit 9514dfb7, sha256 199ebc61. Manifest, paper, and runtime metadata all reference the same pinned commit URL.
- scenarios.py now sha256-verifies the local file and falls back to a download from the pinned raw.githubusercontent.com URL when no local copy is available. POLICYBENCH_UK_DATASET_DOWNLOAD=0 disables the download fallback.

Validation framing
- Replace "penny-level agreement for the vast majority of 2021 test cases" with the source's actual qualitative phrasing and an explicit note that we do not restate it as a single percentage.

Model-alias instability
- Capture provider_response_id, provider_system_fingerprint, and provider_resolved_model in eval_no_tools predictions for runs after this snapshot, so future snapshots can pin alias resolutions explicitly.
- Manifest reproducibility note documents that the 2026-05-01 snapshot predates fingerprint capture and that most config.py model IDs are provider aliases.

docs/ consolidation
- Delete duplicate prose in docs/introduction.md, docs/methodology.md, docs/discussion.md, docs/references.md. paper/index.qmd is now the canonical manuscript; docs/ keeps the operational runbook and benchmark card and points to the rendered manuscript.
- Add docs/paper.md as a thin reading guide and update myst.yml.

Methodology and scope refinements
- Reframe the bounded score as step-credit by error band (paper methodology section). The mean-of-four-thresholds is mathematically equivalent to step partial credit because the thresholds are nested.
- Expand the bootstrap caveat to enumerate which uncertainty sources (prompt variance, decoding stochasticity, provider drift, reference-output uncertainty) the household-resampling intervals do not cover.
- Report the Enhanced CPS exclusion fraction (27.0% of 41,314 source households fail the single-tax-unit/single-family/single-SPM-unit filter) and the UK exclusion fraction (0.1%).
- State explicitly that filing status is not in the prompt and is inferred by the reference computation from tax-unit role flags.

New paper sections
- @tbl-fed-state: US within-10% accuracy on federal vs state refundable credits and the household-level joint, surfacing how marginal accuracy hides joint federal/state credit errors.
- @tbl-impact-floor: top three global ranks under household-equal impact-score floors of 0.0, 0.1, 0.3, 0.5, and 1.0, so readers can see whether the 0.3 default is load-bearing.

Site scoping
- New docs/site_improvements_scope.md ranks improvements to policybench.org from open-set leakage banner / sensitivity selector / bootstrap rank intervals (tier 1) through cross-country compare, per-model deep-dive pages, cost surfacing, scenario filtering, and protected leaderboard (tiers 2-4).

Verification
- uv run pytest -q (186 passed)
- uv run ruff check . (clean)
- uv run ruff format --check . (clean)
- bun run lint (clean)
- bun run build (clean)
- uv run python paper/render_paper.py (regenerated PDF + web)
- Manifest hashes refreshed for dashboard_export, rendered PDF, and web bundle to match the regenerated artifacts.
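The step-credit equivalence claimed in the commit message above can be checked with a small sketch. The four threshold values below are hypothetical placeholders (the actual error bands are defined in the paper, not here), and both function names are illustrative rather than taken from the policybench codebase.

```python
# Hypothetical nested relative-error thresholds; the paper's real bands may differ.
THRESHOLDS = (0.01, 0.05, 0.10, 0.25)


def mean_of_thresholds(rel_error: float) -> float:
    """Average of the four within-threshold hit indicators."""
    return sum(rel_error <= t for t in THRESHOLDS) / len(THRESHOLDS)


def step_credit(rel_error: float) -> float:
    """Partial credit determined by the tightest band the error falls into."""
    for i, t in enumerate(sorted(THRESHOLDS)):
        if rel_error <= t:
            # Landing in band i means passing this threshold and every looser one.
            return 1.0 - i / len(THRESHOLDS)
    return 0.0  # outside the loosest band


if __name__ == "__main__":
    for err in (0.005, 0.03, 0.07, 0.2, 0.5):
        print(err, mean_of_thresholds(err), step_credit(err))
```

Because the bands are nested, passing a tight threshold implies passing every looser one, so counting hits and locating the tightest band give the same score.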
ec531d5 to 6dc2107
- eval_no_tools.run_no_tools_single_output_eval and its exception fallback now write provider_response_id, provider_system_fingerprint, and provider_resolved_model so subsequent runs through the per-output path match the chunked path.
- _download_uk_transfer_artifact downloads with a 60s urlopen timeout, streams to a .part sidecar, sha256-verifies the temp file, and only atomically replace()s the destination on success. Network errors and hash mismatches surface explicit messages pointing at the POLICYBENCH_UK_DATASET_PATH and POLICYBENCH_UK_DATASET_DOWNLOAD escape valves.
- get_uk_dataset_path raises FileNotFoundError when POLICYBENCH_UK_DATASET_PATH is set to a missing path rather than silently falling through to the download.
- impact_floor_sensitivity in paper/index.qmd now calls the canonical policybench.analysis.household_equal_impact_scores against the committed snapshot ground_truth + predictions.csv.gz, so @tbl-impact-floor uses the same household-equal impact statistic as us/uk_impact_summary_by_model.csv. The shadow output-group magnitude proxy is gone.
- Re-rendered PDF and web bundle; refreshed manifest hashes for the rendered paper artifacts and dashboard export.
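The path-resolution behaviour described for get_uk_dataset_path might look roughly like the sketch below. Only the two environment variable names come from the commit message; the function name, the default_path argument, the None return meaning "caller should download", and the error raised when downloads are disabled are assumptions for illustration.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_dataset_path(
    default_path: Path, env: Mapping[str, str] = os.environ
) -> Optional[Path]:
    """Return a local dataset path, or None when the caller should download."""
    override = env.get("POLICYBENCH_UK_DATASET_PATH")
    if override:
        path = Path(override)
        if not path.exists():
            # An explicit override that is missing is an error, never a
            # silent fall-through to the download.
            raise FileNotFoundError(
                f"POLICYBENCH_UK_DATASET_PATH points at a missing file: {path}"
            )
        return path
    if default_path.exists():
        return default_path
    if env.get("POLICYBENCH_UK_DATASET_DOWNLOAD", "1") == "0":
        # Assumed behaviour: with the fallback disabled there is nothing left to try.
        raise FileNotFoundError(
            "No local UK dataset and POLICYBENCH_UK_DATASET_DOWNLOAD=0"
        )
    return None
```

Failing loudly on a bad override keeps a mistyped path from being masked by a successful download of a different file.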
- impact_floor_sensitivity guards an empty country_table result before pd.concat so a zero-coverage floor cannot crash the paper render.
- _download_uk_transfer_artifact uses tempfile.mkstemp for a unique per-process .part suffix in the cache directory, so simultaneous policybench processes can't clobber each other's mid-stream bytes. The except path unconditionally cleans up the temp file on any exception (including BaseException) so a failed verify or interrupt doesn't leave junk in the cache.
- Re-rendered PDF + web bundle and refreshed manifest hashes.
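The download pattern described above (unique temp file, sha256 check, atomic replace, cleanup on any exception) can be sketched with the standard library alone. This is not the body of _download_uk_transfer_artifact; the function names, the chunk size, and the ValueError on mismatch are assumptions.

```python
import hashlib
import os
import tempfile
from contextlib import suppress
from pathlib import Path
from typing import Iterable
from urllib.request import urlopen


def write_verified(chunks: Iterable[bytes], dest: Path, expected_sha256: str) -> Path:
    """Stream chunks to a unique .part file, verify, then atomically replace dest."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # mkstemp gives each process its own temp name in the same directory,
    # so concurrent writers cannot overwrite each other's partial bytes.
    fd, tmp_name = tempfile.mkstemp(
        dir=dest.parent, prefix=dest.name + ".", suffix=".part"
    )
    digest = hashlib.sha256()
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                digest.update(chunk)
                tmp.write(chunk)
        if digest.hexdigest() != expected_sha256:
            raise ValueError(f"sha256 mismatch for {dest.name}")
        os.replace(tmp_name, dest)  # atomic within one filesystem
    except BaseException:
        # Also covers KeyboardInterrupt, so no stray .part file is left behind.
        with suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
    return dest


def download_verified(
    url: str, dest: Path, expected_sha256: str, timeout: float = 60.0
) -> Path:
    """Fetch url with a timeout and hand the byte stream to write_verified."""
    with urlopen(url, timeout=timeout) as response:
        chunks = iter(lambda: response.read(1 << 16), b"")
        return write_verified(chunks, dest, expected_sha256)
```

Creating the temp file in the destination directory keeps the final os.replace on a single filesystem, which is what makes the swap atomic.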
impact_floor_sensitivity now skips floors that miss either required country or yield fewer than three global rows, rather than relying on global_scores' single-country fallback or letting iloc[0..2] raise IndexError when the table is empty.
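A minimal sketch of the skip logic described above, written against plain dictionaries rather than the pandas tables the paper code uses. The row shape, the REQUIRED_COUNTRIES constant, the function name, and the use of a cross-country mean as the global score are all assumptions for illustration; the real impact_floor_sensitivity calls policybench.analysis.household_equal_impact_scores, which is not reproduced here.

```python
from typing import Dict, List

REQUIRED_COUNTRIES = {"us", "uk"}  # assumed country codes

Row = Dict[str, object]  # hypothetical shape: {"country": str, "model": str, "score": float}


def top_three_by_floor(tables: Dict[float, List[Row]]) -> Dict[float, List[str]]:
    """Return the top three models per floor, skipping under-covered floors."""
    results: Dict[float, List[str]] = {}
    for floor, rows in sorted(tables.items()):
        countries = {row["country"] for row in rows}
        if not REQUIRED_COUNTRIES <= countries:
            continue  # a required country has no rows at this floor
        per_model: Dict[str, Dict[str, float]] = {}
        for row in rows:
            per_model.setdefault(row["model"], {})[row["country"]] = row["score"]
        # Assumed global score: mean over countries, only for models seen in all of them.
        global_scores = {
            model: sum(scores.values()) / len(scores)
            for model, scores in per_model.items()
            if REQUIRED_COUNTRIES <= set(scores)
        }
        if len(global_scores) < 3:
            continue  # too few global rows to report three ranks
        ranked = sorted(global_scores, key=global_scores.get, reverse=True)
        results[floor] = ranked[:3]
    return results
```

Skipping explicitly means a sparse floor is simply absent from the table, instead of being filled from a single country or raising on an empty slice.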
Summary
Resolves the residual paper issues flagged in the post-merge review of the 2026-05-01 snapshot and adds a scoping document for next-round site improvements.
Paper / data residuals
- The UK transfer artifact is checked in to PolicyEngine/policyengine-uk-data at pinned commit 9514dfb7, sha256 199ebc61…. The manifest, paper text, and runtime metadata all reference the same pinned commit URL. policybench/scenarios.py now sha256-verifies the local file and falls back to a download from the pinned raw.githubusercontent.com URL when no local copy is available. POLICYBENCH_UK_DATASET_DOWNLOAD=0 disables the download fallback.
- Capture provider_response_id, provider_system_fingerprint, and provider_resolved_model in eval_no_tools predictions for runs after this snapshot, so future snapshots can pin alias resolutions. The manifest reproducibility note documents that the 2026-05-01 snapshot predates fingerprint capture and that most config.py model IDs are provider aliases.

Manuscript and docs cleanup
- docs/ consolidation. Delete duplicate prose in docs/introduction.md, docs/methodology.md, docs/discussion.md, docs/references.md. paper/index.qmd is now the canonical manuscript; docs/ keeps the operational runbook (results.md) and benchmark card and points to the rendered manuscript via a thin docs/paper.md.

New paper sections
- @tbl-fed-state. US within-10% accuracy on federal refundable credits, state refundable credits, and the household-level joint, surfacing how marginal accuracy hides joint federal/state credit errors. Joint accuracy is consistently 5–15 points below either marginal hit rate.
- @tbl-impact-floor. Top three global ranks under household-equal impact-score floors of 0.0, 0.1, 0.3, 0.5, and 1.0, so readers can see whether the 0.3 default is load-bearing.

Site scoping
- docs/site_improvements_scope.md ranks improvements to policybench.org from tier 1 (open-set leakage banner, sensitivity selector, bootstrap rank intervals, per-model deep-dive page) through tier 2–3 (cross-country compare, cost surfacing, scenario filtering) up to tier 4 (held-out protected leaderboard, live evaluation, country expansion). Recommended first PR combines the three tier-1 leaderboard items.

Verification
- uv run pytest -q (186 passed)
- uv run ruff check . (clean)
- uv run ruff format --check . (clean)
- bun run lint (clean)
- bun run build (clean)
- uv run python paper/render_paper.py (regenerated PDF + web)
- Manifest hashes refreshed for dashboard_export, rendered PDF, and web bundle to match the regenerated artifacts.

Test plan
- manifest.json hashes still match committed files

🤖 Generated with Claude Code