Site tier-1: open-set banner, sensitivity selector, bootstrap intervals#9

Open
MaxGhenis wants to merge 5 commits into main from site-tier1



@MaxGhenis MaxGhenis commented May 2, 2026

Summary

Implements the tier-1 leaderboard improvements from docs/site_improvements_scope.md, plus a shared sticky header that the /paper page now reuses (matching the home page chrome, without the Global/US/UK view selector). Round-2 review feedback is incorporated: the bootstrap estimator now matches the canonical row-equal scoring rule, and the global view auto-falls-back to Main when an active sensitivity slice has no rows in one country.

Header

  • Extract SiteHeader from Hero. The new component owns the sticky brand + nav + view-selector + action-link layout and supports an alwaysExpanded mode for pages that don't drive their own collapse.
  • Hero refactored to wrap SiteHeader and pass the country-aware subtitle, stat strip, and snapshot pill as expandedContent. Drops the Top score stat and the Leading: <model> sidebar — the leaderboard itself is the canonical source for both.
  • /paper uses SiteHeader with alwaysExpanded, no view selector, and a Benchmark action link. The page body keeps its eyebrow/buttons/iframe; the inline H1 is gone since the header carries the brand.
  • Picked up the upstream parallel design tweak: dropped the inline "by [PE logo]" tagline next to the title and added a persistent "by PolicyEngine" pill at the right edge of the header.
  • Collapsed nav items and action link are no longer keyboard-focusable while hidden (tabIndex={-1}, aria-hidden); the scroll listener short-circuits on alwaysExpanded.
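
The focus handling in the last bullet can be sketched as a small pure helper. This is a minimal illustration, not the SiteHeader code: the function name and props shape are hypothetical, and the 0.05 cutoff is borrowed from the navOpacity threshold mentioned in the commit notes further down.

```typescript
// Hypothetical helper: name, props shape, and signature are illustrative only.
interface CollapsibleA11yProps {
  tabIndex: 0 | -1;
  "aria-hidden": boolean;
}

// Cutoff taken from the commit note that ties visibility to navOpacity > 0.05.
const NAV_VISIBLE_THRESHOLD = 0.05;

// Returns the props a collapsible link would spread so that it leaves the
// tab order and the accessibility tree whenever it is visually hidden.
function collapsibleA11yProps(navOpacity: number): CollapsibleA11yProps {
  const visible = navOpacity > NAV_VISIBLE_THRESHOLD;
  return {
    tabIndex: visible ? 0 : -1,
    "aria-hidden": !visible,
  };
}
```

In a component these props would be spread onto each nav link and the action link, so the visual hide and the focus hide share one condition.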

Open-set banner and snapshot pill

  • role="note" warning-tinted banner above the leaderboard.
  • Snapshot date pill (Snapshot 2026-05-01) on the home stat row and next to the Manuscript eyebrow on /paper.

Sensitivity-view selector

  • Segmented control: Main / Amount only / Binary only / Positive cases / Zero cases. Selecting a view rescores models client-side from scenarioPredictions and reorders the table; the description for the active view appears inline. The control is role="group" with aria-labelledby; each button has aria-pressed.
  • Auto-fallback for global: when a slice has no rows in one country (e.g. Binary-only on Global, since UK has zero binary outputs), viewSupportsGlobal returns false, the leaderboard falls back to the Main view with a role="note" notice, and the offending button becomes aria-disabled with the click handler short-circuited so it cannot simultaneously be pressed and disabled.
  • New utilities under app/src/lib/:
    • scoring.ts ports score_single_prediction (mean of exact / within-1% / within-5% / within-10% for amount outputs; classification accuracy for binary; output-group resolution for person-expanded variables). Verified against canonical analysis.py for both countries' top models.
    • sensitivity.ts builds the per-row score table from a DashboardBundle and aggregates output-group means → country → global, preserving country-equal weighting. Exposes viewSupportsGlobal for global-validity checks.
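
A minimal sketch of the scoring rule and the country-equal aggregation described above, under stated assumptions. The types and function bodies are illustrative and are not copied from scoring.ts or sensitivity.ts. In particular, treating "exact" as strict equality and scoring a nonzero prediction against a zero reference as 0 are assumptions, and scores here are on a 0 to 1 scale rather than the 0 to 100 scale shown on the leaderboard.

```typescript
type Country = "us" | "uk";
type OutputKind = "amount" | "binary";

interface ScoredRow {
  country: Country;
  outputGroup: string;
  score: number; // 0..1
}

// Amount outputs: mean of four indicators (exact, within 1%, 5%, 10%).
// Binary outputs: 1 if the class matches, else 0.
// Assumption: "exact" means strict equality; a zero reference only matches zero.
function scoreSinglePrediction(kind: OutputKind, prediction: number, reference: number): number {
  if (kind === "binary") return prediction === reference ? 1 : 0;
  if (prediction === reference) return 1; // all four indicators hold
  if (reference === 0) return 0;
  const relErr = Math.abs(prediction - reference) / Math.abs(reference);
  const hits = [0.01, 0.05, 0.1].filter((tol) => relErr <= tol).length;
  return hits / 4; // the exact indicator is 0 on this path
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Row-equal mean inside each output group, then an unweighted mean of groups.
function countryScore(rows: ScoredRow[], country: Country): number | null {
  const byGroup = new Map<string, number[]>();
  for (const row of rows) {
    if (row.country !== country) continue;
    const bucket = byGroup.get(row.outputGroup) ?? [];
    bucket.push(row.score);
    byGroup.set(row.outputGroup, bucket);
  }
  if (byGroup.size === 0) return null; // slice has no rows in this country
  const groupMeans: number[] = [];
  byGroup.forEach((scores) => groupMeans.push(mean(scores)));
  return mean(groupMeans);
}

// Country-equal global score; null when any required country has no rows,
// which is the condition under which a global ranking is not meaningful.
function globalScore(rows: ScoredRow[], required: Country[] = ["us", "uk"]): number | null {
  const perCountry = required.map((c) => countryScore(rows, c));
  if (perCountry.some((s) => s === null)) return null;
  return mean(perCountry as number[]);
}
```

Sensitivity views would filter `rows` before calling `globalScore`; a `null` result corresponds to the case where the leaderboard falls back to Main.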

Bootstrap rank intervals

  • bootstrap.ts implements household-resampling with a deterministic mulberry32 RNG (seed 42, DEFAULT_DRAWS = 400). Inside each draw it adds bucket sum and count directly across sampled scenarios so each row contributes equally to the output-group mean — matching the headline modelScoresForView estimator instead of collapsing each scenario to a per-scenario mean first.
  • ModelLeaderboard renders Rank N(-M) · 95% L-U next to each model's point estimate. Sample current Global view: Rank 1 has 95% CI ~74.7–79.6; Rank 2-3 cluster ~73.5–78.7. Tooltip names the bootstrap parameters.
  • Snapshot top-3 scores against current app/src/data.json: US gpt-5.5 90.03 / grok-4.20 89.29 / gemini-3.1-pro-preview 88.21; UK gpt-5.5 77.18 / gemini-3.1-pro-preview 76.18 / grok-4.20 75.07; Global gpt-5.5 83.60 / gemini-3.1-pro-preview 82.20 / grok-4.20 82.18.
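
The household-resampling step can be sketched as follows for a single model in a single country. This is an illustration under assumptions, not the contents of bootstrap.ts: the `Scenario` shape, the function names, and the nearest-rank percentile rule are hypothetical. The mulberry32 generator, seed 42, and 400 draws follow the description above.

```typescript
interface ScoredRow {
  outputGroup: string;
  score: number; // 0..1
}

interface Scenario {
  householdId: string;
  rows: ScoredRow[];
}

const SEED = 42;
const DEFAULT_DRAWS = 400;

// Standard mulberry32: a small deterministic 32-bit PRNG returning values in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Adds each row's score to its output-group bucket directly, so every row
// weighs equally in the group mean (no per-scenario mean is taken first).
function rowEqualScore(scenarios: Scenario[]): number {
  const buckets = new Map<string, { sum: number; count: number }>();
  for (const scenario of scenarios) {
    for (const row of scenario.rows) {
      const bucket = buckets.get(row.outputGroup) ?? { sum: 0, count: 0 };
      bucket.sum += row.score;
      bucket.count += 1;
      buckets.set(row.outputGroup, bucket);
    }
  }
  if (buckets.size === 0) return NaN;
  let total = 0;
  buckets.forEach((b) => { total += b.sum / b.count; });
  return total / buckets.size;
}

// Resamples households with replacement and returns a 95% percentile interval.
// Assumption: nearest-rank percentiles on the sorted draw statistics.
function bootstrapInterval(
  scenarios: Scenario[],
  draws: number = DEFAULT_DRAWS,
  seed: number = SEED,
): { low: number; high: number } {
  const rng = mulberry32(seed);
  const n = scenarios.length;
  const stats: number[] = [];
  for (let d = 0; d < draws; d++) {
    const sample: Scenario[] = [];
    for (let i = 0; i < n; i++) sample.push(scenarios[Math.floor(rng() * n)]);
    stats.push(rowEqualScore(sample));
  }
  stats.sort((x, y) => x - y);
  const at = (q: number) => stats[Math.min(stats.length - 1, Math.floor(q * stats.length))];
  return { low: at(0.025), high: at(0.975) };
}
```

A rank range would follow the same pattern: reuse the sampled household indices across all models within a draw, rank the models by their per-draw score, and report the spread of ranks each model receives.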

Repo

  • .gitignore keeps the Python lib/ blanket ignore and adds an explicit !app/src/lib/ + !app/src/lib/** allowlist so app/src/lib/ is tracked while nested lib/ directories elsewhere stay ignored.
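
The relevant .gitignore lines would look roughly like this. It is a sketch of the rules named above; the comments and the surrounding file contents are assumed.

```
# Python build artifacts: ignore any directory named lib/
lib/

# Allowlist the frontend utilities so app/src/lib/ stays tracked
!app/src/lib/
!app/src/lib/**
```

Order matters here: in a .gitignore the last matching pattern wins, so the negated patterns have to come after the blanket `lib/` rule.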

Verification

  • bun run lint — clean
  • bun run build — clean (Next.js 16 production build)
  • bun run start — SSR HTML on / contains the open-set banner with role="note", the snapshot pill, the sensitivity selector with aria-pressed × 5 and role="group", the country view selector with aria-pressed × 3 and role="group", per-model bootstrap intervals, and the "by PolicyEngine" pill. /paper renders the same chrome without the view selector.
  • Scoring math reconciled against policybench/analysis.py for the snapshot's top-5 models per country.

Test plan

  • CI passes
  • Visit / — confirm header drops "Top score" / "Leading" and the snapshot pill is visible
  • Switch sensitivity view to Positive cases → leaderboard reorders and intervals refresh
  • Switch to Binary only while on Global → leaderboard falls back to Main with a notice; the Binary-only button is aria-disabled
  • Visit /paper → same sticky header style without a view selector
  • Mobile width → bootstrap interval line wraps under each model row

🤖 Generated with Claude Code


vercel Bot commented May 2, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
policybench | Ready | Preview, Comment | May 6, 2026 2:30pm


Adds the credibility-tightening tier-1 leaderboard changes from
docs/site_improvements_scope.md, plus a shared sticky header that the
paper page now reuses.

Header
- Extract SiteHeader from Hero. The new component owns the sticky
  brand + nav + view-selector + action-link layout and supports an
  alwaysExpanded mode for pages without an in-page hero.
- Hero refactored to wrap SiteHeader and pass the country-aware
  subtitle, stat strip, and snapshot pill as expandedContent. Drop the
  "Top score" stat and the "Leading: <model>" sidebar; the leaderboard
  itself is the canonical source for both.
- /paper uses SiteHeader with alwaysExpanded, no view selector, and a
  Benchmark action link. The page body keeps its eyebrow/buttons/iframe.

Open-set banner + snapshot pill
- Above the leaderboard, a warning-tinted note states that scenarios
  and reference outputs are public, so the public preview is open-set.
- Snapshot date pill (Snapshot 2026-05-01) appears in the hero stat row
  on the home page and next to the Manuscript eyebrow on /paper.

Sensitivity-view selector
- New segmented control with five views: Main, Amount only, Binary
  only, Positive cases, Zero cases. Selecting a view rescores models
  client-side from scenarioPredictions and reorders the leaderboard;
  the description for the active view appears next to the selector.
- New utilities under app/src/lib/:
  - scoring.ts ports score_single_prediction (mean of exact, within-1%,
    within-5%, within-10% for amount; classification accuracy for
    binary; output-group resolution for person-expanded variables).
    Verified against canonical analysis.py on the snapshot for
    both US and UK headline scopes.
  - sensitivity.ts builds the per-row score table from a DashboardBundle
    and aggregates output-group means -> country -> global, preserving
    the country-equal weighting. Sensitivity views filter rows before
    aggregation.

Bootstrap rank intervals
- bootstrap.ts implements the household-resampling bootstrap with a
  deterministic mulberry32 RNG (seed 42, 400 draws) and reports the
  95% score interval and the rank range for each model under the
  active sensitivity view.
- ModelLeaderboard renders Rank N(-M) - 95% L-U next to each model's
  point estimate, with a tooltip naming the bootstrap parameters.

Repo
- Move the python wheel-artifact lib/ rule in .gitignore to /lib/ and
  /lib64/ (top-level only) so app/src/lib/ is tracked.

Verification
- bun run lint - clean
- bun run build - clean (Next.js 16 production build)
- bun run start - SSR render of / contains the open-set banner, the
  snapshot pill, the five sensitivity selector chips, and per-model
  Rank/95% interval rows for all 12 models. /paper renders SiteHeader
  with the snapshot pill and Benchmark action link, no view selector.
- bootstrap.ts now sums per-row sums and counts directly when
  aggregating output-group means inside each draw, so the bootstrap
  estimator matches the canonical headline scoring rule (each row
  contributes equally to the output-group mean instead of being
  collapsed to a per-scenario mean first).
- modelScoresForView and bootstrapIntervals require every required
  country to have rows under the active sensitivity slice before
  returning a global ranking. ModelLeaderboard falls back to Main when
  a slice has no rows in one country (e.g. "Binary only" with no UK
  binary outputs) and surfaces a notice; sensitivity buttons that
  cannot apply globally are aria-disabled with a tooltip.
- Sensitivity selector and country view selector now expose role,
  aria-label, and aria-pressed state.
- SiteHeader collapsed nav items and action link are no longer
  keyboard-focusable while hidden (tabIndex=-1, aria-hidden).
- useScrollProgress no longer subscribes to scroll when
  alwaysExpanded, and DEFAULT_DRAWS is exported and used as the single
  source for the bootstrap draw count (400).
- .gitignore restores the Python lib/ blanket ignore and adds an
  explicit !app/src/lib/ + !app/src/lib/** allowlist so app/src/lib is
  tracked while nested lib/ directories elsewhere stay ignored.
- Sensitivity buttons that are aria-disabled for the Global view no
  longer fire onClick, and aria-pressed is force-cleared on those
  buttons so they cannot simultaneously claim selected and disabled
  state. Cursor changes to not-allowed too.
- Auto-fallback notice quotes the slice label (e.g. "Binary only")
  instead of inlining a lower-cased phrase, so proper-noun feel is
  preserved.

Sensitivity buttons that are unavailable for the Global view now use
the native disabled attribute (which removes them from the tab order
and lets the browser ignore Enter/Space presses), instead of relying
on aria-disabled + an undefined onClick. The aria-pressed force-clear
is preserved so the button never claims selected and disabled
simultaneously.

navVisible was set unconditionally on the home page even while the
in-page nav had opacity:0 / max-width:0, leaving the nav links
keyboard-focusable in the collapsed state. Tie navVisible to the same
navOpacity > 0.05 threshold the visual hide uses, so the links stay
out of the tab order until they are actually visible. Also dedupes
navOpacity (was declared twice).