Add historical-replay mode for benchmark fairness by smodee · Pull Request #24 · algorithmicgovernance/BioScanCast

smodee · 2026-05-20T12:48:51Z

Summary

Adds opt-in historical-replay mode for benchmarking against human forecasters. Activated by setting ForecastQuestion.as_of_date; None (default) preserves live behavior exactly.

Cutoff plumbing. as_of_date threaded through search and extraction. SearchResult and Document gain published_date_source, cutoff_applied, fetch_strategy, snapshot_timestamp for audit.
Search-side filter. Drops post-cutoff and undatable results. URL-slug parsing and Wayback first-seen lookup recover dates for undated results. Selective gate skips Wayback for aggregator/unknown-tier domains.
Tavily. Forwards start_date+end_date pair — empirically the only configuration Tavily actually honors (passing end_date alone is silently ignored). 12-month default lookback. Sub-queries get cutoff year appended.
Dashboards. Rewritten to closest Wayback snapshot ≤ cutoff. Suppressed entirely (not fallen back to live) if no pre-cutoff snapshot exists, since silent live-fallback would defeat the benchmark.
Extraction. Fetcher uses Wayback id_ snapshots in historical mode; falls back to live with strategy recorded on Document. RFC 2822 date parsing added for Tavily news payloads.
Wayback hardening. Proactive 2s throttle + retry/backoff on 429/5xx/timeout. Live smoke-test wall-clock went from ~30 min to ~49 s.
Eval diagnostics. eval_stage/contamination.py adds filter_caught_contamination_rate (lower bound, never silent) and retrieval_free_baseline_forecast (E4).
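The dashboard policy above (closest snapshot at or before the cutoff, otherwise suppress) can be sketched roughly as follows. This is an illustration, not the code in this PR: `pick_snapshot` and `rewrite_dashboard` are hypothetical names, the list of capture timestamps is assumed to have been fetched already, and treating the cutoff day as inclusive through 23:59:59 is an assumption.

```python
from datetime import date
from typing import Iterable, Optional


def pick_snapshot(timestamps: Iterable[str], cutoff: date) -> Optional[str]:
    """Return the latest 14-digit Wayback timestamp at or before the cutoff day."""
    # Timestamps are YYYYMMDDhhmmss, so string order equals chronological order.
    limit = cutoff.strftime("%Y%m%d") + "235959"  # assumption: cutoff day is inclusive
    eligible = [ts for ts in timestamps if ts <= limit]
    return max(eligible) if eligible else None


def rewrite_dashboard(url: str, timestamps: Iterable[str], cutoff: date) -> Optional[str]:
    """Rewrite a dashboard URL to a snapshot URL, or return None to suppress it."""
    ts = pick_snapshot(timestamps, cutoff)
    if ts is None:
        # No pre-cutoff capture: suppress instead of silently falling back to live.
        return None
    return f"https://web.archive.org/web/{ts}/{url}"
```

Returning None rather than the live URL keeps the suppression decision explicit at the call site, which matches the stated goal of never leaking post-cutoff dashboard state into a replay.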

The Tavily date-window finding and the Wayback throttle/gate were each preceded by a small off-repo investigation; the probe and analyzer scripts in scripts/ are committed for reproducibility.

Closes #5.

Test plan

pytest bioscancast/tests/ — 267 passed, 2 skipped (live).
Live smoke (scripts/test_historical_replay.py) against q1 (H5N1 US, cutoff 2025-02-17): dashboards Wayback-rewritten to Feb 7/15 2025, 20/20 Tavily pre-cutoff hit rate per sub-query, no post-cutoff leaks, ~49 s wall-clock.
Reviewer: spot-check that as_of_date=None paths are unchanged. The pre-existing 252 tests still passing covers the live-mode regression surface; the 15 new tests cover historical-mode-specific behavior.
Reviewer: confirm the dashboard-suppression-on-Wayback-miss policy. The alternative is silent live-fallback, which defeats the benchmark's purpose.

Activated by setting ForecastQuestion.as_of_date; None (default) preserves live behavior unchanged.

Core:
* Schema fields published_date_source, cutoff_applied, fetch_strategy, snapshot_timestamp on SearchResult and Document for post-hoc audit
* SearchStagePipeline filters post-cutoff results; recovers undated ones from URL slug or Wayback first-seen; drops the unrecoverable
* lookup_dashboards rewrites URLs to closest Wayback snapshot at-or-before cutoff; suppresses dashboards with no pre-cutoff snapshot
* ExtractionPipeline fetches via Wayback id_ snapshots when cutoff is set; falls back to live with strategy recorded on Document
* SearchCache key incorporates as_of_date so replays don't collide
* Optional historical_roleplay decomposition prompt
* eval_stage/contamination.py adds filter_caught_contamination_rate (explicit lower bound, never silent) and retrieval_free baseline (E4)

Hardening surfaced by live testing:
* Tavily end_date accepted on the Protocol but not forwarded (verified empirically not to filter)
* Wayback CDX retries on 5xx / 429 / timeout with exponential backoff
* Sub-queries get cutoff year appended in historical mode
* Top-up round with bigger max_results when survivors < threshold
* RFC 2822 date parsing for Tavily news topic responses

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
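For readers who want the shape of the search-side filter, here is a minimal sketch of the filter-then-recover-then-drop flow. It covers only the URL-slug recovery leg. `Result`, `date_from_slug`, `apply_cutoff`, the slug regex, and the `"url_slug"` source label are all hypothetical stand-ins, not the names or patterns used in SearchStagePipeline.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical slug pattern: matches paths such as /2025/02/07/ or /2025-02-07.
_SLUG_DATE = re.compile(r"/(20\d{2})[/-](\d{1,2})[/-](\d{1,2})(?:[/-]|$)")


@dataclass
class Result:  # simplified stand-in for SearchResult
    url: str
    published_date: Optional[date] = None
    published_date_source: Optional[str] = None


def date_from_slug(url: str) -> Optional[date]:
    """Try to read a publication date out of the URL path."""
    m = _SLUG_DATE.search(url)
    if not m:
        return None
    try:
        return date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    except ValueError:  # digits matched but do not form a real date
        return None


def apply_cutoff(results: List[Result], as_of_date: Optional[date]) -> List[Result]:
    """Keep results dated at or before the cutoff; drop the rest."""
    if as_of_date is None:
        return results  # live mode: no filtering at all
    kept = []
    for r in results:
        if r.published_date is None:
            recovered = date_from_slug(r.url)
            if recovered is None:
                continue  # undatable: dropped rather than trusted
            r.published_date = recovered
            r.published_date_source = "url_slug"  # recorded for audit
        if r.published_date <= as_of_date:
            kept.append(r)
    return kept
```

Dropping undatable results is the conservative choice for a benchmark: a result with no recoverable date cannot be shown to predate the cutoff.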

Tavily's news endpoint silently ignores `end_date` when passed alone but honors the start_date+end_date pair, returning 20/20 native pre-cutoff results across the resolved corpus (q1, q3, q7, q9). The pipeline now synthesizes `start_date = as_of_date - 365d` (configurable via `historical_lookback_days`) and forwards both bounds.

The 0/20 pre-cutoff disaster on live testing of q1 is fully addressed by this change; the post-retrieval cutoff filter remains as defense-in-depth. The TavilyBackend drops a lone `end_date` with a warning rather than sending a request Tavily will misinterpret. Stale comments referring to "Tavily ignores end_date" are updated across pipeline.py and tavily_backend.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
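The window synthesis and the lone-`end_date` guard described in this commit might look roughly like the sketch below. The function names are hypothetical, the YYYY-MM-DD string format is an assumption about what the backend sends, and passing a lone `start_date` through unchanged is also an assumption, since the commit message only specifies the lone-`end_date` case.

```python
import warnings
from datetime import date, timedelta
from typing import Dict, Optional


def tavily_date_window(as_of_date: Optional[date],
                       historical_lookback_days: int = 365) -> Dict[str, str]:
    """Build the start/end pair for historical mode; empty dict in live mode."""
    if as_of_date is None:
        return {}
    start = as_of_date - timedelta(days=historical_lookback_days)
    return {"start_date": start.isoformat(), "end_date": as_of_date.isoformat()}


def build_search_kwargs(query: str,
                        start_date: Optional[str] = None,
                        end_date: Optional[str] = None) -> Dict[str, str]:
    """Assemble request kwargs, refusing to send end_date without start_date."""
    kwargs = {"query": query}
    if start_date and end_date:
        kwargs["start_date"] = start_date
        kwargs["end_date"] = end_date
    elif end_date:
        # A lone end_date was observed to be silently ignored, so drop it loudly.
        warnings.warn("end_date without start_date is not forwarded to Tavily")
    elif start_date:
        kwargs["start_date"] = start_date  # assumption: lone start_date is passed through
    return kwargs
```

Usage would be along the lines of `build_search_kwargs("H5N1 US cases 2025", **tavily_date_window(as_of_date))`, with the post-retrieval cutoff filter still applied to whatever comes back.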

The Wayback CDX endpoint rate-limits at ~60 req/min server-side. The existing reactive RETRY_BACKOFF_SECONDS = (0, 10, 30, 90, 240) ladder only fires after the server has already started returning 429s, burning ~6 min per failure. Historical-replay benchmarks routinely hit dozens of these failures, producing ~30 min wall-clock on q1.

This commit adds two complementary measures:

1. Proactive throttle in wayback.py: a module-level _throttle() gate sleeps before every urlopen to maintain a 2.0 s minimum interval (~30 req/min, half the server cap). Overridable via env var BIOSCANCAST_WAYBACK_MIN_INTERVAL_SECONDS. The retry ladder is unchanged and still handles genuine 503s / read timeouts.
2. Selective-recovery gate in pipeline._apply_cutoff_filter: skip the Wayback first-seen leg of recover_published_date() for aggregator domains (metaculus, manifold, kalshi, ...) and source_tier=="unknown". The URL-slug regex and Last-Modified strategies still run for gated results. New wayback_skipped counter in the cutoff-filter log line.

Live q1 smoke test: ~30 min -> 49 s (~37x). Test suite: 252 -> 261 passed (9 new tests covering the throttle gate, env override, retry interaction, gate decisions, and end-to-end recovery routing).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

probe_tavily_topic.py grows from a one-query news/general comparison into a corpus iterator with config caching, synthetic-backdated stress queries, and per-knob result dumping. analyze_tavily_probe.py is new: it reads the cached probe payloads and recomputes hit-rate tables without re-paying the Tavily quota.

Both scripts default to writing/reading specs/probe-results/ (gitignored by convention; create on first run). They were the workhorses behind the start_date+end_date investigation that produced commit 211f6df.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

smodee and others added 4 commits May 20, 2026 11:54

smodee requested a review from rapsoj May 20, 2026 13:05

smodee wants to merge 4 commits into main from feat/as-of-date-replay