Add historical-replay mode for benchmark fairness #24

Open
smodee wants to merge 4 commits into main from feat/as-of-date-replay

Conversation

@smodee
Collaborator

@smodee smodee commented May 20, 2026

Summary

Adds opt-in historical-replay mode for benchmarking against human forecasters. Activated by setting ForecastQuestion.as_of_date; None (default) preserves live behavior exactly.

  • Cutoff plumbing. as_of_date threaded through search and extraction. SearchResult and Document gain published_date_source, cutoff_applied, fetch_strategy, snapshot_timestamp for audit.
  • Search-side filter. Drops post-cutoff and undatable results. For undated results, a URL-slug parse and a Wayback first-seen lookup try to recover the publication date. A selective gate skips the Wayback lookup for aggregator and unknown-tier domains.
  • Tavily. Forwards start_date+end_date pair — empirically the only configuration Tavily actually honors (passing end_date alone is silently ignored). 12-month default lookback. Sub-queries get cutoff year appended.
  • Dashboards. Rewritten to closest Wayback snapshot ≤ cutoff. Suppressed entirely (not fallen back to live) if no pre-cutoff snapshot exists, since silent live-fallback would defeat the benchmark.
  • Extraction. Fetcher uses Wayback id_ snapshots in historical mode; falls back to live with strategy recorded on Document. RFC 2822 date parsing added for Tavily news payloads.
  • Wayback hardening. Proactive 2s throttle + retry/backoff on 429/5xx/timeout. Live smoke-test wall-clock went from ~30 min to ~49 s.
  • Eval diagnostics. eval_stage/contamination.py adds filter_caught_contamination_rate (lower bound, never silent) and retrieval_free_baseline_forecast (E4).
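
The search-side filter described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's code: `Result`, `date_from_slug`, and `apply_cutoff` are hypothetical names, the slug regex only handles `/YYYY/MM/DD/` paths, the Wayback first-seen leg is omitted, and keeping a result dated exactly on the cutoff is an assumption.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Result:
    """Hypothetical stand-in for SearchResult; only the fields used here."""
    url: str
    published_date: Optional[date] = None
    published_date_source: Optional[str] = None


# Assumed slug shape: /YYYY/MM/DD/ somewhere in the URL path.
_SLUG_DATE = re.compile(r"/(20\d{2})/(\d{1,2})/(\d{1,2})(?:/|$)")


def date_from_slug(url: str) -> Optional[date]:
    """Try to read a publication date out of the URL path."""
    match = _SLUG_DATE.search(url)
    if match is None:
        return None
    try:
        return date(*(int(part) for part in match.groups()))
    except ValueError:  # e.g. month 13 or day 32
        return None


def apply_cutoff(results: List[Result], as_of_date: Optional[date]) -> List[Result]:
    """Keep results dated at or before the cutoff; drop undatable ones."""
    if as_of_date is None:  # live mode: no filtering at all
        return list(results)
    kept = []
    for result in results:
        if result.published_date is None:
            recovered = date_from_slug(result.url)
            if recovered is None:
                continue  # undatable, so it cannot be proven pre-cutoff
            result.published_date = recovered
            result.published_date_source = "url_slug"
        if result.published_date <= as_of_date:
            kept.append(result)
    return kept
```

Dropping undatable results rather than keeping them is the conservative choice for a benchmark: a leaked post-cutoff document costs more than a missing pre-cutoff one.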

The Tavily date-window finding and the Wayback throttle/gate were each preceded by a small off-repo investigation; the probe and analyzer scripts in scripts/ are committed for reproducibility.

Closes #5.

Test plan

  • pytest bioscancast/tests/ — 267 passed, 2 skipped (live).
  • Live smoke (scripts/test_historical_replay.py) against q1 (H5N1 US, cutoff 2025-02-17): dashboards Wayback-rewritten to Feb 7/15 2025, 20/20 Tavily pre-cutoff hit rate per sub-query, no post-cutoff leaks, ~49 s wall-clock.
  • Reviewer: spot-check that as_of_date=None paths are unchanged. The 252 pre-existing tests still pass, which covers the live-mode regression surface; the 15 new tests cover historical-mode-specific behavior.
  • Reviewer: confirm the dashboard-suppression-on-Wayback-miss policy. The alternative is silent live-fallback, which defeats the benchmark's purpose.
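
To make the dashboard policy under review concrete, here is a small sketch of "closest snapshot at or before the cutoff, otherwise suppress". The function names are hypothetical and the snapshot list is assumed to be already fetched; it relies only on Wayback's 14-digit `YYYYMMDDhhmmss` timestamps, which sort lexicographically, and on the public `https://web.archive.org/web/<timestamp>/<url>` URL form.

```python
from typing import Iterable, Optional


def pick_snapshot(timestamps: Iterable[str], cutoff: str) -> Optional[str]:
    """Return the latest 14-digit timestamp <= cutoff, or None if there is none."""
    eligible = [ts for ts in timestamps if ts <= cutoff]
    return max(eligible) if eligible else None


def rewrite_dashboard(url: str, timestamps: Iterable[str], cutoff: str) -> Optional[str]:
    """Rewrite a dashboard URL to its pre-cutoff snapshot, or suppress it."""
    ts = pick_snapshot(timestamps, cutoff)
    if ts is None:
        return None  # suppressed: never fall back to the live page
    return f"https://web.archive.org/web/{ts}/{url}"
```

Returning `None` instead of the live URL forces the caller to handle the miss explicitly, which is the behavior the reviewer is asked to confirm.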

smodee and others added 4 commits May 20, 2026 11:54
Activated by setting ForecastQuestion.as_of_date; None (default)
preserves live behavior unchanged.

Core:
  * Schema fields published_date_source, cutoff_applied, fetch_strategy,
    snapshot_timestamp on SearchResult and Document for post-hoc audit
  * SearchStagePipeline filters post-cutoff results; recovers undated
    ones from URL slug or Wayback first-seen; drops the unrecoverable
  * lookup_dashboards rewrites URLs to closest Wayback snapshot at-or-
    before cutoff; suppresses dashboards with no pre-cutoff snapshot
  * ExtractionPipeline fetches via Wayback id_ snapshots when cutoff
    is set; falls back to live with strategy recorded on Document
  * SearchCache key incorporates as_of_date so replays don't collide
  * Optional historical_roleplay decomposition prompt
  * eval_stage/contamination.py adds filter_caught_contamination_rate
    (explicit lower bound, never silent) and retrieval_free baseline (E4)
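
As an illustration of the cache-key item above, one way to keep live and replay entries from colliding is to fold the cutoff into the hashed key. The real SearchCache key layout is not shown in this PR; `cache_key` and its JSON payload are assumptions made for the sketch.

```python
import hashlib
import json
from datetime import date
from typing import Optional


def cache_key(query: str, as_of_date: Optional[date]) -> str:
    """Hash the query together with the cutoff (None means live mode)."""
    payload = json.dumps(
        {
            "query": query,
            "as_of_date": as_of_date.isoformat() if as_of_date else None,
        },
        sort_keys=True,  # stable key order so equal inputs hash identically
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```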

Hardening surfaced by live testing:
  * Tavily end_date accepted on the Protocol but not forwarded
    (verified empirically not to filter)
  * Wayback CDX retries on 5xx / 429 / timeout with exponential backoff
  * Sub-queries get cutoff year appended in historical mode
  * Top-up round with bigger max_results when survivors < threshold
  * RFC 2822 date parsing for Tavily news topic responses
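
For the RFC 2822 item above, the standard library already provides a parser. The sketch below is not the code in this PR: `parse_published` is a hypothetical helper, the ISO 8601 fallback and the "naive means UTC" rule are assumptions, and the date strings in the usage note are illustrative rather than captured Tavily output.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_published(raw: str) -> Optional[datetime]:
    """Parse an RFC 2822 or ISO 8601 date string into an aware UTC datetime."""
    try:
        parsed = parsedate_to_datetime(raw)
    except (TypeError, ValueError):  # exception type depends on Python version
        try:
            parsed = datetime.fromisoformat(raw)
        except ValueError:
            return None  # undatable; the caller decides what to do
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive == UTC
    return parsed.astimezone(timezone.utc)
```

For example, `parse_published("Mon, 17 Feb 2025 10:30:00 GMT")` yields an aware datetime for 10:30 UTC on 2025-02-17, and an unparseable string yields `None`.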

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tavily's news endpoint silently ignores `end_date` when passed alone but
honors the start_date+end_date pair, returning 20/20 native pre-cutoff
results across the resolved corpus (q1, q3, q7, q9). The pipeline now
synthesizes `start_date = as_of_date - 365d` (configurable via
`historical_lookback_days`) and forwards both bounds. This fixes the 0/20
pre-cutoff hit rate seen in live testing of q1; the post-retrieval cutoff
filter remains as defense-in-depth.
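
The window synthesis described above amounts to a single date subtraction. A minimal sketch, assuming a hypothetical `tavily_date_window` helper and ISO `YYYY-MM-DD` strings as the wire format (only `historical_lookback_days` and the 365-day default come from this commit):

```python
from datetime import date, timedelta
from typing import Dict, Optional


def tavily_date_window(
    as_of_date: Optional[date],
    historical_lookback_days: int = 365,
) -> Dict[str, str]:
    """Build the start_date/end_date pair; an empty dict means live mode."""
    if as_of_date is None:
        return {}
    start = as_of_date - timedelta(days=historical_lookback_days)
    return {
        "start_date": start.isoformat(),
        "end_date": as_of_date.isoformat(),
    }
```

With the q1 cutoff of 2025-02-17 the default window starts on 2024-02-18, not 2024-02-17, because the 365-day span crosses 29 February 2024.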

The TavilyBackend drops a lone `end_date` with a warning rather than
sending a request Tavily will misinterpret. Stale comments referring to
"Tavily ignores end_date" are updated across pipeline.py and
tavily_backend.py.
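
The lone-`end_date` guard described above might look like the following. `build_date_params` and the logger name are hypothetical, and forwarding a lone `start_date` unchanged is an assumption, since the commit only specifies the lone-`end_date` case.

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger("tavily_backend_sketch")  # hypothetical logger name


def build_date_params(start_date: Optional[str], end_date: Optional[str]) -> Dict[str, str]:
    """Forward date bounds only in a combination the backend is known to honor."""
    params: Dict[str, str] = {}
    if start_date and end_date:
        params["start_date"] = start_date
        params["end_date"] = end_date
    elif end_date:
        # A lone end_date was observed to be ignored, so drop it loudly
        # instead of sending a request that looks filtered but is not.
        logger.warning("Dropping lone end_date=%s; start_date is required", end_date)
    elif start_date:
        params["start_date"] = start_date  # assumption: passed through as-is
    return params
```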

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Wayback CDX endpoint rate-limits at ~60 req/min server-side. The
existing reactive RETRY_BACKOFF_SECONDS = (0, 10, 30, 90, 240) ladder
only fires after the server has already started returning 429s, burning
~6 min per failure. Historical-replay benchmarks routinely hit dozens
of these failures, producing ~30 min wall-clock on q1.
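
The "~6 min per failure" figure follows directly from the ladder quoted above, assuming a failing request walks every rung:

```python
# Retry ladder as quoted in this commit message.
RETRY_BACKOFF_SECONDS = (0, 10, 30, 90, 240)

# Worst-case time spent sleeping when a request exhausts the ladder.
total_seconds = sum(RETRY_BACKOFF_SECONDS)
total_minutes = round(total_seconds / 60, 1)

print(total_seconds, total_minutes)  # 370 6.2
```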

This commit adds two complementary measures:

1. Proactive throttle in wayback.py: a module-level _throttle() gate
   sleeps before every urlopen to maintain a 2.0 s minimum interval
   (~30 req/min, half the server cap). Overridable via env var
   BIOSCANCAST_WAYBACK_MIN_INTERVAL_SECONDS. The retry ladder is
   unchanged and still handles genuine 503s / read timeouts.

2. Selective-recovery gate in pipeline._apply_cutoff_filter: skip the
   Wayback first-seen leg of recover_published_date() for aggregator
   domains (metaculus, manifold, kalshi, ...) and source_tier=="unknown".
   The URL-slug regex and Last-Modified strategies still run for gated
   results. New wayback_skipped counter in the cutoff-filter log line.

Live q1 smoke test: ~30 min -> 49 s (~37x).
Test suite: 252 -> 261 passed (9 new tests covering the throttle gate,
env override, retry interaction, gate decisions, and end-to-end
recovery routing).
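
The selective-recovery gate from item 2 reduces to a cheap predicate evaluated before any Wayback call. In this sketch `should_skip_wayback` is a hypothetical name, the concrete domain strings are assumptions (the commit abbreviates the list as "metaculus, manifold, kalshi, ..."), and the tier labels in the usage note are illustrative; only `source_tier == "unknown"` is taken from the commit.

```python
from urllib.parse import urlparse

# Assumed domain spellings; the real list in the pipeline may be longer.
AGGREGATOR_DOMAINS = frozenset({"metaculus.com", "manifold.markets", "kalshi.com"})


def should_skip_wayback(url: str, source_tier: str) -> bool:
    """Return True when the Wayback first-seen lookup should be skipped."""
    if source_tier == "unknown":
        return True
    host = (urlparse(url).hostname or "").lower()
    # Match the registered domain itself and any subdomain of it.
    return any(
        host == domain or host.endswith("." + domain)
        for domain in AGGREGATOR_DOMAINS
    )
```

For instance, `should_skip_wayback("https://www.metaculus.com/questions/1/", "primary")` returns `True`, while a non-aggregator URL with a known tier returns `False` and proceeds to the Wayback leg.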

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
probe_tavily_topic.py grows from a one-query news/general comparison
into a corpus iterator with config caching, synthetic-backdated stress
queries, and per-knob result dumping. analyze_tavily_probe.py is new:
it reads the cached probe payloads and recomputes hit-rate tables
without re-paying the Tavily quota.

Both scripts default to writing/reading specs/probe-results/ (gitignored
by convention; create on first run). They were the workhorses behind
the start_date+end_date investigation that produced commit 211f6df.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@smodee smodee requested a review from rapsoj May 20, 2026 13:05

Development

Successfully merging this pull request may close these issues.

Assess Tavily published_date reliability

1 participant