Add historical-replay mode for benchmark fairness #24
Open
smodee wants to merge 4 commits into
Activated by setting ForecastQuestion.as_of_date; None (default)
preserves live behavior unchanged.
Core:
* Schema fields published_date_source, cutoff_applied, fetch_strategy,
snapshot_timestamp on SearchResult and Document for post-hoc audit
* SearchStagePipeline filters post-cutoff results; recovers undated
ones from URL slug or Wayback first-seen; drops the unrecoverable
* lookup_dashboards rewrites URLs to closest Wayback snapshot at-or-
before cutoff; suppresses dashboards with no pre-cutoff snapshot
* ExtractionPipeline fetches via Wayback id_ snapshots when cutoff
is set; falls back to live with strategy recorded on Document
* SearchCache key incorporates as_of_date so replays don't collide
* Optional historical_roleplay decomposition prompt
* eval_stage/contamination.py adds filter_caught_contamination_rate
(explicit lower bound, never silent) and retrieval_free baseline (E4)
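A minimal sketch of how a cache key might incorporate the cutoff so that live and replay runs never share entries. The function name `search_cache_key` and the `query` / `max_results` fields are assumptions for illustration; the actual SearchCache key layout is not shown in this PR text.

```python
import hashlib
import json
from datetime import date
from typing import Optional


def search_cache_key(query: str, max_results: int,
                     as_of_date: Optional[date] = None) -> str:
    """Hypothetical key builder: hash the query parameters plus the cutoff."""
    payload = {
        "query": query,
        "max_results": max_results,
        # None marks live mode, so live and replay keys always differ.
        "as_of_date": as_of_date.isoformat() if as_of_date else None,
    }
    # sort_keys gives a stable serialization regardless of dict ordering.
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

Hashing a canonical JSON payload rather than concatenating strings avoids accidental collisions between, say, a query that ends in a date and a separate cutoff value.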
Hardening surfaced by live testing:
* Tavily end_date accepted on the Protocol but not forwarded
(verified empirically not to filter)
* Wayback CDX retries on 5xx / 429 / timeout with exponential backoff
* Sub-queries get cutoff year appended in historical mode
* Top-up round with bigger max_results when survivors < threshold
* RFC 2822 date parsing for Tavily news topic responses
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
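The date handling described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the code in pipeline.py: the helper names `parse_published_date` and `is_pre_cutoff` are hypothetical, the cutoff is assumed to be inclusive, and naive timestamps are assumed to be UTC. Only the standard-library `email.utils.parsedate_to_datetime` is used for RFC 2822 parsing.

```python
from datetime import date, datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_published_date(raw: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 or RFC 2822 timestamp; return None if neither fits."""
    if not raw:
        return None
    try:
        parsed = datetime.fromisoformat(raw)
    except ValueError:
        try:
            # Handles strings such as "Mon, 17 Feb 2025 08:30:00 GMT".
            parsed = parsedate_to_datetime(raw)
        except (TypeError, ValueError):
            return None
    # Assumption: treat naive timestamps as UTC so comparisons are defined.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def is_pre_cutoff(raw: Optional[str], as_of_date: date) -> bool:
    """True only when a date was parsed and falls on or before the cutoff."""
    parsed = parse_published_date(raw)
    if parsed is None:
        return False  # unparseable dates are treated as unrecoverable here
    return parsed.astimezone(timezone.utc).date() <= as_of_date
```

In this sketch an undated result is simply rejected; the pipeline described above first tries the URL slug and Wayback first-seen date before dropping it.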
Tavily's news endpoint silently ignores `end_date` when passed alone but honors the `start_date`+`end_date` pair, returning 20/20 native pre-cutoff results across the resolved corpus (q1, q3, q7, q9).

The pipeline now synthesizes `start_date = as_of_date - 365d` (configurable via `historical_lookback_days`) and forwards both bounds. The 0/20 pre-cutoff disaster on live testing of q1 is fully addressed by this change; the post-retrieval cutoff filter remains as defense-in-depth.

The TavilyBackend drops a lone `end_date` with a warning rather than sending a request Tavily will misinterpret. Stale comments referring to "Tavily ignores end_date" are updated across pipeline.py and tavily_backend.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
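A rough sketch of the window synthesis and the lone-`end_date` guard described in this commit. The function names `build_date_window` and `sanitize_date_params` are hypothetical, and the `YYYY-MM-DD` string format is an assumption about what the backend sends; only `as_of_date`, `historical_lookback_days`, `start_date`, and `end_date` come from the commit text.

```python
import logging
from datetime import date, timedelta
from typing import Dict, Optional

logger = logging.getLogger(__name__)


def build_date_window(as_of_date: Optional[date],
                      historical_lookback_days: int = 365) -> Dict[str, str]:
    """Return both bounds in historical mode, or nothing in live mode."""
    if as_of_date is None:
        return {}  # live behavior: no date parameters at all
    start = as_of_date - timedelta(days=historical_lookback_days)
    return {"start_date": start.isoformat(), "end_date": as_of_date.isoformat()}


def sanitize_date_params(params: Dict[str, str]) -> Dict[str, str]:
    """Drop a lone end_date with a warning instead of forwarding it."""
    if "end_date" in params and "start_date" not in params:
        logger.warning("Dropping end_date=%s: no start_date to pair it with",
                       params["end_date"])
        return {k: v for k, v in params.items() if k != "end_date"}
    return dict(params)
```

For the q1 cutoff of 2025-02-17, a 365-day lookback yields a start of 2024-02-18, because the window spans the leap day in February 2024.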
The Wayback CDX endpoint rate-limits at ~60 req/min server-side. The existing reactive RETRY_BACKOFF_SECONDS = (0, 10, 30, 90, 240) ladder only fires after the server has already started returning 429s, burning ~6 min per failure. Historical-replay benchmarks routinely hit dozens of these failures, producing ~30 min wall-clock on q1.

This commit adds two complementary measures:

1. Proactive throttle in wayback.py: a module-level _throttle() gate sleeps before every urlopen to maintain a 2.0 s minimum interval (~30 req/min, half the server cap). Overridable via env var BIOSCANCAST_WAYBACK_MIN_INTERVAL_SECONDS. The retry ladder is unchanged and still handles genuine 503s / read timeouts.
2. Selective-recovery gate in pipeline._apply_cutoff_filter: skip the Wayback first-seen leg of recover_published_date() for aggregator domains (metaculus, manifold, kalshi, ...) and source_tier=="unknown". The URL-slug regex and Last-Modified strategies still run for gated results. New wayback_skipped counter in the cutoff-filter log line.

Live q1 smoke test: ~30 min -> 49 s (~37x).
Test suite: 252 -> 261 passed (9 new tests covering the throttle gate, env override, retry interaction, gate decisions, and end-to-end recovery routing).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
probe_tavily_topic.py grows from a one-query news/general comparison into a corpus iterator with config caching, synthetic-backdated stress queries, and per-knob result dumping. analyze_tavily_probe.py is new: it reads the cached probe payloads and recomputes hit-rate tables without re-paying the Tavily quota.

Both scripts default to writing/reading specs/probe-results/ (gitignored by convention; create on first run). They were the workhorses behind the start_date+end_date investigation that produced commit 211f6df.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Adds opt-in historical-replay mode for benchmarking against human forecasters. Activated by setting `ForecastQuestion.as_of_date`; `None` (default) preserves live behavior exactly.

* `as_of_date` threaded through search and extraction. `SearchResult` and `Document` gain `published_date_source`, `cutoff_applied`, `fetch_strategy`, `snapshot_timestamp` for audit.
* Tavily search forwards a `start_date`+`end_date` pair, empirically the only configuration Tavily actually honors (passing `end_date` alone is silently ignored). 12-month default lookback. Sub-queries get cutoff year appended.
* Extraction fetches via Wayback `id_` snapshots in historical mode; falls back to live with strategy recorded on `Document`. RFC 2822 date parsing added for Tavily news payloads.
* `eval_stage/contamination.py` adds `filter_caught_contamination_rate` (lower bound, never silent) and `retrieval_free_baseline_forecast` (E4).

The Tavily date-window finding and the Wayback throttle/gate were each preceded by a small off-repo investigation; the probe and analyzer scripts in `scripts/` are committed for reproducibility.

Closes #5.
Test plan
* `pytest bioscancast/tests/`: 267 passed, 2 skipped (live).
* Live smoke test (`scripts/test_historical_replay.py`) against q1 (H5N1 US, cutoff 2025-02-17): dashboards Wayback-rewritten to Feb 7/15 2025, 20/20 Tavily pre-cutoff hit rate per sub-query, no post-cutoff leaks, ~49 s wall-clock.
* `as_of_date=None` paths are unchanged. The 252 pre-existing tests still pass, which covers the live-mode regression surface; the 15 new tests cover historical-mode-specific behavior.