Determinism and reproducibility hardening #6

@MaxGhenis

Description

Problem

The current benchmark mixes random-seeded scenario generation with stochastic LLM generation (temperature=1.0, n_runs=10 per scenario) and unpinned dependencies. That gives us variance estimates per cell, but it makes the benchmark itself non-reproducible: re-running the same suite a month later — same code, same prompts — gives different numbers, and we can't tell whether a difference is the model drifting, the SDK changing, the scenario distribution shifting, or noise.

For a benchmark whose value is tracking model performance over time, the runs should be (mostly) reproducible: same inputs + same model snapshot → same outputs, up to provider-side API jitter we can't control.

Concrete sources of nondeterminism today

  1. Model aliases, not snapshots. config.py lists "gpt-4o-mini", "gemini-1.5-flash" — these resolve to whatever each provider currently points the alias at. Snapshots like gpt-4o-mini-2024-07-18 and gemini-1.5-flash-002 are stable.
  2. Unpinned Python deps. pyproject.toml declares numpy, pandas, edsl, policyengine-us without version bounds. A policyengine-us change to e.g. SNAP rules will silently change the ground truth.
  3. temperature=1.0. Hardcoded in llm_estimator.py:93. With temperature=0 (and OpenAI's seed parameter, plus Gemini's equivalent where supported), variance from the LLM side approaches zero on supporting providers, and n_runs=10 becomes mostly redundant.
  4. No response cache. Every benchmark run re-pays API cost even for unchanged (model, scenario) pairs. A SHA-keyed cache (model_id + prompt + params → response) makes re-runs free and makes "what actually changed" directly diffable.
  5. Scenarios live in code, not data. households.generate_scenarios is seeded with RANDOM_SEED=42, so they're stable as long as the generator function doesn't change — but if it does, the "same seed" produces a different population without anyone noticing.
  6. No provenance recorded. The output CSV stores model, scenario_index, ground_truth, etc., but not edsl version, policyengine-us version, model snapshot strings, scenario hash, or total API calls. Hard to attribute drift between runs.
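The SHA-keyed cache in point 4 could key on a canonical serialization of the request, so that the same (model, prompt, params) triple always maps to the same file regardless of dict ordering. A minimal sketch — the cache location and function names here are hypothetical, not from the repo:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".cache/responses")  # hypothetical location

def cache_key(model_id: str, prompt: str, params: dict) -> str:
    """SHA-256 over a canonical (sorted-key, whitespace-free) JSON request."""
    canonical = json.dumps(
        {"model": model_id, "prompt": prompt, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def cached_call(model_id: str, prompt: str, params: dict, call_fn):
    """Return the cached response if present; otherwise call the API and store it."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / f"{cache_key(model_id, prompt, params)}.json"
    if path.exists():
        return json.loads(path.read_text())["response"]
    response = call_fn(model_id, prompt, params)
    path.write_text(json.dumps({"response": response}))
    return response
```

Because the key is content-addressed, re-running an unchanged suite hits the cache for every cell, and a changed key pinpoints exactly which request changed.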

Suggested changes (smallest → largest)

  • Pin model snapshots in config.MODELS and document the deprecation horizon (each provider deprecates snapshots on a rolling schedule).
  • Pin all deps in pyproject.toml with >= floors and < ceilings, or commit a uv.lock / requirements.txt.
  • Set temperature=0 and pass provider-specific seeds where available (seed=42 on OpenAI, etc.). Drop n_runs to 1 by default; keep n_runs > 1 as an opt-in for variance studies.
  • Materialize scenarios: regenerate once with the seed, write scenarios.json to the repo, load from disk in main.py. Optional: also commit the generator commit hash so we know how it was produced.
  • Add a response cache (e.g., a JSON or SQLite-keyed-by-sha256-of-canonical-request store). Bonus: makes the benchmark runnable offline against the cache.
  • Emit a provenance block in benchmark_output.csv (or a sidecar JSON): edsl version, policyengine-us version, snapshot IDs, scenario file hash, run timestamp, total API calls, total cost.
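The provenance sidecar in the last bullet could be assembled from installed-package metadata plus a hash of the scenario file. A sketch, with the field names assumed rather than taken from the repo:

```python
import hashlib
import json
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path

def package_version(name: str) -> str:
    """Installed version of a distribution, or a marker if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"

def file_sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def provenance_block(scenario_file: Path, model_snapshots: list,
                     api_calls: int, cost_usd: float) -> dict:
    """Everything needed to attribute drift between two benchmark runs."""
    return {
        "edsl_version": package_version("edsl"),
        "policyengine_us_version": package_version("policyengine-us"),
        "model_snapshots": model_snapshots,
        "scenario_file_sha256": (
            file_sha256(scenario_file) if scenario_file.exists() else None
        ),
        "run_timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "total_api_calls": api_calls,
        "total_cost_usd": cost_usd,
    }

# Written next to benchmark_output.csv as a sidecar, e.g.:
# Path("benchmark_output.provenance.json").write_text(
#     json.dumps(provenance_block(Path("scenarios.json"),
#                                 MODELS, n_calls, cost), indent=2))
```

With this in place, any metric change between runs can be diffed against the sidecar first: if the package versions, snapshot IDs, and scenario hash all match, the remaining delta is model-side.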

Reference

talkie-evals does this for an LM evaluation suite — pins all model HF revisions, dataset revisions, the lm-evaluation-harness task YAMLs, the Modal image's pip packages, and the sample seed. Every result JSON contains the full provenance block (talkie_evals_version, talkie_git_revision, model_revisions, dataset revisions, modal_pip_packages). Same pattern would port directly here.

Out of scope

  • Replacing edsl. Worth a separate discussion (inspect_ai / lm-evaluation-harness / direct litellm wrapper) but orthogonal to determinism.
  • Switching from free-text $-amount answers to bucketed multiple-choice — separate metric design discussion.
