Problem
The current benchmark mixes random-seeded scenario generation with stochastic LLM generation (temperature=1.0, n_runs=10 per scenario) and unpinned dependencies. That gives us variance estimates per cell, but it makes the benchmark itself non-reproducible: re-running the same suite a month later — same code, same prompts — gives different numbers, and we can't tell whether a difference is the model drifting, the SDK changing, the scenario distribution shifting, or noise.
For a benchmark whose value is tracking model performance over time, the runs should be (mostly) reproducible: same inputs + same model snapshot → same outputs, up to provider-side API jitter we can't control.
Concrete sources of nondeterminism today
Model aliases, not snapshots. config.py lists "gpt-4o-mini", "gemini-1.5-flash" — these resolve to whatever each provider currently points the alias at. Snapshots like gpt-4o-mini-2024-07-18 and gemini-1.5-flash-002 are stable.
Unpinned Python deps. pyproject.toml declares numpy, pandas, edsl, policyengine-us without version bounds. A policyengine-us change to, say, the SNAP rules will silently change the ground truth.
temperature=1.0. Hardcoded in llm_estimator.py:93. With temperature=0 (and OpenAI's seed parameter, plus Gemini's equivalent where supported), variance from the LLM side approaches zero on supporting providers, and n_runs=10 becomes mostly redundant (see the sketch after this list).
No response cache. Every benchmark run re-pays API cost even for unchanged (model, scenario) pairs. A SHA-keyed cache (model_id + prompt + params → response) makes re-runs free and makes "what actually changed" diff'able.
Scenarios live in code, not data. households.generate_scenarios is seeded with RANDOM_SEED=42, so the scenarios are stable as long as the generator function doesn't change — but if it does, the "same seed" produces a different population without anyone noticing.
No provenance recorded. The output CSV stores model, scenario_index, ground_truth, etc., but not edsl version, policyengine-us version, model snapshot strings, scenario hash, or total API calls. Hard to attribute drift between runs.
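To make the snapshot and determinism knobs concrete, here is a minimal sketch written against the raw OpenAI SDK rather than edsl (whether and how edsl forwards seed would need checking); the prompt text is a placeholder:

```python
# Sketch of a deterministic request: pinned snapshot, temperature=0, fixed seed.
# The seed is best-effort on OpenAI's side; system_fingerprint in the response
# reports whether the serving stack changed between runs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini-2024-07-18",   # snapshot, not the "gpt-4o-mini" alias
    temperature=0,
    seed=42,                          # best-effort determinism
    messages=[{"role": "user", "content": "Estimate the SNAP benefit for ..."}],
)

print(response.choices[0].message.content)
print(response.system_fingerprint)    # changes when the backend changes
```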
Suggested changes (smallest → largest)
Pin model snapshots in config.MODELS (first sketch after this list) and document the deprecation horizon (each provider deprecates snapshots on a rolling schedule).
Pin all deps in pyproject.toml with >= floors and < ceilings, or commit a uv.lock / requirements.txt.
Set temperature=0 and pass provider-specific seeds where available (seed=42 on OpenAI, etc.). Drop n_runs to 1 by default; keep n_runs > 1 as an opt-in for variance studies.
Materialize scenarios: regenerate once with the seed, write scenarios.json to the repo, load from disk in main.py (sketched below). Optional: also commit the generator commit hash so we know how it was produced.
Add a response cache (e.g., a JSON or SQLite store keyed by the sha256 of the canonical request; sketch below). Bonus: it makes the benchmark runnable offline against the cache.
Emit a provenance block in benchmark_output.csv (or a sidecar JSON): edsl version, policyengine-us version, snapshot IDs, scenario file hash, run timestamp, total API calls, total cost.
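A sketch of the first item, assuming config.MODELS is (or can be made) a flat list of model-id strings; the actual shape of the constant may differ:

```python
# config.py (sketch): pin snapshots, keep the alias only as a comment.
# Check each provider's deprecation page when bumping these strings.
MODELS = [
    "gpt-4o-mini-2024-07-18",  # alias: gpt-4o-mini
    "gemini-1.5-flash-002",    # alias: gemini-1.5-flash
]
```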
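For the scenario materialization item, a sketch of the one-time script; where RANDOM_SEED lives and the exact generate_scenarios signature are assumptions, and if it returns objects rather than dicts they would need serializing first:

```python
# materialize_scenarios.py (sketch): run once, commit scenarios.json,
# then have main.py load the file instead of calling the generator.
import json

from config import RANDOM_SEED            # assumed location; 42 today
from households import generate_scenarios

scenarios = generate_scenarios(seed=RANDOM_SEED)  # signature assumed

with open("scenarios.json", "w") as f:
    json.dump(scenarios, f, indent=2, sort_keys=True)

# main.py then does:
#   with open("scenarios.json") as f:
#       scenarios = json.load(f)
```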
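For the response cache item, a minimal flat-file sketch (a SQLite table keyed the same way would work equally well); call_fn stands in for whatever currently issues the API request:

```python
# Sketch of a SHA-keyed response cache: one JSON file per canonical request.
# Key = sha256 over a canonical serialization of (model_id, prompt, params),
# so any change to the request misses the cache and triggers a fresh call.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".llm_cache")

def _cache_key(model_id: str, prompt: str, params: dict) -> str:
    canonical = json.dumps(
        {"model": model_id, "prompt": prompt, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def cached_call(model_id: str, prompt: str, params: dict, call_fn):
    """Return the cached response if present, otherwise call the API and store it."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{_cache_key(model_id, prompt, params)}.json"
    if path.exists():
        return json.loads(path.read_text())["response"]
    response = call_fn(model_id, prompt, params)   # the real API call
    path.write_text(json.dumps(
        {"model": model_id, "prompt": prompt, "params": params, "response": response},
        indent=2,
    ))
    return response
```

Because the key covers the snapshot, the prompt, and the parameters, anything that actually changed between runs shows up as a cache miss, which is the diff we want.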
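And for the provenance block, a sketch of a sidecar JSON writer; the field names are suggestions rather than an existing schema, and total calls/cost would come from whatever the run loop already tracks:

```python
# write_provenance.py (sketch): emit a sidecar JSON next to benchmark_output.csv.
# Package versions come from importlib.metadata; the scenario hash is the
# sha256 of the committed scenarios.json.
import hashlib
import json
from datetime import datetime, timezone
from importlib.metadata import version
from pathlib import Path

def write_provenance(path="benchmark_provenance.json", *,
                     model_snapshots, total_api_calls, total_cost_usd):
    provenance = {
        "edsl_version": version("edsl"),
        "policyengine_us_version": version("policyengine-us"),
        "model_snapshots": model_snapshots,  # e.g. ["gpt-4o-mini-2024-07-18", ...]
        "scenario_file_sha256": hashlib.sha256(
            Path("scenarios.json").read_bytes()
        ).hexdigest(),
        "run_timestamp": datetime.now(timezone.utc).isoformat(),
        "total_api_calls": total_api_calls,
        "total_cost_usd": total_cost_usd,
    }
    Path(path).write_text(json.dumps(provenance, indent=2))
```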
Reference
talkie-evals does this for an LM evaluation suite — pins all model HF revisions, dataset revisions, the lm-evaluation-harness task YAMLs, the Modal image's pip packages, and the sample seed. Every result JSON contains the full provenance block (talkie_evals_version, talkie_git_revision, model_revisions, dataset revisions, modal_pip_packages). Same pattern would port directly here.
Out of scope
Replacing edsl. Worth a separate discussion (inspect_ai / lm-evaluation-harness / direct litellm wrapper) but orthogonal to determinism.
Switching from free-text $-amount answers to bucketed multiple-choice — separate metric design discussion.