Problem
The current benchmark mixes random-seeded scenario generation with stochastic LLM generation (temperature=1.0, n_runs=10 per scenario) and unpinned dependencies. That gives us variance estimates per cell, but it makes the benchmark itself non-reproducible: re-running the same suite a month later — same code, same prompts — gives different numbers, and we can't tell whether a difference is the model drifting, the SDK changing, the scenario distribution shifting, or noise.
For a benchmark whose value is tracking model performance over time, the runs should be (mostly) reproducible: same inputs + same model snapshot → same outputs, up to provider-side API jitter we can't control.
Concrete sources of nondeterminism today
Model aliases, not snapshots. config.py lists "gpt-4o-mini", "gemini-1.5-flash" — these resolve to whatever each provider currently points the alias at. Snapshots like gpt-4o-mini-2024-07-18 and gemini-1.5-flash-002 are stable.
Unpinned Python deps. pyproject.toml declares numpy, pandas, edsl, policyengine-us without version bounds. A policyengine-us change to, say, the SNAP rules will silently change the ground truth.
temperature=1.0. Hardcoded in llm_estimator.py:93. With temperature=0 (and OpenAI's seed parameter, plus Gemini's equivalent where supported), variance from the LLM side approaches zero on supporting providers, and n_runs=10 becomes mostly redundant (see the sketch after this list).
No response cache. Every benchmark run re-pays API cost even for unchanged (model, scenario) pairs. A SHA-keyed cache (model_id + prompt + params → response) makes re-runs free and makes "what actually changed" diff'able.
Scenarios live in code, not data. households.generate_scenarios is seeded with RANDOM_SEED=42, so the scenarios are stable as long as the generator function doesn't change — but if it does, the "same seed" produces a different population without anyone noticing.
No provenance recorded. The output CSV stores model, scenario_index, ground_truth, etc., but not edsl version, policyengine-us version, model snapshot strings, scenario hash, or total API calls. Hard to attribute drift between runs.
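To make the snapshot and determinism knobs concrete, here is a minimal sketch written against the raw OpenAI SDK rather than edsl (whether and how edsl forwards seed would need checking); the prompt text is a placeholder:

```python
# Sketch of a deterministic request: pinned snapshot, temperature=0, fixed seed.
# The seed is best-effort on OpenAI's side; system_fingerprint in the response
# reports whether the serving stack changed between runs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini-2024-07-18",   # snapshot, not the "gpt-4o-mini" alias
    temperature=0,
    seed=42,                          # best-effort determinism
    messages=[{"role": "user", "content": "Estimate the SNAP benefit for ..."}],
)

print(response.choices[0].message.content)
print(response.system_fingerprint)    # changes when the backend changes
```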
Suggested changes (smallest → largest)
Pin model snapshots in config.MODELS (first sketch after this list) and document the deprecation horizon (each provider deprecates snapshots on a rolling schedule).
Pin all deps in pyproject.toml with >= floors and < ceilings, or commit a uv.lock / requirements.txt.
Set temperature=0 and pass provider-specific seeds where available (seed=42 on OpenAI, etc.). Drop n_runs to 1 by default; keep n_runs > 1 as an opt-in for variance studies.
Materialize scenarios: regenerate once with the seed, write scenarios.json to the repo, load from disk in main.py (sketched below). Optional: also commit the generator commit hash so we know how it was produced.
Add a response cache (e.g., a JSON or SQLite store keyed by the sha256 of the canonical request; sketch below). Bonus: it makes the benchmark runnable offline against the cache.
Emit a provenance block in benchmark_output.csv (or a sidecar JSON): edsl version, policyengine-us version, snapshot IDs, scenario file hash, run timestamp, total API calls, total cost.
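A sketch of the first item, assuming config.MODELS is (or can be made) a flat list of model-id strings; the actual shape of the constant may differ:

```python
# config.py (sketch): pin snapshots, keep the alias only as a comment.
# Check each provider's deprecation page when bumping these strings.
MODELS = [
    "gpt-4o-mini-2024-07-18",  # alias: gpt-4o-mini
    "gemini-1.5-flash-002",    # alias: gemini-1.5-flash
]
```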
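For the scenario materialization item, a sketch of the one-time script; where RANDOM_SEED lives and the exact generate_scenarios signature are assumptions, and if it returns objects rather than dicts they would need serializing first:

```python
# materialize_scenarios.py (sketch): run once, commit scenarios.json,
# then have main.py load the file instead of calling the generator.
import json

from config import RANDOM_SEED            # assumed location; 42 today
from households import generate_scenarios

scenarios = generate_scenarios(seed=RANDOM_SEED)  # signature assumed

with open("scenarios.json", "w") as f:
    json.dump(scenarios, f, indent=2, sort_keys=True)

# main.py then does:
#   with open("scenarios.json") as f:
#       scenarios = json.load(f)
```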
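For the response cache item, a minimal flat-file sketch (a SQLite table keyed the same way would work equally well); call_fn stands in for whatever currently issues the API request:

```python
# Sketch of a SHA-keyed response cache: one JSON file per canonical request.
# Key = sha256 over a canonical serialization of (model_id, prompt, params),
# so any change to the request misses the cache and triggers a fresh call.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".llm_cache")

def _cache_key(model_id: str, prompt: str, params: dict) -> str:
    canonical = json.dumps(
        {"model": model_id, "prompt": prompt, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def cached_call(model_id: str, prompt: str, params: dict, call_fn):
    """Return the cached response if present, otherwise call the API and store it."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{_cache_key(model_id, prompt, params)}.json"
    if path.exists():
        return json.loads(path.read_text())["response"]
    response = call_fn(model_id, prompt, params)   # the real API call
    path.write_text(json.dumps(
        {"model": model_id, "prompt": prompt, "params": params, "response": response},
        indent=2,
    ))
    return response
```

Because the key covers the snapshot, the prompt, and the parameters, anything that actually changed between runs shows up as a cache miss, which is the diff we want.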
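And for the provenance block, a sketch of a sidecar JSON writer; the field names are suggestions rather than an existing schema, and total calls/cost would come from whatever the run loop already tracks:

```python
# write_provenance.py (sketch): emit a sidecar JSON next to benchmark_output.csv.
# Package versions come from importlib.metadata; the scenario hash is the
# sha256 of the committed scenarios.json.
import hashlib
import json
from datetime import datetime, timezone
from importlib.metadata import version
from pathlib import Path

def write_provenance(path="benchmark_provenance.json", *,
                     model_snapshots, total_api_calls, total_cost_usd):
    provenance = {
        "edsl_version": version("edsl"),
        "policyengine_us_version": version("policyengine-us"),
        "model_snapshots": model_snapshots,  # e.g. ["gpt-4o-mini-2024-07-18", ...]
        "scenario_file_sha256": hashlib.sha256(
            Path("scenarios.json").read_bytes()
        ).hexdigest(),
        "run_timestamp": datetime.now(timezone.utc).isoformat(),
        "total_api_calls": total_api_calls,
        "total_cost_usd": total_cost_usd,
    }
    Path(path).write_text(json.dumps(provenance, indent=2))
```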
Reference
talkie-evals does this for an LM evaluation suite — pins all model HF revisions, dataset revisions, the lm-evaluation-harness task YAMLs, the Modal image's pip packages, and the sample seed. Every result JSON contains the full provenance block (talkie_evals_version, talkie_git_revision, model_revisions, dataset revisions, modal_pip_packages). Same pattern would port directly here.
Out of scope
Replacing edsl. Worth a separate discussion (inspect_ai / lm-evaluation-harness / direct litellm wrapper) but orthogonal to determinism.
Switching from free-text $-amount answers to bucketed multiple-choice — separate metric design discussion.