DeepSeek is configured only as an experimental/manual model path for now, not as a default benchmark model or public pending leaderboard entry.
Reasons for exclusion from the public leaderboard:
- Credentials work, and a raw DeepSeek Flash toy JSON call succeeds.
- A real PolicyBench one-household, one-output `deepseek-v4-flash` smoke completed, but took about 109 seconds and used 8,207 reasoning tokens for one requested output.
- `deepseek-v4-pro` did not finish the same one-household, one-output smoke within 150 seconds.
- Earlier full multi-output PolicyBench attempts timed out or hung.
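To put the single smoke measurement in context, the sketch below extrapolates it to a 100-household run. It is a rough estimate only: it assumes requests run sequentially and that per-output latency and reasoning-token usage stay near the one observed `deepseek-v4-flash` data point. The function name `extrapolate` and the `outputs_per_household` parameter are illustrative and not part of the repo.

```python
# Back-of-the-envelope extrapolation from the single smoke run above.
# Assumptions (not measured): sequential requests and a roughly constant
# per-output cost. `outputs_per_household` is a hypothetical parameter.

SECONDS_PER_OUTPUT = 109             # observed deepseek-v4-flash smoke latency
REASONING_TOKENS_PER_OUTPUT = 8_207  # observed reasoning tokens for one output
HOUSEHOLDS = 100


def extrapolate(outputs_per_household: int) -> dict:
    """Scale the one-output measurement to a full 100-household run."""
    total_outputs = HOUSEHOLDS * outputs_per_household
    total_seconds = total_outputs * SECONDS_PER_OUTPUT
    return {
        "total_outputs": total_outputs,
        "hours": total_seconds / 3600,
        "reasoning_tokens": total_outputs * REASONING_TOKENS_PER_OUTPUT,
    }


estimate = extrapolate(outputs_per_household=1)
print(f"{estimate['hours']:.1f} h, {estimate['reasoning_tokens']:,} reasoning tokens")
# prints "3.0 h, 820,700 reasoning tokens"
```

Even with only one output per household, this comes to roughly three hours of wall-clock time; a multi-output run would scale that figure by the number of outputs requested.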
Current repo policy:
- DeepSeek remains in `EXPERIMENTAL_MODELS` so it can be manually probed if the API behavior improves.
- DeepSeek is excluded from default runs, app model metadata, and public pending-model messaging.
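One way to picture this policy is a registry where experimental models are selectable only by explicit request. This is a minimal sketch, not the repo's actual code: only the name `EXPERIMENTAL_MODELS` and the two DeepSeek model IDs come from this note, while `DEFAULT_MODELS`, its placeholder entry, and `resolve_models` are hypothetical.

```python
# Hypothetical layout. Only EXPERIMENTAL_MODELS and the DeepSeek model IDs
# appear in this note; everything else is an illustrative assumption.
DEFAULT_MODELS = ("example-default-model",)  # placeholder, not a real model ID
EXPERIMENTAL_MODELS = ("deepseek-v4-flash", "deepseek-v4-pro")


def resolve_models(requested=None):
    """Return the models for a run: defaults unless models are named explicitly."""
    if not requested:
        # Default runs never include experimental models.
        return list(DEFAULT_MODELS)
    known = set(DEFAULT_MODELS) | set(EXPERIMENTAL_MODELS)
    unknown = [model for model in requested if model not in known]
    if unknown:
        raise ValueError(f"Unknown model(s): {unknown}")
    return list(requested)
```

Under this sketch, `resolve_models()` returns only the defaults, while `resolve_models(["deepseek-v4-flash"])` opts in to a manual probe.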
Acceptance criteria before adding it to the public leaderboard:
- Complete a representative multi-output smoke without manual repair or process intervention.
- Produce parseable JSON with nonempty explanations.
- Have latency and token usage that make a 100-household run practical.
- Pass the same analysis/rebuild path as other public leaderboard models.
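The parseability and latency criteria could be checked mechanically along the following lines. This is a sketch under stated assumptions: the response shape (a JSON object mapping each output name to an object with an `explanation` field), the function `check_smoke_response`, and the `max_seconds` budget are all hypothetical, since this note does not specify a schema or a latency threshold.

```python
import json


def check_smoke_response(raw_text, elapsed_seconds, max_seconds):
    """Return a list of failed criteria; an empty list means the checks passed.

    Assumes (hypothetically) that the response is a JSON object mapping each
    output name to an object with an "explanation" string.
    """
    failures = []
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError:
        return ["response is not parseable JSON"]

    if not isinstance(payload, dict) or not payload:
        failures.append("expected a nonempty JSON object of outputs")
    else:
        for name, entry in payload.items():
            explanation = entry.get("explanation") if isinstance(entry, dict) else None
            if not isinstance(explanation, str) or not explanation.strip():
                failures.append(f"{name}: missing or empty explanation")

    # max_seconds is a caller-chosen budget, not a documented threshold.
    if elapsed_seconds > max_seconds:
        failures.append(f"took {elapsed_seconds:.0f}s, budget is {max_seconds:.0f}s")
    return failures
```

For example, a response such as `{"income_tax": {"value": 1200, "explanation": "Computed from wages."}}` (with `income_tax` as a made-up output name) that arrives within budget yields an empty failure list, while an empty explanation or a slow response each add one entry.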