Keep DeepSeek out of public leaderboard until benchmark requests complete reliably #4

@MaxGhenis

DeepSeek is configured only as an experimental/manual model path for now, not as a default benchmark model or public pending leaderboard entry.

Reason for exclusion from the public leaderboard:

  • Credentials work, and a toy JSON request sent directly to DeepSeek Flash succeeds.
  • A real PolicyBench one-household, one-output deepseek-v4-flash smoke completed, but took about 109 seconds and used 8,207 reasoning tokens for one requested output.
  • deepseek-v4-pro did not finish the same one-household, one-output smoke within 150 seconds.
  • Earlier full multi-output PolicyBench attempts timed out or hung.
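
To put the flash smoke numbers in context, here is a rough back-of-the-envelope extrapolation. It is only a sketch: it assumes requests run sequentially and that cost scales linearly with the number of outputs, neither of which the measurements above confirm. The function name and the `outputs_per_household` parameter are illustrative, not part of PolicyBench.

```python
# Figures taken from the deepseek-v4-flash one-household, one-output smoke above.
SECONDS_PER_OUTPUT = 109
REASONING_TOKENS_PER_OUTPUT = 8_207


def extrapolate(households: int, outputs_per_household: int) -> dict:
    """Scale the single-output smoke figures linearly (assumes sequential requests)."""
    n_outputs = households * outputs_per_household
    total_seconds = n_outputs * SECONDS_PER_OUTPUT
    return {
        "outputs": n_outputs,
        "hours": total_seconds / 3600,
        "reasoning_tokens": n_outputs * REASONING_TOKENS_PER_OUTPUT,
    }


# 100 households with a single output each: about 3 hours of wall-clock time
# and 820,700 reasoning tokens under the assumptions stated above.
estimate = extrapolate(households=100, outputs_per_household=1)
print(estimate)
```

A real multi-output run would multiply these figures by the number of outputs per household unless outputs are batched into one request or requests run in parallel.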

Current repo policy:

  • DeepSeek remains in EXPERIMENTAL_MODELS so it can be manually probed if the API behavior improves.
  • DeepSeek is excluded from default runs, app model metadata, and public pending-model messaging.
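
One way this policy could be expressed in code is sketched below. Only the name `EXPERIMENTAL_MODELS` comes from this issue; `DEFAULT_MODELS`, `resolve_models`, `public_models`, and the placeholder default model IDs are hypothetical and may not match the repo's actual layout.

```python
# Placeholder IDs standing in for whatever the default benchmark models are.
DEFAULT_MODELS = ["example-model-a", "example-model-b"]

# Models that can be probed manually but never run or shown by default.
EXPERIMENTAL_MODELS = ["deepseek-v4-flash", "deepseek-v4-pro"]


def resolve_models(requested=None):
    """Return the models to run: defaults unless the caller names models explicitly."""
    if requested is None:
        return list(DEFAULT_MODELS)
    known = set(DEFAULT_MODELS) | set(EXPERIMENTAL_MODELS)
    unknown = [m for m in requested if m not in known]
    if unknown:
        raise ValueError(f"Unknown model(s): {unknown}")
    return list(requested)


def public_models():
    """Models eligible for app metadata and pending-model messaging."""
    return list(DEFAULT_MODELS)
```

With this split, an experimental model runs only when requested by name, for example `resolve_models(["deepseek-v4-flash"])`, and never appears in `public_models()`.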

Acceptance criteria before adding it to the public leaderboard:

  • Complete a representative multi-output smoke without manual repair or process intervention.
  • Produce parseable JSON with nonempty explanations.
  • Have latency and token usage that make a 100-household run practical.
  • Pass the same analysis/rebuild path as other public leaderboard models.
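
The "parseable JSON with nonempty explanations" criterion could be checked automatically along these lines. This is a minimal sketch that assumes each response is a JSON object keyed by output name, with an `explanation` string per entry. That schema, the function name, and the output names in the usage note are assumptions for illustration, not the confirmed PolicyBench format.

```python
import json


def check_response(raw: str, expected_outputs: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the response passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"unparseable JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    problems = []
    for name in expected_outputs:
        entry = data.get(name)
        if not isinstance(entry, dict):
            problems.append(f"{name}: missing or not an object")
            continue
        explanation = entry.get("explanation")
        # Whitespace-only explanations count as empty.
        if not isinstance(explanation, str) or not explanation.strip():
            problems.append(f"{name}: empty or missing explanation")
    return problems
```

For example, `check_response('{"output_a": {"value": 1, "explanation": " "}}', ["output_a"])` reports an empty explanation, while a response with a real explanation for every expected output returns an empty list.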
