
Add judge-prompt registry with per-task defaults#40

Open
alexrs-cohere wants to merge 1 commit into OpenEuroLLM:main from alexrs-cohere:feat/judge-prompt-registry

Conversation


@alexrs-cohere alexrs-cohere commented Apr 29, 2026

Summary

Introduces a small named registry of judge prompts in judgearena.prompts.registry so every benchmark gets a sensible default that is also recorded by hash in the run metadata. Four bundled presets:

  • default — score-only system + user pair (used by alpaca-eval, arena-hard-v0.1, arena-hard-v2.0, m-arena-hard*).
  • default_with_explanation — equivalent to today's --provide_explanation.
  • fluency — the inline fluency system prompt that previously lived inside generate_and_evaluate.py, paired with the default user template.
  • fastchat-pairwise — name-only delegate for the MT-Bench pipeline, whose category-aware prompts continue to be selected by fastchat_compat._select_prompt.

TASK_DEFAULT_PRESET plus the prefix-aware default_preset_for_task decide which preset a task gets when the user does not pass --judge_prompt_preset.
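
To make the mapping concrete, here is a minimal sketch of how the registry and the prefix-aware default lookup could fit together. The preset names and task prefixes come from the list above; the dataclass fields and exact table contents are illustrative assumptions, not the actual implementation in judgearena.prompts.registry.

```python
# Illustrative sketch only; field names and table contents are assumptions.
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class JudgePromptPreset:
    name: str
    system_template: str | None  # None for name-only delegates such as fastchat-pairwise
    user_template: str | None


PRESETS: dict[str, JudgePromptPreset] = {
    "default": JudgePromptPreset("default", "<score-only system prompt>", "<default user template>"),
    "default_with_explanation": JudgePromptPreset("default_with_explanation", "<system prompt with explanation>", "<default user template>"),
    "fluency": JudgePromptPreset("fluency", "<fluency system prompt>", "<default user template>"),
    "fastchat-pairwise": JudgePromptPreset("fastchat-pairwise", None, None),
}

# Exact-match defaults, consulted before any prefix rule.
TASK_DEFAULT_PRESET: dict[str, str] = {
    "alpaca-eval": "default",
    "arena-hard-v0.1": "default",
    "arena-hard-v2.0": "default",
}


def default_preset_for_task(task: str) -> str:
    """Preset used when the user does not pass --judge_prompt_preset."""
    if task in TASK_DEFAULT_PRESET:
        return TASK_DEFAULT_PRESET[task]
    if task.startswith("m-arena-hard"):  # prefix-aware fallback for m-arena-hard*
        return "default"
    if task.startswith("fluency"):
        return "fluency"
    return "default"
```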

Three new CLI flags select the prompt:

  • --judge_prompt_preset NAME — pick a registered preset.
  • --judge_system_prompt_file / --judge_user_prompt_file — fully custom prompts; when set, the file overrides take precedence over the preset.

--provide_explanation is preserved and is now equivalent to
--judge_prompt_preset default_with_explanation.
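
Continuing the sketch above, a hedged outline of the intended precedence; the attribute names on cli_args are assumed to mirror the flag names and may differ from the actual CliArgs fields.

```python
# Precedence sketch: explicit preset > --provide_explanation alias > per-task default.
# File overrides, when given, replace the selected preset's templates.
from __future__ import annotations


def _select_preset_name(task: str, cli_args) -> str:
    if getattr(cli_args, "judge_prompt_preset", None):
        return cli_args.judge_prompt_preset
    if getattr(cli_args, "provide_explanation", False):
        return "default_with_explanation"  # preserved alias
    # default_preset_for_task as in the registry sketch above.
    return default_preset_for_task(task)


def _file_overrides(cli_args) -> tuple[str | None, str | None]:
    # These paths take precedence over the selected preset when set.
    return (
        getattr(cli_args, "judge_system_prompt_file", None),
        getattr(cli_args, "judge_user_prompt_file", None),
    )
```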

judgearena/evaluate.py::resolve_judge_prompts becomes a thin backward-compatible shim around the registry, and a new resolve_run_judge_prompt(task, cli_args) returns the full ResolvedJudgePrompt (including SHAs/paths) so follow-up PRs can fold that hash into cache keys and run metadata.

The CLI dispatcher forwards the three new fields through to both CliArgs and CliEloArgs. generate_and_evaluate.py is intentionally left unchanged in this PR: its existing resolve_judge_prompts call goes through the new shim and continues to behave exactly as before — a follow-up PR will switch the call site to the task-aware resolve_run_judge_prompt and drop the now-redundant inline fluency system prompt.
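
For reference, a rough sketch of the shape ResolvedJudgePrompt could have; the field names below are assumptions based on the description (resolved templates plus SHAs and paths), not the actual dataclass.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ResolvedJudgePrompt:
    # Field names are illustrative; the PR only states that the object
    # carries the resolved templates together with their SHAs and paths.
    preset_name: str                  # e.g. "default", "fluency", "fastchat-pairwise"
    system_prompt: str | None         # None for name-only delegates
    user_prompt: str | None
    system_prompt_path: Path | None   # set when the template came from a file
    user_prompt_path: Path | None
    system_prompt_sha256: str | None  # hex digest of the verbatim template text
    user_prompt_sha256: str | None
```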

Why

Today the judge prompt is hard-coded to two on-disk templates and the fluency variant lives as an inline system_prompt = """...""" string in generate_and_evaluate.py. There is no way to:

  1. record which prompt a run actually used (the metadata bundle stores only SHA-256 hashes after the fact);
  2. swap prompts per task without editing the library;
  3. carry a reproducible reference into a re-run.

This PR is the smallest piece of plumbing that gives us a named registry + per-task default mapping + CLI surface; subsequent PRs in the reproducibility-hardening stack will fold the resolved prompt SHA into cache keys and write the verbatim templates next to each run.
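
As a sketch of what folding the resolved prompt SHA into cache keys could look like in that follow-up, assuming a simple concatenate-and-hash scheme (the real cache-key format is not defined in this PR):

```python
import hashlib


def prompt_sha256(template_text: str) -> str:
    """SHA-256 of the verbatim template text, as recorded in the run metadata."""
    return hashlib.sha256(template_text.encode("utf-8")).hexdigest()


def judge_cache_key(task: str, judge_model: str, system_prompt: str, user_prompt: str) -> str:
    # Hypothetical cache-key layout: any change to either prompt changes the key.
    parts = [task, judge_model, prompt_sha256(system_prompt), prompt_sha256(user_prompt)]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
```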

Test plan

  • uv run pytest -q
  • tests/test_prompt_registry.py covers task defaults, explicit presets, the provide_explanation alias, file overrides winning over presets (see the sketch after this list), the fastchat-pairwise delegate, and error paths.
  • tests/test_cli.py::test_judge_prompt_preset_flag_is_forwarded verifies the new flag survives CLI dispatch.
  • CI runs the full suite.
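
In the same spirit as the registry tests, a minimal illustration of the "file overrides win over presets" case; the attribute names on the stand-in cli_args object are assumptions, not the real CliArgs fields.

```python
from types import SimpleNamespace

from judgearena.evaluate import resolve_run_judge_prompt  # module path named in this PR


def test_file_overrides_win_over_preset(tmp_path):
    system_file = tmp_path / "system.txt"
    user_file = tmp_path / "user.txt"
    system_file.write_text("custom system prompt")
    user_file.write_text("custom user prompt")

    cli_args = SimpleNamespace(
        judge_prompt_preset="default",              # would normally select the preset
        judge_system_prompt_file=str(system_file),  # but the file overrides win
        judge_user_prompt_file=str(user_file),
        provide_explanation=False,
    )

    resolved = resolve_run_judge_prompt("alpaca-eval", cli_args)
    assert resolved.system_prompt == "custom system prompt"
    assert resolved.user_prompt == "custom user prompt"
```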

@geoalgo geoalgo requested a review from ErlisLushtaku April 29, 2026 16:21
Comment thread on judgearena/evaluate.py:
)


def resolve_run_judge_prompt(task: str, cli_args) -> ResolvedJudgePrompt:
Collaborator:
This looks like the intended single resolution path for the new CLI prompt options, but I can only find it referenced from the new unit test, not from production code.

Author:
Yes! I'm working on a PR on top of this one that uses this across multiple files (generate_and_evaluate.py, estimate_elo_ratings.py, generate.py).

Collaborator:
PR #32 also rewrites this area, but it threads a richer resolved-prompt object with parser-mode metadata through the judging pipeline. This PR adds a second prompt-resolution abstraction in the same file. Can we unify on one resolver / one return type before merging? Otherwise this will conflict both textually and semantically with #32.

Author:
You have more context on the codebase, so I'll let you decide what the best approach is!

Author:
To provide some more context, the follow-up PR I'm working on passes the prompt and all the generation and sampling parameters to the different models (ChatVLLM, ChatOpenAI, etc.) and writes them to the metadata.
