
Add judge-prompt registry with per-task defaults#40

Open
alexrs-cohere wants to merge 1 commit into OpenEuroLLM:main from alexrs-cohere:feat/judge-prompt-registry

Conversation


@alexrs-cohere alexrs-cohere commented Apr 29, 2026

Summary

Introduces a small named registry of judge prompts in judgearena.prompts.registry so every benchmark gets a sensible default that is also recorded by hash in the run metadata. Four bundled presets:

  • default — score-only system + user pair (used by alpaca-eval, arena-hard-v0.1, arena-hard-v2.0, m-arena-hard*).
  • default_with_explanation — equivalent to today's --provide_explanation.
  • fluency — the inline fluency system prompt that previously lived inside generate_and_evaluate.py, paired with the default user template.
  • fastchat-pairwise — name-only delegate for the MT-Bench pipeline, whose category-aware prompts continue to be selected by fastchat_compat._select_prompt.

TASK_DEFAULT_PRESET plus the prefix-aware default_preset_for_task decide which preset a task gets when the user does not pass --judge_prompt_preset.
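
To make the mapping concrete, here is a minimal sketch of how the registry and the prefix-aware default lookup could fit together. The preset names and task prefixes come from the list above; the dataclass fields and exact table contents are illustrative assumptions, not the actual implementation in judgearena.prompts.registry.

```python
# Illustrative sketch only; field names and table contents are assumptions.
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class JudgePromptPreset:
    name: str
    system_template: str | None  # None for name-only delegates such as fastchat-pairwise
    user_template: str | None


PRESETS: dict[str, JudgePromptPreset] = {
    "default": JudgePromptPreset("default", "<score-only system prompt>", "<default user template>"),
    "default_with_explanation": JudgePromptPreset("default_with_explanation", "<system prompt with explanation>", "<default user template>"),
    "fluency": JudgePromptPreset("fluency", "<fluency system prompt>", "<default user template>"),
    "fastchat-pairwise": JudgePromptPreset("fastchat-pairwise", None, None),
}

# Exact-match defaults, consulted before any prefix rule.
TASK_DEFAULT_PRESET: dict[str, str] = {
    "alpaca-eval": "default",
    "arena-hard-v0.1": "default",
    "arena-hard-v2.0": "default",
}


def default_preset_for_task(task: str) -> str:
    """Preset used when the user does not pass --judge_prompt_preset."""
    if task in TASK_DEFAULT_PRESET:
        return TASK_DEFAULT_PRESET[task]
    if task.startswith("m-arena-hard"):  # prefix-aware fallback for m-arena-hard*
        return "default"
    if task.startswith("fluency"):
        return "fluency"
    return "default"
```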

Three new CLI flags select the prompt:

  • --judge_prompt_preset NAME — pick a registered preset.
  • --judge_system_prompt_file / --judge_user_prompt_file — fully custom prompts; when set, the file overrides take precedence over the preset.

--provide_explanation is preserved and is now equivalent to
--judge_prompt_preset default_with_explanation.
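
Continuing the sketch above, a hedged outline of the intended precedence; the attribute names on cli_args are assumed to mirror the flag names and may differ from the actual CliArgs fields.

```python
# Precedence sketch: explicit preset > --provide_explanation alias > per-task default.
# File overrides, when given, replace the selected preset's templates.
from __future__ import annotations


def _select_preset_name(task: str, cli_args) -> str:
    if getattr(cli_args, "judge_prompt_preset", None):
        return cli_args.judge_prompt_preset
    if getattr(cli_args, "provide_explanation", False):
        return "default_with_explanation"  # preserved alias
    # default_preset_for_task as in the registry sketch above.
    return default_preset_for_task(task)


def _file_overrides(cli_args) -> tuple[str | None, str | None]:
    # These paths take precedence over the selected preset when set.
    return (
        getattr(cli_args, "judge_system_prompt_file", None),
        getattr(cli_args, "judge_user_prompt_file", None),
    )
```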

judgearena/evaluate.py::resolve_judge_prompts becomes a thin backward-compatible shim around the registry, and a new resolve_run_judge_prompt(task, cli_args) returns the full ResolvedJudgePrompt (including SHAs/paths) so follow-up PRs can fold that hash into cache keys and run metadata.

The CLI dispatcher forwards the three new fields through to both CliArgs and CliEloArgs. generate_and_evaluate.py is intentionally left unchanged in this PR: its existing resolve_judge_prompts call goes through the new shim and continues to behave exactly as before — a follow-up PR will switch the call site to the task-aware resolve_run_judge_prompt and drop the now-redundant inline fluency system prompt.
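
For reference, a rough sketch of the shape ResolvedJudgePrompt could have; the field names below are assumptions based on the description (resolved templates plus SHAs and paths), not the actual dataclass.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ResolvedJudgePrompt:
    # Field names are illustrative; the PR only states that the object
    # carries the resolved templates together with their SHAs and paths.
    preset_name: str                  # e.g. "default", "fluency", "fastchat-pairwise"
    system_prompt: str | None         # None for name-only delegates
    user_prompt: str | None
    system_prompt_path: Path | None   # set when the template came from a file
    user_prompt_path: Path | None
    system_prompt_sha256: str | None  # hex digest of the verbatim template text
    user_prompt_sha256: str | None
```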

Why

Today the judge prompt is hard-coded to two on-disk templates and the fluency variant lives as an inline system_prompt = """...""" string in generate_and_evaluate.py. There is no way to:

  1. record which prompt a run actually used (the metadata bundle stores only SHA-256 hashes after the fact);
  2. swap prompts per task without editing the library;
  3. carry a reproducible reference into a re-run.

This PR is the smallest piece of plumbing that gives us a named registry + per-task default mapping + CLI surface; subsequent PRs in the reproducibility-hardening stack will fold the resolved prompt SHA into cache keys and write the verbatim templates next to each run.
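
As a sketch of what folding the resolved prompt SHA into cache keys could look like in that follow-up, assuming a simple concatenate-and-hash scheme (the real cache-key format is not defined in this PR):

```python
import hashlib


def prompt_sha256(template_text: str) -> str:
    """SHA-256 of the verbatim template text, as recorded in the run metadata."""
    return hashlib.sha256(template_text.encode("utf-8")).hexdigest()


def judge_cache_key(task: str, judge_model: str, system_prompt: str, user_prompt: str) -> str:
    # Hypothetical cache-key layout: any change to either prompt changes the key.
    parts = [task, judge_model, prompt_sha256(system_prompt), prompt_sha256(user_prompt)]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
```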

Test plan

  • uv run pytest -q
  • tests/test_prompt_registry.py covers task defaults, explicit presets, the provide_explanation alias, file overrides winning over presets (see the sketch after this list), the fastchat-pairwise delegate, and error paths.
  • tests/test_cli.py::test_judge_prompt_preset_flag_is_forwarded verifies the new flag survives CLI dispatch.
  • CI runs the full suite.
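
In the same spirit as the registry tests, a minimal illustration of the "file overrides win over presets" case; the attribute names on the stand-in cli_args object are assumptions, not the real CliArgs fields.

```python
from types import SimpleNamespace

from judgearena.evaluate import resolve_run_judge_prompt  # module path named in this PR


def test_file_overrides_win_over_preset(tmp_path):
    system_file = tmp_path / "system.txt"
    user_file = tmp_path / "user.txt"
    system_file.write_text("custom system prompt")
    user_file.write_text("custom user prompt")

    cli_args = SimpleNamespace(
        judge_prompt_preset="default",              # would normally select the preset
        judge_system_prompt_file=str(system_file),  # but the file overrides win
        judge_user_prompt_file=str(user_file),
        provide_explanation=False,
    )

    resolved = resolve_run_judge_prompt("alpaca-eval", cli_args)
    assert resolved.system_prompt == "custom system prompt"
    assert resolved.user_prompt == "custom user prompt"
```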

@geoalgo geoalgo requested a review from ErlisLushtaku April 29, 2026 16:21
Comment thread on judgearena/evaluate.py:
)


def resolve_run_judge_prompt(task: str, cli_args) -> ResolvedJudgePrompt:
Collaborator:
This looks like the intended single resolution path for the new CLI prompt options, but I can only find it referenced from the new unit test, not from production code.

Author:
Yes! I'm working on a PR on top of this one that uses this across multiple files (generate_and_evaluate.py, estimate_elo_ratings.py, generate.py).

Collaborator:
PR #32 also rewrites this area, but it threads a richer resolved-prompt object with parser-mode metadata through the judging pipeline. This PR adds a second prompt-resolution abstraction in the same file. Can we unify on one resolver / one return type before merging? Otherwise this will conflict both textually and semantically with #32.

Author:
You have more context on the codebase, so I'll let you decide what the best approach is!

Author:
To provide some more context, the follow-up PR I'm working on passes the prompt and all the generation and sampling parameters to the different models (ChatVLLM, ChatOpenAI, etc.) and writes them to the metadata.
