Add judge-prompt registry with per-task defaults #40

alexrs-cohere wants to merge 1 commit into OpenEuroLLM:main from
Conversation
Introduces a small named registry of judge prompts in `judgearena.prompts.registry` so every benchmark gets a sensible default that is also recorded by hash in the run metadata. The four bundled presets are:

- `default`: score-only system + user pair (used by alpaca-eval, arena-hard-v0.1/v2.0, m-arena-hard*; fluency tasks fall through to `fluency`).
- `default_with_explanation`: equivalent to today's `--provide_explanation`.
- `fluency`: the inline fluency system prompt that previously lived inside `generate_and_evaluate.py`, paired with the default user template.
- `fastchat-pairwise`: name-only delegate for the MT-Bench pipeline, whose category-aware prompts continue to be selected by `fastchat_compat._select_prompt`.

`TASK_DEFAULT_PRESET` plus the prefix-aware `default_preset_for_task` decide which preset a task gets when the user does not pass `--judge_prompt_preset`. Three new CLI flags select the prompt:

- `--judge_prompt_preset NAME`: pick a registered preset.
- `--judge_system_prompt_file` / `--judge_user_prompt_file`: fully custom prompts; both file overrides take precedence over the preset when set.

`--provide_explanation` is preserved and is now equivalent to `--judge_prompt_preset default_with_explanation`.

`judgearena/evaluate.py::resolve_judge_prompts` becomes a thin backward-compatible shim around the registry, and a new `resolve_run_judge_prompt(task, cli_args)` returns the full `ResolvedJudgePrompt` (including SHAs/paths) so future PRs can fold that hash into cache keys and run metadata. The CLI dispatcher forwards the three new fields through to both `CliArgs` and `CliEloArgs`. `generate_and_evaluate.py` is left unchanged: its existing `resolve_judge_prompts` call goes through the new shim and continues to behave exactly as before.

Tests cover the registry resolution rules (task defaults, preset by name, file overrides, `provide_explanation` alias, `fastchat-pairwise` delegation) plus a CLI test that confirms `--judge_prompt_preset` is parsed and forwarded.

Made-with: Cursor
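A minimal sketch of how such a registry with per-task defaults might look. The names `TASK_DEFAULT_PRESET` and `default_preset_for_task` come from the PR description; the table contents and the prefix-matching logic here are assumptions for illustration, not the PR's actual code.

```python
# Illustrative registry of per-task judge-prompt defaults.
# Exact-match table first; prefix rules let task families
# (e.g. m-arena-hard-de, fluency-fi) share one preset.
TASK_DEFAULT_PRESET = {
    "alpaca-eval": "default",
    "arena-hard-v0.1": "default",
    "arena-hard-v2.0": "default",
    "mt-bench": "fastchat-pairwise",  # delegated to the FastChat pipeline
}

TASK_PREFIX_PRESET = [
    ("m-arena-hard", "default"),
    ("fluency", "fluency"),
]


def default_preset_for_task(task: str) -> str:
    """Preset a task falls back to when --judge_prompt_preset is absent."""
    if task in TASK_DEFAULT_PRESET:
        return TASK_DEFAULT_PRESET[task]
    for prefix, preset in TASK_PREFIX_PRESET:
        if task.startswith(prefix):
            return preset
    return "default"
```

The prefix rules keep the table small: any `m-arena-hard-*` language variant resolves without being enumerated explicitly.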
`def resolve_run_judge_prompt(task: str, cli_args) -> ResolvedJudgePrompt:`
This looks like the intended single resolution path for the new CLI prompt options, but I can only find it referenced from the new unit test, not from production code.
Yes! I'm working on a PR on top of this one that uses this across multiple files (generate_and_evaluate.py, estimate_elo_ratings.py, generate.py)
PR #32 also rewrites this area, but it threads a richer resolved-prompt object with parser-mode metadata through the judging pipeline, whereas this PR adds a second prompt-resolution abstraction in the same file. Can we unify on one resolver and one return type before merging? Otherwise this will conflict both textually and semantically with #32.
You have more context on the codebase so I'll let you decide what's the best approach!
To provide some more context, the follow-up PR I'm working on passes the prompt, and all the generation and sampling parameters to the different models (ChatVLLM, ChatOpenAI, etc) and writes them to the metadata.
Summary

`generate_and_evaluate.py` is intentionally left unchanged in this PR: its existing `resolve_judge_prompts` call goes through the new shim and continues to behave exactly as before; a follow-up PR will switch the call site to the task-aware `resolve_run_judge_prompt` and drop the now-redundant inline fluency system prompt.

Why
Today the judge prompt is hard-coded to two on-disk templates, and the fluency variant lives as an inline `system_prompt = """..."""` string in `generate_and_evaluate.py`. There is no way to swap in a different judge prompt from the CLI, and no reliable record of which prompt a run actually used (only SHA-256 hashes after-the-fact).

This PR is the smallest piece of plumbing that gives us a named registry + per-task default mapping + CLI surface; subsequent PRs in the reproducibility-hardening stack will fold the resolved prompt SHA into cache keys and write the verbatim templates next to each run.
Test plan

- `uv run pytest -q`
- `tests/test_prompt_registry.py` covers task defaults, explicit presets, the `provide_explanation` alias, file overrides winning over presets, the `fastchat-pairwise` delegate, and error paths.
- `tests/test_cli.py::test_judge_prompt_preset_flag_is_forwarded` verifies the new flag survives CLI dispatch.
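The resolution rules these tests exercise can be sketched as follows. This is an illustrative reconstruction of the precedence order (explicit preset beats the `--provide_explanation` alias, which beats the per-task default; a prompt file beats the preset text), with hypothetical helper names, not the PR's actual resolver.

```python
from pathlib import Path


def resolve_preset_name(task, preset=None, provide_explanation=False):
    """Pick the preset name using the PR's documented precedence."""
    if preset is not None:
        return preset                      # explicit --judge_prompt_preset
    if provide_explanation:
        return "default_with_explanation"  # alias for --provide_explanation
    # Per-task default; full prefix lookup elided for brevity.
    return "fluency" if task.startswith("fluency") else "default"


def resolve_prompt_text(preset_text, override_file=None):
    """A --judge_*_prompt_file override, when set, wins over the preset."""
    if override_file is not None:
        return Path(override_file).read_text()
    return preset_text
```

Keeping the precedence in two tiny pure functions makes each rule independently testable, which matches how the test plan enumerates them.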