30 commits
c6b2b0a  update dependencies to support Qwen 3.5 (ErlisLushtaku, Mar 31, 2026)
1f4bae8  slurmpilot scripts (ErlisLushtaku, Apr 1, 2026)
25b0355  update dep versions (ErlisLushtaku, Apr 1, 2026)
ab065fd  fix support for VLLM (ErlisLushtaku, Apr 6, 2026)
ef1c92c  remove qwen35 smoke launcher (ErlisLushtaku, Apr 6, 2026)
32f2e7e  use json schema structured outputs, tighten vllm range (ErlisLushtaku, Apr 7, 2026)
5f2edf0  fix formatting (ErlisLushtaku, Apr 7, 2026)
cffb6dd  Fix Qwen3.5 with mt-bench (ErlisLushtaku, Apr 7, 2026)
ac243aa  use latest vllm with thinking tokens limits and thinking field in the… (ErlisLushtaku, Apr 13, 2026)
41298a4  Merge remote-tracking branch 'origin/main' into erlislushtaku/fix/sup… (ErlisLushtaku, Apr 14, 2026)
319050d  thinking token handling improvements, mt-bench improvements, use mt-b… (ErlisLushtaku, Apr 14, 2026)
cb7ada5  Revert to free form generation, and use thinking token budget with re… (ErlisLushtaku, Apr 15, 2026)
84faa05  revert unnecessary changes and relics from earlier trials (ErlisLushtaku, Apr 17, 2026)
c063f3d  delete slurmpilot script (ErlisLushtaku, Apr 17, 2026)
ec7fc95  Revert comment removal (ErlisLushtaku, Apr 17, 2026)
20ca9a5  simplify and revert unnecessary changes (ErlisLushtaku, Apr 17, 2026)
217dc8d  Support Skywork (ErlisLushtaku, Apr 17, 2026)
8087c15  Add judge input character truncation and model length configurations … (ErlisLushtaku, Apr 20, 2026)
91d67ef  add llmcompressor dev dependency for quantization (ErlisLushtaku, Apr 20, 2026)
5e8efc9  Update baseline handling for Arena-Hard datasets (ErlisLushtaku, Apr 21, 2026)
2af4714  Add m-arenahard-v2.0 (ErlisLushtaku, Apr 21, 2026)
da6818e  add default baseline for mt-bench (ErlisLushtaku, Apr 21, 2026)
891c417  handle prohibited content errors for gemini in openrouter (ErlisLushtaku, Apr 21, 2026)
fb36154  update system prompt with alpaca eval version, fix mismatch for expec… (ErlisLushtaku, Apr 22, 2026)
f33f191  roll back to the default system prompt (ErlisLushtaku, Apr 28, 2026)
e21639e  update dependencies for qwen3.5 and gemma4 runs (ErlisLushtaku, Apr 28, 2026)
157d939  Merge origin/main into support-qwen-3.5 (ErlisLushtaku, Apr 28, 2026)
41d925e  Improve pairwise benchmark run controls and accounting (ErlisLushtaku, Apr 29, 2026)
16dc5e1  Clean up judge argument handling (ErlisLushtaku, Apr 29, 2026)
5411ff8  Add default score-based verdict mode for fastchat (ErlisLushtaku, Apr 29, 2026)
38 changes: 27 additions & 11 deletions README.md
@@ -197,20 +197,36 @@ Task names follow [LMHarness](https://github.com/EleutherAI/lm-evaluation-harnes

### Generate + judge (pairwise)

| Task | Description |
|-----------------------|------------------------------------------------------------------------------------------------|
| `alpaca-eval` | General instruction-following benchmark |
| `arena-hard-v2.0` | Arena-Hard v2.0 from official `lmarena-ai/arena-hard-auto` source |
| `arena-hard-v0.1` | Legacy Arena-Hard v0.1 from official `lmarena-ai/arena-hard-auto` source |
| `m-arena-hard` | Translated version of Arena-Hard in 23 languages |
| `m-arena-hard-{lang}` | Language-specific variants (e.g., `ar`, `cs`, `de`) |
| `m-arena-hard-EU` | All EU languages combined |
| `mt-bench` | Multi-turn benchmark with FastChat-compatible pairwise judging |
| `fluency-{lang}` | Fluency evaluation for pretrained models (`finnish`, `french`, `german`, `spanish`, `swedish`) |
| Task | Description |
|------------------------------|------------------------------------------------------------------------------------------------|
| `alpaca-eval` | General instruction-following benchmark |
| `arena-hard-v2.0` | Arena-Hard v2.0 from official `lmarena-ai/arena-hard-auto` source |
| `arena-hard-v0.1` | Legacy Arena-Hard v0.1 from official `lmarena-ai/arena-hard-auto` source |
| `m-arena-hard-v0.1` | `CohereLabs/m-ArenaHard` (500 prompts, Google-Translate) across 23 languages |
| `m-arena-hard-v0.1-{lang}` | Language-specific v0.1 slice (e.g., `ar`, `cs`, `de`, `uk`, `zh`, `pl`) |
| `m-arena-hard-v0.1-EU` | All EU v0.1 languages combined |
| `m-arena-hard-v2.0` | `CohereLabs/m-ArenaHard-v2.0` (498 prompts, in-house translation) across 23 languages |
| `m-arena-hard-v2.0-{lang}` | Language-specific v2.0 slice |
| `m-arena-hard-v2.0-EU` | All EU v2.0 languages combined |
| `mt-bench` | Multi-turn benchmark with FastChat-compatible pairwise judging |
| `fluency-{lang}` | Fluency evaluation for pretrained models (`finnish`, `french`, `german`, `spanish`, `swedish`) |

For MT-Bench, the default pairwise baseline is `gpt-4`.
We diverge from FastChat's own `pairwise-baseline` default (`gpt-3.5-turbo`) to keep
a stronger reference consistent with Arena-Hard v0.1; the `gpt-4.jsonl` completions
ship in the `lmsys/mt-bench` HF Space. Override per run with `--model_B`.
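
A minimal sketch of the intended fallback when `--model_B` is omitted; the helper name `pick_model_b` is hypothetical (the PR's actual helper is `native_pairwise_baseline` in `judgearena.generate_and_evaluate`):

```python
# Illustrative only: shows the fallback behavior described above, not the PR's code.
def pick_model_b(task: str, model_b_flag: str | None) -> str:
    if model_b_flag is not None:  # an explicit --model_B always wins
        return model_b_flag
    if task == "mt-bench":
        return "gpt-4"  # default baseline; completions ship in the lmsys/mt-bench HF Space
    raise SystemExit(f"--model_B is required for task {task!r}.")
```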

For Arena-Hard, JudgeArena resolves baseline metadata by task version (see the sketch after this list):
- `arena-hard-v0.1`: `gpt-4-0314`.
- `arena-hard-v2.0`: per-question baseline routed by `category`:
  - `o3-mini-2025-01-31` for `hard_prompt`, `coding`, and `math` (500 prompts).
  - `gemini-2.0-flash-001` for `creative_writing` (250 prompts).
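
A minimal sketch of this per-category routing; the function name and signature are illustrative, not the PR's implementation:

```python
def resolve_arena_hard_baseline(task: str, category: str | None = None) -> str:
    """Illustrative baseline lookup; the real resolution lives inside JudgeArena."""
    if task == "arena-hard-v0.1":
        return "gpt-4-0314"
    if task == "arena-hard-v2.0":
        # Creative-writing questions use the Gemini baseline; everything else
        # (hard_prompt, coding, math) uses o3-mini.
        if category == "creative_writing":
            return "gemini-2.0-flash-001"
        return "o3-mini-2025-01-31"
    raise ValueError(f"No native pairwise baseline registered for {task!r}")
```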

For m-Arena-Hard, baseline completions are tied to the benchmark release:
- `m-arena-hard-v0.1`: Aya Expanse 8B (`CohereLabs/aya-expanse-8b`), ingested
from `CohereLabs/deja-vu-pairwise-evals` (repeat 0) via
[`scripts/multilingual_arena_hard/ingest_deja_vu_aya_references.py`](scripts/multilingual_arena_hard/ingest_deja_vu_aya_references.py).
- `m-arena-hard-v2.0`: Gemini 2.5 Flash (`google/gemini-2.5-flash`).

### ELO rating

15 changes: 15 additions & 0 deletions judgearena/chat_models/__init__.py
@@ -0,0 +1,15 @@
"""Chat-model adapters with provider-specific hardening."""

from judgearena.chat_models.openrouter_gemini import (
GEMINI_SAFETY_REFUSAL_MARKER,
OPENROUTER_GEMINI_SAFETY_REFUSAL_FINISH_REASON,
OpenRouterGeminiSafetyTolerantChatOpenAI,
is_openrouter_gemini_model,
)

__all__ = [
"GEMINI_SAFETY_REFUSAL_MARKER",
"OPENROUTER_GEMINI_SAFETY_REFUSAL_FINISH_REASON",
"OpenRouterGeminiSafetyTolerantChatOpenAI",
"is_openrouter_gemini_model",
]
102 changes: 102 additions & 0 deletions judgearena/chat_models/openrouter_gemini.py
@@ -0,0 +1,102 @@
"""ChatOpenAI subclass tolerant to Gemini's PROHIBITED_CONTENT hard-refusals.

Google's core policy filter rejects a small fraction of prompts (e.g. graphic
violence, sexual content involving minors) with HTTP 403 ``PROHIBITED_CONTENT``
*regardless* of the adjustable ``safety_settings`` thresholds. These refusals
are legitimate, reproducible model behavior that a benchmark like
``m-arena-hard-v2.0`` surfaces: the baseline should contain them so the judge
can score them, not crash the run.

The subclass intercepts the error response before LangChain raises, returns
a stub assistant message with a clearly marked refusal payload and
``finish_reason="content_filter"``, and lets the rest of the pipeline proceed
unchanged.
"""

from __future__ import annotations

from typing import Any

from langchain_openai import ChatOpenAI

GEMINI_SAFETY_REFUSAL_MARKER = (
    "[Gemini safety refusal: PROHIBITED_CONTENT — Google's core policy filter "
    "blocked this prompt regardless of safety_settings.]"
)
OPENROUTER_GEMINI_SAFETY_REFUSAL_FINISH_REASON = "content_filter"

_PROHIBITED_CONTENT_TOKEN = "PROHIBITED_CONTENT"


def is_openrouter_gemini_model(model_spec: str) -> bool:
    """Return True when ``model_spec`` targets a Gemini model via OpenRouter.

    Matches ``OpenRouter/google/gemini-2.5-flash`` and related variants.
    """
    provider, sep, model_name = model_spec.partition("/")
    if not sep:
        return False
    lowered = model_name.lower()
    return provider == "OpenRouter" and (
        lowered.startswith("google/gemini") or lowered.startswith("google/gemma")
    )


def _error_is_prohibited_content(error: object) -> bool:
    if error is None:
        return False
    return _PROHIBITED_CONTENT_TOKEN in str(error)


def _build_prohibited_content_stub_payload(
    *, original_response: dict[str, Any], model_name: str
) -> dict[str, Any]:
    stub_message = {
        "role": "assistant",
        "content": GEMINI_SAFETY_REFUSAL_MARKER,
    }
    stub_choice = {
        "index": 0,
        "message": stub_message,
        "finish_reason": OPENROUTER_GEMINI_SAFETY_REFUSAL_FINISH_REASON,
    }
    return {
        "id": original_response.get("id") or "openrouter-gemini-safety-stub",
        "object": "chat.completion",
        "created": original_response.get("created") or 0,
        "model": original_response.get("model") or model_name,
        "choices": [stub_choice],
        "usage": {
            "prompt_tokens": 0,
            "completion_tokens": 0,
            "total_tokens": 0,
        },
    }


class OpenRouterGeminiSafetyTolerantChatOpenAI(ChatOpenAI):
    """ChatOpenAI that converts Gemini PROHIBITED_CONTENT errors to stubs.

    Only intercepts the specific OpenRouter error surface for Gemini's core
    policy filter; all other errors propagate unchanged. The stub message has
    ``content == GEMINI_SAFETY_REFUSAL_MARKER`` and ``finish_reason ==
    "content_filter"`` so upstream validators and judges see the refusal
    explicitly rather than a silent drop.
    """

    def _create_chat_result(  # type: ignore[override]
        self,
        response,
        generation_info: dict | None = None,
    ):
        response_dict = (
            response if isinstance(response, dict) else response.model_dump()
        )
        error = response_dict.get("error")
        if _error_is_prohibited_content(error):
            stub = _build_prohibited_content_stub_payload(
                original_response=response_dict,
                model_name=self.model_name,
            )
            return super()._create_chat_result(stub, generation_info=generation_info)
        return super()._create_chat_result(response, generation_info=generation_info)
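
A hedged usage sketch of how this adapter might be wired up for an OpenRouter Gemini baseline; the base URL, API-key handling, and surrounding wiring are assumptions, not code from this PR:

```python
from judgearena.chat_models import (
    OpenRouterGeminiSafetyTolerantChatOpenAI,
    is_openrouter_gemini_model,
)

model_spec = "OpenRouter/google/gemini-2.5-flash"
if is_openrouter_gemini_model(model_spec):
    chat = OpenRouterGeminiSafetyTolerantChatOpenAI(
        model=model_spec.partition("/")[2],        # "google/gemini-2.5-flash"
        base_url="https://openrouter.ai/api/v1",   # assumed OpenRouter endpoint
        api_key="sk-or-...",                       # placeholder; use an env var in practice
    )
    # A prompt that trips Google's core policy filter now yields a stub assistant
    # message containing GEMINI_SAFETY_REFUSAL_MARKER instead of raising.
```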
22 changes: 20 additions & 2 deletions judgearena/cli.py
@@ -17,7 +17,7 @@
)
from judgearena.estimate_elo_ratings import CliEloArgs
from judgearena.estimate_elo_ratings import main as main_elo
from judgearena.generate_and_evaluate import CliArgs
from judgearena.generate_and_evaluate import CliArgs, native_pairwise_baseline
from judgearena.generate_and_evaluate import main as main_generate_and_evaluate
from judgearena.log import configure_logging, get_logger

@@ -196,13 +196,21 @@ def _build_elo_args(
        provide_explanation=args.provide_explanation,
        swap_mode=args.swap_mode,
        ignore_cache=args.ignore_cache,
        judge_prompt_preset=args.judge_prompt_preset,
        mt_bench_judge_mode=args.mt_bench_judge_mode,
        battle_thinking_token_budget=args.battle_thinking_token_budget,
        strip_thinking_before_judging=args.strip_thinking_before_judging,
        skip_judging=args.skip_judging,
        truncate_all_input_chars=args.truncate_all_input_chars,
        truncate_judge_input_chars=args.truncate_judge_input_chars,
        max_out_tokens_models=args.max_out_tokens_models,
        max_out_tokens_judge=args.max_out_tokens_judge,
        max_model_len=args.max_model_len,
        max_judge_model_len=args.max_judge_model_len,
        chat_template=args.chat_template,
        result_folder=args.result_folder,
        engine_kwargs=parse_engine_kwargs(args.engine_kwargs),
        judge_engine_kwargs=parse_engine_kwargs(args.judge_engine_kwargs),
        verbosity=resolve_verbosity(args),
        log_file=args.log_file,
        no_log_file=args.no_log_file,
@@ -212,7 +220,9 @@ def _build_generate_and_evaluate_args(
def _build_generate_and_evaluate_args(
    args: argparse.Namespace, task: str, model_a: str | None
) -> CliArgs:
    if model_a is None or args.model_B is None:
    if model_a is None or (
        args.model_B is None and native_pairwise_baseline(task) is None
    ):
        raise SystemExit(f"--model_A and --model_B are required for task {task!r}.")
    return CliArgs(
        task=task,
@@ -224,13 +234,21 @@ def _build_generate_and_evaluate_args(
        provide_explanation=args.provide_explanation,
        swap_mode=args.swap_mode,
        ignore_cache=args.ignore_cache,
        judge_prompt_preset=args.judge_prompt_preset,
        mt_bench_judge_mode=args.mt_bench_judge_mode,
        battle_thinking_token_budget=args.battle_thinking_token_budget,
        strip_thinking_before_judging=args.strip_thinking_before_judging,
        skip_judging=args.skip_judging,
        truncate_all_input_chars=args.truncate_all_input_chars,
        truncate_judge_input_chars=args.truncate_judge_input_chars,
        max_out_tokens_models=args.max_out_tokens_models,
        max_out_tokens_judge=args.max_out_tokens_judge,
        max_model_len=args.max_model_len,
        max_judge_model_len=args.max_judge_model_len,
        chat_template=args.chat_template,
        result_folder=args.result_folder,
        engine_kwargs=parse_engine_kwargs(args.engine_kwargs),
        judge_engine_kwargs=parse_engine_kwargs(args.judge_engine_kwargs),
        verbosity=resolve_verbosity(args),
        log_file=args.log_file,
        no_log_file=args.no_log_file,