# Per-role GenerationConfig and backend plumbing (#41)

**Open** · alexrs-cohere wants to merge 1 commit into `OpenEuroLLM:main` from
## Summary

Reworks the CLI surface so every knob that can affect generated text is exposed independently for the three runtime roles (model A, model B, and the LLM judge) and is recorded explicitly in the run.
## Public CLI surface

For each of `A`, `B`, and `judge`, the parser now accepts:

- `--temperature_<role>`
- `--top_p_<role>`
- `--top_k_<role>`
- `--seed_<role>`
- `--max_out_tokens_<role>`
- `--max_model_len_<role>`
- `--chat_template_<role>`
- `--engine_kwargs_<role>`

The legacy non-role-aware flags (`--max_out_tokens_models`, `--max_model_len`, `--chat_template`, `--engine_kwargs`) keep working as deprecated aliases that fan out to the appropriate roles with a `DeprecationWarning`. Per-role flags always win when both are set.
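A minimal sketch of how the per-role flags and one deprecated alias could be wired up, assuming the knob names above; the real parser lives in `cli_common` and its types and defaults may differ:

```python
import argparse
import warnings

ROLES = ("A", "B", "judge")
KNOBS = {
    "temperature": float, "top_p": float, "top_k": int, "seed": int,
    "max_out_tokens": int, "max_model_len": int,
    "chat_template": str, "engine_kwargs": str,
}

parser = argparse.ArgumentParser()
for role in ROLES:
    for knob, knob_type in KNOBS.items():
        parser.add_argument(f"--{knob}_{role}", type=knob_type, default=None)
# Legacy alias that fans out to the model roles A and B (not the judge).
parser.add_argument("--max_out_tokens_models", type=int, default=None)

ns = parser.parse_args(
    ["--max_out_tokens_models", "1024", "--max_out_tokens_A", "2048"])
if ns.max_out_tokens_models is not None:
    warnings.warn("--max_out_tokens_models is deprecated; use the per-role flags",
                  DeprecationWarning)
    for role in ("A", "B"):
        # The per-role flag always wins over the deprecated alias.
        if getattr(ns, f"max_out_tokens_{role}") is None:
            setattr(ns, f"max_out_tokens_{role}", ns.max_out_tokens_models)

assert ns.max_out_tokens_A == 2048 and ns.max_out_tokens_B == 1024
```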
## Internals

- `GenerationConfig` (frozen dataclass in `cli_common`) packages the eight knobs above.
- `BaseCliArgs` now holds three of them (`gen_A` / `gen_B` / `gen_judge`) instead of the previous flat list of `max_out_tokens_models` / `max_out_tokens_judge` / `max_model_len` / `chat_template` / `engine_kwargs`.
- `resolve_generation_configs` turns a parsed `argparse.Namespace` into the per-role configs.
- `gen_config_to_invoke_kwargs` flattens a config into the kwargs that `make_model` / `generate_*` accept (a sketch of these helpers follows this list).
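A rough sketch of the dataclass and the two helpers, assuming the eight knobs above; the actual field names and signatures in `cli_common` may differ:

```python
from dataclasses import dataclass, asdict
from argparse import Namespace

@dataclass(frozen=True)
class GenerationConfig:
    # The eight per-role knobs from the CLI surface above.
    temperature: float | None = None
    top_p: float | None = None
    top_k: int | None = None
    seed: int | None = None
    max_out_tokens: int | None = None
    max_model_len: int | None = None
    chat_template: str | None = None
    engine_kwargs: str | None = None

KNOB_NAMES = tuple(GenerationConfig.__dataclass_fields__)

def resolve_generation_configs(ns: Namespace) -> dict[str, GenerationConfig]:
    """Collect the --<knob>_<role> values into one config per role."""
    return {
        role: GenerationConfig(
            **{k: getattr(ns, f"{k}_{role}", None) for k in KNOB_NAMES})
        for role in ("A", "B", "judge")
    }

def gen_config_to_invoke_kwargs(cfg: GenerationConfig) -> dict:
    """Flatten a config into make_model/generate_* kwargs, skipping unset fields."""
    return {k: v for k, v in asdict(cfg).items() if v is not None}
```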
## Backends

- `ChatVLLM` accepts `temperature`, `top_p`, `top_k`, and `seed` and writes them through to `SamplingParams` (it used to hard-code `temperature=0.6` / `top_p=0.95`); a new `set_temperature` method lets MT-Bench's category-aware temperature switching keep working without reaching into private attributes.
- `make_model` extracts the cross-backend sampling fields from `engine_kwargs` and routes them to the provider-appropriate constructor argument: `ChatOpenAI` / OpenRouter / Together / `LlamaCpp` all see `temperature` / `top_p` / `seed` directly; `top_k` is tunneled through `model_kwargs` for OpenAI-compatible backends (sketched below).
- `DummyModel` records every kwarg the constructor sees on `init_kwargs` so tests can assert that per-role configs reach the model layer.
- `do_inference` gains an optional `out_metadata` argument that collects per-call provider response metadata (`system_fingerprint`, `model_name`, etc.) when present.
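Two illustrative fragments of the routing idea; `split_sampling_kwargs`, `build_sampling_params`, and `OPENAI_COMPATIBLE` are hypothetical names, not the project's actual code:

```python
CROSS_BACKEND = ("temperature", "top_p", "top_k", "seed")
OPENAI_COMPATIBLE = {"openai", "openrouter", "together"}

def split_sampling_kwargs(provider: str, engine_kwargs: dict) -> dict:
    """Pull the cross-backend sampling fields out of engine_kwargs and
    route them to provider-appropriate constructor arguments."""
    sampling = {k: engine_kwargs.pop(k)
                for k in CROSS_BACKEND if k in engine_kwargs}
    if provider in OPENAI_COMPATIBLE:
        # ChatOpenAI-style constructors take temperature/top_p/seed
        # directly, but top_k has to be tunneled through model_kwargs.
        top_k = sampling.pop("top_k", None)
        if top_k is not None:
            engine_kwargs.setdefault("model_kwargs", {})["top_k"] = top_k
    return {**engine_kwargs, **sampling}

def build_sampling_params(temperature, top_p, top_k, seed):
    # Inside ChatVLLM: forward the per-role knobs to vLLM instead of the
    # old hard-coded temperature=0.6 / top_p=0.95.
    from vllm import SamplingParams  # requires vLLM at runtime
    return SamplingParams(temperature=temperature, top_p=top_p,
                          top_k=top_k, seed=seed)
```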
## Call sites

`generate_and_evaluate.main`, `estimate_elo_ratings.main`, and `mt_bench/mt_bench_utils._generate_mt_bench_completions` now build their generators from `args.gen_A` / `args.gen_B` / `args.gen_judge` via `gen_config_to_invoke_kwargs`. MT-Bench keeps its category-aware temperature defaults, but only when the user has not explicitly pinned a role-level temperature; otherwise the CLI override wins. The judge call in MT-Bench keeps its historical `temperature=0.0` default but accepts overrides from `--temperature_judge`.
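The precedence rule amounts to something like the following; the category defaults shown are placeholders, not MT-Bench's actual table:

```python
# Hypothetical illustration: an explicit CLI temperature beats the
# category-aware MT-Bench default.
CATEGORY_TEMPERATURE = {"writing": 0.7, "math": 0.0}

def effective_temperature(cli_temperature: float | None, category: str) -> float:
    if cli_temperature is not None:
        # The user pinned --temperature_A/_B, so the CLI override wins.
        return cli_temperature
    # Otherwise fall back to the category-aware default.
    return CATEGORY_TEMPERATURE.get(category, 0.7)
```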
## Backward compatibility & out of scope

The cache-key format and `run-metadata.v1.json` schema are intentionally unchanged in this PR; bumping the schema and content-addressing the cache off `GenerationConfig` ship in follow-up PRs in the reproducibility-hardening stack.
## Why

Today vLLM's sampling parameters are hard-coded (`temperature=0.6`, `top_p=0.95`) at the wrapper layer, hosted providers ignore the seed, and the same `--max_out_tokens_models` value is forced on both battle models even when you'd want to ablate them independently. This PR is the foundation for everything else: once a run is described by three explicit `GenerationConfig`s, the cache key, the run metadata, and the `--rerun` helper all have something concrete to hash and replay.
## Test plan

- `uv run pytest -q`: 83/83 tests pass on this branch in isolation; 117/117 pass when stacked with the rest of the reproducibility-hardening work.
- `tests/test_seed_plumbing.py` (new): `--seed_A` / `--temperature_judge` land on the correct `GenerationConfig`; `DummyModel.init_kwargs` captures the values flowing through `make_model`; `gen_config_to_invoke_kwargs` skips unset fields (a test sketch follows this list).
- `tests/test_cli.py`: new tests for per-role temperature/seed flags, deprecated `--engine_kwargs` fan-out behaviour, per-role overrides winning over the deprecation alias, and `--max_out_tokens_models` fanning out to A and B but not the judge.
- A manual run with `--seed_A` / `--temperature_A 0.0` to confirm the seed reaches `SamplingParams`.
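A test in the spirit of the ones described above, assuming a hypothetical `build_parser` helper plus the `resolve_generation_configs` / `gen_config_to_invoke_kwargs` sketches earlier; the real test's structure may differ:

```python
def test_per_role_flags_land_on_the_right_config():
    parser = build_parser()  # hypothetical helper returning the CLI parser
    ns = parser.parse_args(["--seed_A", "1234", "--temperature_judge", "0.0"])
    configs = resolve_generation_configs(ns)
    assert configs["A"].seed == 1234
    assert configs["judge"].temperature == 0.0
    assert configs["B"].seed is None                        # untouched roles stay unset
    assert gen_config_to_invoke_kwargs(configs["B"]) == {}  # unset fields are skipped
```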