MAF-19809: feat(preset): add vLLM v0.20.1 presets with multimodal & speculative decoding #127
Open
bongwoobak wants to merge 6 commits into
…n models

Create v0.20.1 IST presets for all currently deployed models on aiand-rke2. Based on official vLLM recipes (https://github.com/vllm-project/recipes).

Key changes from v0.19.x:
- Unified to a single vllm/vllm-openai:v0.20.1 image (drops the cu130-nightly, glm51-cu130, and gemma4 custom images; the official wheel now includes CUDA 13.0.2, transformers v5, the gemma4 arch, and SM103 support)
- GLM-5/5.1: removed ISVC_PRE_PROCESS_SCRIPT (transformers v5 baseline)
- GLM-5.1: removed the --chat-template-content-format string workaround (vllm#39899 tool result fix included)
- Qwen3.5: removed VLLM_USE_DEEP_GEMM=0 (vllm#38083 auto-disables on Blackwell)
- DeepSeek V3.2: switched back to dp8-moe-ep8 (multiple indirect fixes for vllm#27259 included: WideEP all2all replacement, EPLB refactor, DSV3.2 token leakage fix)
- Multimodal enabled for Qwen3.5 (all sizes), Gemma 4, and Gemma 3: image input supported, video where supported by the model
- Kimi K2.5: video=0 explicitly set (vLLM kimi_k2 modeling code does not register a video processor; video is only available via Moonshot's official API)

Models included:
- Qwen3.5-9B (L40S tp1)
- Qwen3.5-27B (H200 tp1)
- Qwen3.5-397B-A17B-FP8 (B300 dp4-moe-ep4)
- GLM-5-FP8 (H200 tp8 mtp)
- GLM-5.1-FP8 (B300 tp8 mtp)
- Gemma 4 31B (L40S tp4 multimodal)
- Gemma 3 27B (L40S tp4 image-only)
- DeepSeek V3.2 (H200 dp8-moe-ep8)
- GPT-OSS-120B (H100-NVL tp4-moe-tp4)
- Kimi K2.5 (B300 tp8 image-only)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ts + add Kimi K2.6

Add speculative decoding to all v0.20.1 presets where supported:

Native MTP (preferred over Eagle3 where available):
- Qwen3.5-9B: qwen3_next_mtp, num_spec=3, renamed to -mtp
- Qwen3.5-27B: qwen3_next_mtp, num_spec=3, renamed to -mtp
- Qwen3.5-397B: mtp (MoE variant), num_spec=3, renamed to -mtp
- DeepSeek V3.2: deepseek_mtp, num_spec=3, renamed to -mtp

Draft model / Eagle3 (no native MTP available):
- Gemma 4 31B: google/gemma-4-31B-it-assistant draft model, num_spec=3
- GPT-OSS-120B: nvidia/gpt-oss-120b-Eagle3-v2, eagle3, num_spec=3
- Kimi K2.5: lightseekorg/kimi-k2.5-eagle3, eagle3, num_spec=3

No spec decoding (not supported):
- Gemma 3 27B: no official drafter, ngram only (skipped)

Other changes:
- Kimi K2.5: enable video input (--limit-mm-per-prompt video=1); the vLLM kimi_k25 modeling code does support VisionChunkVideo, contrary to the earlier assumption based only on the HF model card. Actual behavior to be tested.
- New: Kimi K2.6 (B300 tp8) IST with Eagle3 draft model

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a new set of vLLM v0.20.1 Odin preset templates for multiple production models, enabling multimodal limits and speculative decoding where applicable, and introducing a new Kimi K2.6 preset.
Changes:
- Added `v0.20.1` `InferenceServiceTemplate` presets for Qwen3.5, GLM-5/5.1, DeepSeek V3.2, Gemma 3/4, GPT-OSS-120B, and Kimi K2.5/K2.6.
- Enabled per-model speculative decoding configuration (MTP / Eagle3 drafter) and multimodal prompt limits for supported models.
- Standardized presets to the unified `vllm/vllm-openai:v0.20.1` image.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/zai-org-glm-5.1-fp8-mtp-nvidia-b300-tp8-moe-tp8.helm.yaml | Adds GLM-5.1 FP8 TP8 MoE preset with MTP args on B300. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/zai-org-glm-5-fp8-mtp-nvidia-h200-sxm-tp8-moe-tp8.helm.yaml | Adds GLM-5 FP8 TP8 MoE preset with MTP args on H200 SXM. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/qwen-qwen3.5-9b-mtp-nvidia-l40s-1.helm.yaml | Adds Qwen3.5-9B TP1 preset with multimodal limits and Qwen MTP config. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/qwen-qwen3.5-397b-a17b-fp8-mtp-nvidia-b300-dp4-moe-ep4.helm.yaml | Adds Qwen3.5-397B FP8 DP4+EP preset with multimodal limits and MTP. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/qwen-qwen3.5-27b-mtp-nvidia-h200-sxm-1.helm.yaml | Adds Qwen3.5-27B TP1 preset with multimodal limits and Qwen MTP config. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/openai-gpt-oss-120b-nvidia-h100-nvl-tp4-moe-tp4.helm.yaml | Adds GPT-OSS-120B TP4 MoE preset with Eagle3 drafter config. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/moonshotai-kimi-k2.6-nvidia-b300-tp8.helm.yaml | Adds new Kimi K2.6 TP8 preset with multimodal limits and Eagle3 drafter. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/moonshotai-kimi-k2.5-nvidia-b300-tp8.helm.yaml | Adds Kimi K2.5 TP8 preset with multimodal limits and Eagle3 drafter. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/google-gemma-4-31b-it-nvidia-l40s-tp4.helm.yaml | Adds Gemma 4 TP4 preset with image limits and speculative decoding config. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/google-gemma-3-27b-it-nvidia-l40s-tp4.helm.yaml | Adds Gemma 3 TP4 preset with image limits. |
| deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/deepseek-ai-deepseek-v3.2-mtp-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml | Adds DeepSeek V3.2 DP8+EP preset with DeepSeek MTP config. |
- Qwen3.5-9B/27B: parallelism label "tp1" → "1" (single-device convention)
- Qwen3.5-397B (DP4+EP4): spec.template → spec.workerTemplate
- DeepSeek V3.2 (DP8+EP8): spec.template → spec.workerTemplate
- GPT-OSS-120B: remove --no-enable-prefix-caching (user-configurable knob, per AGENTS.md:183-189)
- Gemma 4: add explicit method=draft_model to --speculative-config (vLLM SpeculativeConfig auto-detects, but explicit is safer)
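The naming convention touched by this commit (a bare `1` for single-device presets, labels such as `tp8-moe-tp8` or `dp8-moe-ep8` otherwise) can be checked mechanically. The following is a minimal sketch, not part of the PR: the `parse_preset_name` helper and its regular expression are hypothetical and are inferred only from the eleven file names listed in this PR, not from a documented rule.

```python
import re

# Hypothetical pattern inferred from the file names in this PR (an assumption,
# not a documented convention):
#   <model-slug>-nvidia-<accelerator>-<parallelism>.helm.yaml
# where <parallelism> is a bare device count ("1") or a label such as
# "tp4", "tp8-moe-tp8", or "dp8-moe-ep8".
PRESET_NAME = re.compile(
    r"^(?P<model>.+?)-nvidia-(?P<accelerator>.+?)-"
    r"(?P<parallelism>\d+|(?:tp|dp)\d+(?:-moe-(?:tp|ep)\d+)?)\.helm\.yaml$"
)


def parse_preset_name(filename: str) -> dict[str, str]:
    """Split a preset file name into model slug, accelerator, and parallelism."""
    match = PRESET_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognized preset file name: {filename}")
    return match.groupdict()


if __name__ == "__main__":
    for name in (
        "qwen-qwen3.5-9b-mtp-nvidia-l40s-1.helm.yaml",
        "deepseek-ai-deepseek-v3.2-mtp-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml",
    ):
        print(parse_preset_name(name))
```

Under this pattern a name ending in `-tp1.helm.yaml` still parses, so a stricter check for the single-device convention would have to reject `tp1` explicitly.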
nulledge (Contributor) reviewed on May 8, 2026 and left a comment:
Checked the diff. Everything is redundant with the Copilot review, except the comment about Gemma 4.
That comment looks incorrect: omitting `method` with `model` set falls through to vLLM's auto-detect path, so speculative decoding stays on for Gemma 4.
Per @nulledge review: omitting `method` with `model` set falls through to vLLM's auto-detect path (speculative.py:518-522, 650-688). The explicit `method=draft_model` added in the previous commit was unnecessary and not aligned with the official Gemma 4 recipe.
Match the upstream vLLM recipe (models/moonshotai/Kimi-K2.6.yaml):
- Switch the Eagle3 draft from `lightseekorg/kimi-k2.6-eagle3` to `lightseekorg/kimi-k2.6-eagle3-mla` (MLA-tuned variant aligned with Kimi K2.6's MLA attention).
- Add `--attention-config.use_trtllm_ragged_deepseek_prefill=True` per the recipe's hardware_overrides.blackwell entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines +39 to +42

    --reasoning-parser kimi_k2
    --enable-auto-tool-choice
    --speculative-config '{"model":"lightseekorg/kimi-k2.6-eagle3-mla","method":"eagle3","num_speculative_tokens":3}'
    --attention-config.use_trtllm_ragged_deepseek_prefill=True
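The `--speculative-config` value in the lines above is JSON wrapped in single quotes, so it only reaches vLLM intact if the quoting survives tokenization. A minimal sketch of a sanity check follows. It assumes the arguments are ultimately split with POSIX shell rules, which the PR does not state; the `extract_speculative_config` helper is hypothetical.

```python
import json
import shlex

# The four argument lines from the review thread, copied verbatim.
ARG_LINES = [
    "--reasoning-parser kimi_k2",
    "--enable-auto-tool-choice",
    """--speculative-config '{"model":"lightseekorg/kimi-k2.6-eagle3-mla","method":"eagle3","num_speculative_tokens":3}'""",
    "--attention-config.use_trtllm_ragged_deepseek_prefill=True",
]


def extract_speculative_config(arg_lines: list[str]) -> dict:
    """Tokenize the arguments like a POSIX shell and decode the JSON value.

    Assumption: the value follows the flag as a separate token.
    """
    tokens = shlex.split(" ".join(arg_lines))
    value = tokens[tokens.index("--speculative-config") + 1]
    return json.loads(value)  # raises json.JSONDecodeError if quoting broke the JSON


if __name__ == "__main__":
    print(extract_speculative_config(ARG_LINES))
```

If the single quotes were lost in templating, `shlex.split` would strip the inner double quotes and `json.loads` would fail, which makes this a cheap check to run against rendered output.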
Summary
v0.19.x → v0.20.1 Changes
Image Consolidation (3 custom images → unified `vllm/vllm-openai:v0.20.1`)
- `cu130-nightly`
- `glm51-cu130`
- `gemma4`

Workaround Removal
- `ISVC_PRE_PROCESS_SCRIPT` (transformers runtime install): GLM-5/5.1 (transformers v5 baseline)
- `--chat-template-content-format string`: GLM-5.1 (vllm#39899 included)
- `VLLM_USE_DEEP_GEMM=0`: Qwen3.5, all sizes (vllm#38083 auto-disables on Blackwell)

Workaround Retained
- `VLLM_WORKER_MULTIPROC_METHOD=spawn`: Gemma 4 (CUDA fork issue vllm#32611 still open)

Speculative Decoding Matrix
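Native MTP (preferred):

| Model | Method |
|---|---|
| Qwen3.5-9B | `qwen3_next_mtp` |
| Qwen3.5-27B | `qwen3_next_mtp` |
| Qwen3.5-397B | `mtp` (MoE variant) |
| GLM-5 / GLM-5.1 | `mtp` |
| DeepSeek V3.2 | `deepseek_mtp` |

Eagle3 / Draft model (no native MTP):

| Model | Drafter |
|---|---|
| Gemma 4 31B | `google/gemma-4-31B-it-assistant` (draft model) |
| GPT-OSS-120B | `nvidia/gpt-oss-120b-Eagle3-v2` |
| Kimi K2.5 | `lightseekorg/kimi-k2.5-eagle3` |
| Kimi K2.6 | `lightseekorg/kimi-k2.6-eagle3` |

No spec decoding: Gemma 3 27B (no official drafter)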
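To make the matrix concrete, here is a minimal sketch of how each entry might be rendered into a `--speculative-config` JSON value. The keys `model`, `method`, and `num_speculative_tokens` and the value 3 come from the commit messages and diff in this PR. The `speculative_config_json` helper is hypothetical, and the assumption that native MTP entries carry no `model` key is not confirmed by the source. Gemma 4 omits `method`, following the review discussion about vLLM's auto-detect path.

```python
import json
from typing import Optional

# (model label, draft model or None, method or None), copied from the PR text.
# Assumption: native MTP entries set only "method"; this PR does not show them.
MATRIX = [
    ("Qwen3.5-9B", None, "qwen3_next_mtp"),
    ("Qwen3.5-27B", None, "qwen3_next_mtp"),
    ("Qwen3.5-397B", None, "mtp"),
    ("DeepSeek V3.2", None, "deepseek_mtp"),
    ("Gemma 4 31B", "google/gemma-4-31B-it-assistant", None),
    ("GPT-OSS-120B", "nvidia/gpt-oss-120b-Eagle3-v2", "eagle3"),
    ("Kimi K2.5", "lightseekorg/kimi-k2.5-eagle3", "eagle3"),
]


def speculative_config_json(
    draft_model: Optional[str], method: Optional[str], num_tokens: int = 3
) -> str:
    """Build a compact JSON value, leaving out keys that are not set."""
    config: dict = {}
    if draft_model is not None:
        config["model"] = draft_model
    if method is not None:
        config["method"] = method
    config["num_speculative_tokens"] = num_tokens
    return json.dumps(config, separators=(",", ":"))


if __name__ == "__main__":
    for label, draft_model, method in MATRIX:
        print(f"{label}: --speculative-config '{speculative_config_json(draft_model, method)}'")
```

Keeping the matrix as data makes it easier to spot a preset whose method or drafter drifts from the table when a later commit changes one of them.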
Multimodal Activation
Note on Kimi K2.5 video: the vLLM `kimi_k25` modeling code does register `VisionChunkVideo` and video processing logic, contrary to the HF model card statement that video is "experimental, only supported in our official API". Actual behavior in vLLM should be verified.

DeepSeek V3.2: TP8 → DP8+EP
Switching back to DP8+EP. The original crash issue (vllm#27259) was auto-closed without an explicit fix, but v0.20.1 includes multiple indirect fixes:
- WideEP all2all replacement
- EPLB refactor
- DSV3.2 token leakage fix

Worth retesting; if it succeeds, this recovers the ~75% performance overhead of the TP8 fallback.
Files Added
- `deepseek-ai-deepseek-v3.2-mtp-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml`
- `google-gemma-3-27b-it-nvidia-l40s-tp4.helm.yaml`
- `google-gemma-4-31b-it-nvidia-l40s-tp4.helm.yaml`
- `moonshotai-kimi-k2.5-nvidia-b300-tp8.helm.yaml`
- `moonshotai-kimi-k2.6-nvidia-b300-tp8.helm.yaml` (new model)
- `openai-gpt-oss-120b-nvidia-h100-nvl-tp4-moe-tp4.helm.yaml`
- `qwen-qwen3.5-9b-mtp-nvidia-l40s-1.helm.yaml`
- `qwen-qwen3.5-27b-mtp-nvidia-h200-sxm-1.helm.yaml`
- `qwen-qwen3.5-397b-a17b-fp8-mtp-nvidia-b300-dp4-moe-ep4.helm.yaml`
- `zai-org-glm-5-fp8-mtp-nvidia-h200-sxm-tp8-moe-tp8.helm.yaml`
- `zai-org-glm-5.1-fp8-mtp-nvidia-b300-tp8-moe-tp8.helm.yaml`

Test plan
- `helm lint deploy/helm/moai-inference-preset`
- `helm template -s templates/presets/vllm/v0.20.1/...`

🤖 Generated with Claude Code