MAF-19809: feat(preset): add vLLM v0.20.1 presets with multimodal & speculative decoding by bongwoobak · Pull Request #127 · moreh-dev/mif

bongwoobak · 2026-05-06T11:40:51Z

Summary

Add v0.20.1 IST presets for all 10 production models on aiand-rke2
Enable multimodal (image/video) where supported by the model
Apply speculative decoding (MTP / Eagle3 / draft model) per model capability
Add new Kimi K2.6 preset (released 2026-04-20)

v0.19.x → v0.20.1 Changes

Image Consolidation (3 custom images → unified `vllm/vllm-openai:v0.20.1`)

Old Image	Models	Why Removed
`cu130-nightly`	Qwen3.5-397B, Kimi K2.5	CUDA 13.0.2 is now the default wheel (SM103 supported)
`glm51-cu130`	GLM-5.1	transformers v5 baseline + tool result fix included
`gemma4`	Gemma 4 31B	transformers v5 baseline includes gemma4 arch

Workaround Removal

ISVC_PRE_PROCESS_SCRIPT (transformers runtime install) — GLM-5/5.1 (transformers v5 baseline)
--chat-template-content-format string — GLM-5.1 (vllm#39899 included)
VLLM_USE_DEEP_GEMM=0 — Qwen3.5 all sizes (vllm#38083 auto-disables on Blackwell)

Workaround Retained

VLLM_WORKER_MULTIPROC_METHOD=spawn — Gemma 4 (CUDA fork issue vllm#32611 still open)

Speculative Decoding Matrix

Native MTP (preferred):

Model	Method	num_spec
Qwen3.5-9B	`qwen3_next_mtp`	3
Qwen3.5-27B	`qwen3_next_mtp`	3
Qwen3.5-397B	`mtp` (MoE variant)	3
GLM-5 / GLM-5.1	`mtp`	3
DeepSeek V3.2	`deepseek_mtp`	3

Eagle3 / Draft model (no native MTP):

Model	Drafter	num_spec
Gemma 4 31B	`google/gemma-4-31B-it-assistant` (draft model)	3
GPT-OSS-120B	`nvidia/gpt-oss-120b-Eagle3-v2`	3
Kimi K2.5	`lightseekorg/kimi-k2.5-eagle3`	3
Kimi K2.6 (new)	`lightseekorg/kimi-k2.6-eagle3`	3

No spec decoding: Gemma 3 27B (no official drafter)

Multimodal Activation

Model	Image	Video	Audio
Qwen3.5-9B	image:2	video:0 (L40S memory)	-
Qwen3.5-27B	image:4	video:1	-
Qwen3.5-397B	image:2	video:1	-
Gemma 4 31B	image:4	-	audio:0 (31B variant has no audio encoder)
Gemma 3 27B	image:4	-	-
Kimi K2.5 / K2.6	image:4	video:1	-
GLM-5 / 5.1, DeepSeek V3.2, GPT-OSS	text only	-	-

Note: Kimi K2.5 video — vLLM kimi_k25 modeling code does register VisionChunkVideo and video processing logic, contrary to the HF model card statement that video is "experimental, only supported in our official API". Actual behavior in vLLM should be verified.

DeepSeek V3.2 — TP8 → DP8+EP

Switching back to DP8+EP. The original crash issue (vllm#27259) was auto-closed without explicit fix, but multiple indirect fixes are included in v0.20.1:

WideEP all2all replacement (vllm#33728)
Async EPLB synchronization refactor (vllm#37601)
DSV3.2 token leakage fix (vllm#40806)
DSA + MTP IMA fix (vllm#40772)

Worth retesting; if successful, recovers ~75% performance overhead from TP8 fallback.

Files Added

deepseek-ai-deepseek-v3.2-mtp-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml
google-gemma-3-27b-it-nvidia-l40s-tp4.helm.yaml
google-gemma-4-31b-it-nvidia-l40s-tp4.helm.yaml
moonshotai-kimi-k2.5-nvidia-b300-tp8.helm.yaml
moonshotai-kimi-k2.6-nvidia-b300-tp8.helm.yaml (new model)
openai-gpt-oss-120b-nvidia-h100-nvl-tp4-moe-tp4.helm.yaml
qwen-qwen3.5-9b-mtp-nvidia-l40s-1.helm.yaml
qwen-qwen3.5-27b-mtp-nvidia-h200-sxm-1.helm.yaml
qwen-qwen3.5-397b-a17b-fp8-mtp-nvidia-b300-dp4-moe-ep4.helm.yaml
zai-org-glm-5-fp8-mtp-nvidia-h200-sxm-tp8-moe-tp8.helm.yaml
zai-org-glm-5.1-fp8-mtp-nvidia-b300-tp8-moe-tp8.helm.yaml

Test plan

helm lint deploy/helm/moai-inference-preset
Render presets via helm template -s templates/presets/vllm/v0.20.1/...
Deploy each model on aiand-rke2 staging and verify:
- Tool calling works (multi-turn)
- Multimodal input works (image, video where applicable)
- Speculative decoding accepts tokens (check acceptance rate logs)
- Kimi K2.5 video_url request returns valid response (or documented error)
- DeepSeek V3.2 DP8+EP no longer crashes under load

🤖 Generated with Claude Code

…n models Create v0.20.1 IST presets for all currently deployed models on aiand-rke2. Based on official vLLM recipes (https://github.com/vllm-project/recipes). Key changes from v0.19.x: - Unified to single vllm/vllm-openai:v0.20.1 image (drops cu130-nightly, glm51-cu130, gemma4 custom images — official wheel now includes CUDA 13.0.2, transformers v5, gemma4 arch, and SM103 support) - GLM-5/5.1: removed ISVC_PRE_PROCESS_SCRIPT (transformers v5 baseline) - GLM-5.1: removed --chat-template-content-format string workaround (vllm#39899 tool result fix included) - Qwen3.5: removed VLLM_USE_DEEP_GEMM=0 (vllm#38083 auto-disables on Blackwell) - DeepSeek V3.2: switched back to dp8-moe-ep8 (multiple indirect fixes for vllm#27259 included — WideEP all2all replacement, EPLB refactor, DSV3.2 token leakage fix) - Multimodal enabled for Qwen3.5 (all sizes), Gemma 4, Gemma 3 — image input supported, video where supported by model - Kimi K2.5: video=0 explicitly set (vLLM kimi_k2 modeling code does not register video processor — video only available via Moonshot's official API) Models included: - Qwen3.5-9B (L40S tp1) - Qwen3.5-27B (H200 tp1) - Qwen3.5-397B-A17B-FP8 (B300 dp4-moe-ep4) - GLM-5-FP8 (H200 tp8 mtp) - GLM-5.1-FP8 (B300 tp8 mtp) - Gemma 4 31B (L40S tp4 multimodal) - Gemma 3 27B (L40S tp4 image-only) - DeepSeek V3.2 (H200 dp8-moe-ep8) - GPT-OSS-120B (H100-NVL tp4-moe-tp4) - Kimi K2.5 (B300 tp8 image-only) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ts + add Kimi K2.6 Add speculative decoding to all v0.20.1 presets where supported: Native MTP (preferred over Eagle3 where available): - Qwen3.5-9B: qwen3_next_mtp, num_spec=3, renamed to -mtp - Qwen3.5-27B: qwen3_next_mtp, num_spec=3, renamed to -mtp - Qwen3.5-397B: mtp (MoE variant), num_spec=3, renamed to -mtp - DeepSeek V3.2: deepseek_mtp, num_spec=3, renamed to -mtp Draft model / Eagle3 (no native MTP available): - Gemma 4 31B: google/gemma-4-31B-it-assistant draft model, num_spec=3 - GPT-OSS-120B: nvidia/gpt-oss-120b-Eagle3-v2, eagle3, num_spec=3 - Kimi K2.5: lightseekorg/kimi-k2.5-eagle3, eagle3, num_spec=3 No spec decoding (not supported): - Gemma 3 27B: no official drafter, ngram only (skipped) Other changes: - Kimi K2.5: enable video input (--limit-mm-per-prompt video=1) — vLLM kimi_k25 modeling code does support VisionChunkVideo, contrary to earlier assumption based only on HF model card. Actual behavior to be tested. - New: Kimi K2.6 (B300 tp8) IST with Eagle3 draft model Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds a new set of vLLM v0.20.1 Odin preset templates for multiple production models, enabling multimodal limits and speculative decoding where applicable, and introducing a new Kimi K2.6 preset.

Changes:

Added v0.20.1 InferenceServiceTemplate presets for Qwen3.5, GLM-5/5.1, DeepSeek V3.2, Gemma 3/4, GPT-OSS-120B, and Kimi K2.5/K2.6.
Enabled per-model speculative decoding configuration (MTP / Eagle3 drafter) and multimodal prompt limits for supported models.
Standardized presets to the unified vllm/vllm-openai:v0.20.1 image.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/zai-org-glm-5.1-fp8-mtp-nvidia-b300-tp8-moe-tp8.helm.yaml	Adds GLM-5.1 FP8 TP8 MoE preset with MTP args on B300.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/zai-org-glm-5-fp8-mtp-nvidia-h200-sxm-tp8-moe-tp8.helm.yaml	Adds GLM-5 FP8 TP8 MoE preset with MTP args on H200 SXM.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/qwen-qwen3.5-9b-mtp-nvidia-l40s-1.helm.yaml	Adds Qwen3.5-9B TP1 preset with multimodal limits and Qwen MTP config.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/qwen-qwen3.5-397b-a17b-fp8-mtp-nvidia-b300-dp4-moe-ep4.helm.yaml	Adds Qwen3.5-397B FP8 DP4+EP preset with multimodal limits and MTP.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/qwen-qwen3.5-27b-mtp-nvidia-h200-sxm-1.helm.yaml	Adds Qwen3.5-27B TP1 preset with multimodal limits and Qwen MTP config.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/openai-gpt-oss-120b-nvidia-h100-nvl-tp4-moe-tp4.helm.yaml	Adds GPT-OSS-120B TP4 MoE preset with Eagle3 drafter config.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/moonshotai-kimi-k2.6-nvidia-b300-tp8.helm.yaml	Adds new Kimi K2.6 TP8 preset with multimodal limits and Eagle3 drafter.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/moonshotai-kimi-k2.5-nvidia-b300-tp8.helm.yaml	Adds Kimi K2.5 TP8 preset with multimodal limits and Eagle3 drafter.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/google-gemma-4-31b-it-nvidia-l40s-tp4.helm.yaml	Adds Gemma 4 TP4 preset with image limits and speculative decoding config.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/google-gemma-3-27b-it-nvidia-l40s-tp4.helm.yaml	Adds Gemma 3 TP4 preset with image limits.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/deepseek-ai-deepseek-v3.2-mtp-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml	Adds DeepSeek V3.2 DP8+EP preset with DeepSeek MTP config.

- Qwen3.5-9B/27B: parallelism label "tp1" → "1" (single-device convention) - Qwen3.5-397B (DP4+EP4): spec.template → spec.workerTemplate - DeepSeek V3.2 (DP8+EP8): spec.template → spec.workerTemplate - GPT-OSS-120B: remove --no-enable-prefix-caching (user-configurable knob, per AGENTS.md:183-189) - Gemma 4: add explicit method=draft_model to --speculative-config (vLLM SpeculativeConfig auto-detects but explicit is safer)

nulledge

Checked the diff — everything is redundant with the Copilot review, except the comment about Gemma 4.

It looks incorrect — omitting method with model set falls through to vLLM's auto-detect path, so speculative decoding stays on for Gemma 4.

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.

@nulledge

Per @nulledge review: omitting `method` with `model` set falls through to vLLM's auto-detect path (speculative.py:518-522, 650-688). The explicit `method=draft_model` added in the previous commit was unnecessary and not aligned with the official Gemma 4 recipe.

Match the upstream vLLM recipe (models/moonshotai/Kimi-K2.6.yaml): - Switch Eagle3 draft from `lightseekorg/kimi-k2.6-eagle3` to `lightseekorg/kimi-k2.6-eagle3-mla` (MLA-tuned variant aligned with Kimi K2.6's MLA attention). - Add `--attention-config.use_trtllm_ragged_deepseek_prefill=True` per the recipe's hardware_overrides.blackwell entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

+                --reasoning-parser kimi_k2
+                --enable-auto-tool-choice
+                --speculative-config '{"model":"lightseekorg/kimi-k2.6-eagle3-mla","method":"eagle3","num_speculative_tokens":3}'
+                --attention-config.use_trtllm_ragged_deepseek_prefill=True


bongwoobak and others added 2 commits May 6, 2026 20:22

Copilot AI review requested due to automatic review settings May 6, 2026 11:40

bongwoobak requested a review from a team as a code owner May 6, 2026 11:40

gitgod-bot assigned bongwoobak May 6, 2026

Copilot started reviewing on behalf of bongwoobak May 6, 2026 11:41 View session

Copilot AI reviewed May 6, 2026

View reviewed changes

bongwoobak changed the title ~~NO-ISSUE: feat(preset): add vLLM v0.20.1 presets with multimodal & speculative decoding~~ MAF-19809: feat(preset): add vLLM v0.20.1 presets with multimodal & speculative decoding May 6, 2026

bongwoobak added 2 commits May 8, 2026 14:30

Merge branch 'main' into feat/v0.20.1-presets

ad4b4c9

Copilot AI review requested due to automatic review settings May 8, 2026 05:30

Copilot started reviewing on behalf of bongwoobak May 8, 2026 05:31 View session

nulledge reviewed May 8, 2026

View reviewed changes

Copilot AI reviewed May 8, 2026

View reviewed changes

bongwoobak and others added 2 commits May 11, 2026 02:27

Copilot AI review requested due to automatic review settings May 11, 2026 09:00

Copilot started reviewing on behalf of bongwoobak May 11, 2026 09:00 View session

Copilot AI reviewed May 11, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MAF-19809: feat(preset): add vLLM v0.20.1 presets with multimodal & speculative decoding#127

MAF-19809: feat(preset): add vLLM v0.20.1 presets with multimodal & speculative decoding#127
bongwoobak wants to merge 6 commits into
mainfrom
feat/v0.20.1-presets

bongwoobak commented May 6, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nulledge left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

bongwoobak commented May 6, 2026

Summary

v0.19.x → v0.20.1 Changes

Image Consolidation (3 custom images → unified vllm/vllm-openai:v0.20.1)

Workaround Removal

Workaround Retained

Speculative Decoding Matrix

Multimodal Activation

DeepSeek V3.2 — TP8 → DP8+EP

Files Added

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nulledge left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Image Consolidation (3 custom images → unified `vllm/vllm-openai:v0.20.1`)