Skip to content

MAF-19809: feat(preset): add vLLM v0.20.1 presets with multimodal & speculative decoding#127

Open
bongwoobak wants to merge 6 commits into
mainfrom
feat/v0.20.1-presets
Open

MAF-19809: feat(preset): add vLLM v0.20.1 presets with multimodal & speculative decoding#127
bongwoobak wants to merge 6 commits into
mainfrom
feat/v0.20.1-presets

Conversation

@bongwoobak
Copy link
Copy Markdown
Contributor

Summary

  • Add v0.20.1 IST presets for all 10 production models on aiand-rke2
  • Enable multimodal (image/video) where supported by the model
  • Apply speculative decoding (MTP / Eagle3 / draft model) per model capability
  • Add new Kimi K2.6 preset (released 2026-04-20)

v0.19.x → v0.20.1 Changes

Image Consolidation (3 custom images → unified vllm/vllm-openai:v0.20.1)

Old Image Models Why Removed
cu130-nightly Qwen3.5-397B, Kimi K2.5 CUDA 13.0.2 is now the default wheel (SM103 supported)
glm51-cu130 GLM-5.1 transformers v5 baseline + tool result fix included
gemma4 Gemma 4 31B transformers v5 baseline includes gemma4 arch

Workaround Removal

  • ISVC_PRE_PROCESS_SCRIPT (transformers runtime install) — GLM-5/5.1 (transformers v5 baseline)
  • --chat-template-content-format string — GLM-5.1 (vllm#39899 included)
  • VLLM_USE_DEEP_GEMM=0 — Qwen3.5 all sizes (vllm#38083 auto-disables on Blackwell)

Workaround Retained

  • VLLM_WORKER_MULTIPROC_METHOD=spawn — Gemma 4 (CUDA fork issue vllm#32611 still open)

Speculative Decoding Matrix

Native MTP (preferred):

Model Method num_spec
Qwen3.5-9B qwen3_next_mtp 3
Qwen3.5-27B qwen3_next_mtp 3
Qwen3.5-397B mtp (MoE variant) 3
GLM-5 / GLM-5.1 mtp 3
DeepSeek V3.2 deepseek_mtp 3

Eagle3 / Draft model (no native MTP):

Model Drafter num_spec
Gemma 4 31B google/gemma-4-31B-it-assistant (draft model) 3
GPT-OSS-120B nvidia/gpt-oss-120b-Eagle3-v2 3
Kimi K2.5 lightseekorg/kimi-k2.5-eagle3 3
Kimi K2.6 (new) lightseekorg/kimi-k2.6-eagle3 3

No spec decoding: Gemma 3 27B (no official drafter)

Multimodal Activation

Model Image Video Audio
Qwen3.5-9B image:2 video:0 (L40S memory) -
Qwen3.5-27B image:4 video:1 -
Qwen3.5-397B image:2 video:1 -
Gemma 4 31B image:4 - audio:0 (31B variant has no audio encoder)
Gemma 3 27B image:4 - -
Kimi K2.5 / K2.6 image:4 video:1 -
GLM-5 / 5.1, DeepSeek V3.2, GPT-OSS text only - -

Note: Kimi K2.5 video — vLLM kimi_k25 modeling code does register VisionChunkVideo and video processing logic, contrary to the HF model card statement that video is "experimental, only supported in our official API". Actual behavior in vLLM should be verified.

DeepSeek V3.2 — TP8 → DP8+EP

Switching back to DP8+EP. The original crash issue (vllm#27259) was auto-closed without explicit fix, but multiple indirect fixes are included in v0.20.1:

  • WideEP all2all replacement (vllm#33728)
  • Async EPLB synchronization refactor (vllm#37601)
  • DSV3.2 token leakage fix (vllm#40806)
  • DSA + MTP IMA fix (vllm#40772)

Worth retesting; if successful, recovers ~75% performance overhead from TP8 fallback.

Files Added

  • deepseek-ai-deepseek-v3.2-mtp-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml
  • google-gemma-3-27b-it-nvidia-l40s-tp4.helm.yaml
  • google-gemma-4-31b-it-nvidia-l40s-tp4.helm.yaml
  • moonshotai-kimi-k2.5-nvidia-b300-tp8.helm.yaml
  • moonshotai-kimi-k2.6-nvidia-b300-tp8.helm.yaml (new model)
  • openai-gpt-oss-120b-nvidia-h100-nvl-tp4-moe-tp4.helm.yaml
  • qwen-qwen3.5-9b-mtp-nvidia-l40s-1.helm.yaml
  • qwen-qwen3.5-27b-mtp-nvidia-h200-sxm-1.helm.yaml
  • qwen-qwen3.5-397b-a17b-fp8-mtp-nvidia-b300-dp4-moe-ep4.helm.yaml
  • zai-org-glm-5-fp8-mtp-nvidia-h200-sxm-tp8-moe-tp8.helm.yaml
  • zai-org-glm-5.1-fp8-mtp-nvidia-b300-tp8-moe-tp8.helm.yaml

Test plan

  • helm lint deploy/helm/moai-inference-preset
  • Render presets via helm template -s templates/presets/vllm/v0.20.1/...
  • Deploy each model on aiand-rke2 staging and verify:
    • Tool calling works (multi-turn)
    • Multimodal input works (image, video where applicable)
    • Speculative decoding accepts tokens (check acceptance rate logs)
    • Kimi K2.5 video_url request returns valid response (or documented error)
    • DeepSeek V3.2 DP8+EP no longer crashes under load

🤖 Generated with Claude Code

bongwoobak and others added 2 commits May 6, 2026 20:22
…n models

Create v0.20.1 IST presets for all currently deployed models on aiand-rke2.
Based on official vLLM recipes (https://github.com/vllm-project/recipes).

Key changes from v0.19.x:
- Unified to single vllm/vllm-openai:v0.20.1 image (drops cu130-nightly,
  glm51-cu130, gemma4 custom images — official wheel now includes CUDA 13.0.2,
  transformers v5, gemma4 arch, and SM103 support)
- GLM-5/5.1: removed ISVC_PRE_PROCESS_SCRIPT (transformers v5 baseline)
- GLM-5.1: removed --chat-template-content-format string workaround
  (vllm#39899 tool result fix included)
- Qwen3.5: removed VLLM_USE_DEEP_GEMM=0 (vllm#38083 auto-disables on Blackwell)
- DeepSeek V3.2: switched back to dp8-moe-ep8 (multiple indirect fixes for
  vllm#27259 included — WideEP all2all replacement, EPLB refactor,
  DSV3.2 token leakage fix)
- Multimodal enabled for Qwen3.5 (all sizes), Gemma 4, Gemma 3 — image input
  supported, video where supported by model
- Kimi K2.5: video=0 explicitly set (vLLM kimi_k2 modeling code does not
  register video processor — video only available via Moonshot's official API)

Models included:
- Qwen3.5-9B (L40S tp1)
- Qwen3.5-27B (H200 tp1)
- Qwen3.5-397B-A17B-FP8 (B300 dp4-moe-ep4)
- GLM-5-FP8 (H200 tp8 mtp)
- GLM-5.1-FP8 (B300 tp8 mtp)
- Gemma 4 31B (L40S tp4 multimodal)
- Gemma 3 27B (L40S tp4 image-only)
- DeepSeek V3.2 (H200 dp8-moe-ep8)
- GPT-OSS-120B (H100-NVL tp4-moe-tp4)
- Kimi K2.5 (B300 tp8 image-only)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ts + add Kimi K2.6

Add speculative decoding to all v0.20.1 presets where supported:

Native MTP (preferred over Eagle3 where available):
- Qwen3.5-9B: qwen3_next_mtp, num_spec=3, renamed to -mtp
- Qwen3.5-27B: qwen3_next_mtp, num_spec=3, renamed to -mtp
- Qwen3.5-397B: mtp (MoE variant), num_spec=3, renamed to -mtp
- DeepSeek V3.2: deepseek_mtp, num_spec=3, renamed to -mtp

Draft model / Eagle3 (no native MTP available):
- Gemma 4 31B: google/gemma-4-31B-it-assistant draft model, num_spec=3
- GPT-OSS-120B: nvidia/gpt-oss-120b-Eagle3-v2, eagle3, num_spec=3
- Kimi K2.5: lightseekorg/kimi-k2.5-eagle3, eagle3, num_spec=3

No spec decoding (not supported):
- Gemma 3 27B: no official drafter, ngram only (skipped)

Other changes:
- Kimi K2.5: enable video input (--limit-mm-per-prompt video=1) — vLLM
  kimi_k25 modeling code does support VisionChunkVideo, contrary to earlier
  assumption based only on HF model card. Actual behavior to be tested.
- New: Kimi K2.6 (B300 tp8) IST with Eagle3 draft model

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 6, 2026 11:40
@bongwoobak bongwoobak requested a review from a team as a code owner May 6, 2026 11:40
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new set of vLLM v0.20.1 Odin preset templates for multiple production models, enabling multimodal limits and speculative decoding where applicable, and introducing a new Kimi K2.6 preset.

Changes:

  • Added v0.20.1 InferenceServiceTemplate presets for Qwen3.5, GLM-5/5.1, DeepSeek V3.2, Gemma 3/4, GPT-OSS-120B, and Kimi K2.5/K2.6.
  • Enabled per-model speculative decoding configuration (MTP / Eagle3 drafter) and multimodal prompt limits for supported models.
  • Standardized presets to the unified vllm/vllm-openai:v0.20.1 image.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/zai-org-glm-5.1-fp8-mtp-nvidia-b300-tp8-moe-tp8.helm.yaml Adds GLM-5.1 FP8 TP8 MoE preset with MTP args on B300.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/zai-org-glm-5-fp8-mtp-nvidia-h200-sxm-tp8-moe-tp8.helm.yaml Adds GLM-5 FP8 TP8 MoE preset with MTP args on H200 SXM.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/qwen-qwen3.5-9b-mtp-nvidia-l40s-1.helm.yaml Adds Qwen3.5-9B TP1 preset with multimodal limits and Qwen MTP config.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/qwen-qwen3.5-397b-a17b-fp8-mtp-nvidia-b300-dp4-moe-ep4.helm.yaml Adds Qwen3.5-397B FP8 DP4+EP preset with multimodal limits and MTP.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/qwen-qwen3.5-27b-mtp-nvidia-h200-sxm-1.helm.yaml Adds Qwen3.5-27B TP1 preset with multimodal limits and Qwen MTP config.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/openai-gpt-oss-120b-nvidia-h100-nvl-tp4-moe-tp4.helm.yaml Adds GPT-OSS-120B TP4 MoE preset with Eagle3 drafter config.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/moonshotai-kimi-k2.6-nvidia-b300-tp8.helm.yaml Adds new Kimi K2.6 TP8 preset with multimodal limits and Eagle3 drafter.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/moonshotai-kimi-k2.5-nvidia-b300-tp8.helm.yaml Adds Kimi K2.5 TP8 preset with multimodal limits and Eagle3 drafter.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/google-gemma-4-31b-it-nvidia-l40s-tp4.helm.yaml Adds Gemma 4 TP4 preset with image limits and speculative decoding config.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/google-gemma-3-27b-it-nvidia-l40s-tp4.helm.yaml Adds Gemma 3 TP4 preset with image limits.
deploy/helm/moai-inference-preset/templates/presets/vllm/v0.20.1/deepseek-ai-deepseek-v3.2-mtp-nvidia-h200-sxm-dp8-moe-ep8.helm.yaml Adds DeepSeek V3.2 DP8+EP preset with DeepSeek MTP config.

@bongwoobak bongwoobak changed the title NO-ISSUE: feat(preset): add vLLM v0.20.1 presets with multimodal & speculative decoding MAF-19809: feat(preset): add vLLM v0.20.1 presets with multimodal & speculative decoding May 6, 2026
bongwoobak added 2 commits May 8, 2026 14:30
- Qwen3.5-9B/27B: parallelism label "tp1" → "1" (single-device convention)
- Qwen3.5-397B (DP4+EP4): spec.template → spec.workerTemplate
- DeepSeek V3.2 (DP8+EP8): spec.template → spec.workerTemplate
- GPT-OSS-120B: remove --no-enable-prefix-caching (user-configurable knob,
  per AGENTS.md:183-189)
- Gemma 4: add explicit method=draft_model to --speculative-config
  (vLLM SpeculativeConfig auto-detects but explicit is safer)
Copilot AI review requested due to automatic review settings May 8, 2026 05:30
Copy link
Copy Markdown
Contributor

@nulledge nulledge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked the diff — everything is redundant with the Copilot review, except the comment about Gemma 4.

It looks incorrect — omitting method with model set falls through to vLLM's auto-detect path, so speculative decoding stays on for Gemma 4.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.

bongwoobak and others added 2 commits May 11, 2026 02:27
Per @nulledge review: omitting `method` with `model` set falls through
to vLLM's auto-detect path (speculative.py:518-522, 650-688). The
explicit `method=draft_model` added in the previous commit was
unnecessary and not aligned with the official Gemma 4 recipe.
Match the upstream vLLM recipe (models/moonshotai/Kimi-K2.6.yaml):

- Switch Eagle3 draft from `lightseekorg/kimi-k2.6-eagle3` to
  `lightseekorg/kimi-k2.6-eagle3-mla` (MLA-tuned variant aligned with
  Kimi K2.6's MLA attention).
- Add `--attention-config.use_trtllm_ragged_deepseek_prefill=True` per
  the recipe's hardware_overrides.blackwell entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 11, 2026 09:00
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Comment on lines +39 to +42
--reasoning-parser kimi_k2
--enable-auto-tool-choice
--speculative-config '{"model":"lightseekorg/kimi-k2.6-eagle3-mla","method":"eagle3","num_speculative_tokens":3}'
--attention-config.use_trtllm_ragged_deepseek_prefill=True
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants