NO-ISSUE: feat(preset): refresh quickstart presets and add Ministral-3-8B #129
Merged
Drop deepseek-r1-distill-llama-8b, ibm-granite-3.3-8b-instruct, qwen2-0.5b-instruct, and qwen2.5-1.5b-instruct preset templates that are no longer maintained.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…3-8B

Refresh quickstart preset templates across maintained models:
- Bump moreh-vllm image to v0.19.1.1
- Enable prefix caching by default
- Add reasoning and tool-call parsers for gemma-4-31b-it (gemma4)
- Add MTP speculative decoding to qwen3.6-27b
- Add PD-disaggregated prefill/decode templates for qwen3.6-27b
- Replace mistral-7b-instruct-v0.3 with Ministral-3-8B-Reasoning-2512, including mistral tokenizer/config/load format and reasoning parser

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Pull request overview
Refreshes the Helm “quickstart” preset catalog for vLLM by removing unmaintained model presets, bumping the moreh-vllm image across maintained presets, and updating runtime flags for newer model/tooling + PD disaggregation (prefill/decode) workflows.
Changes:
- Removed several unmaintained quickstart presets (DeepSeek R1 distill, IBM Granite, Qwen2 0.5B, Qwen2.5 1.5B).
- Bumped `moreh-vllm` images to `v0.19.1.1` and enabled `--enable-prefix-caching` broadly; added model-specific tool/reasoning parser flags.
- Added/updated Qwen3.6-27B PD prefill/decode presets (including Nixl KV transfer) and replaced Mistral-7B preset with Ministral-3-8B-Reasoning-2512.
Reviewed changes
Copilot reviewed 71 out of 71 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3.6-27b-prefill-amd-mi300x-tp2.helm.yaml | New/updated Qwen3.6-27B prefill preset; image bump + prefix caching + PD args. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3.6-27b-prefill-amd-mi250-tp2.helm.yaml | New/updated Qwen3.6-27B prefill preset for MI250. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3.6-27b-decode-amd-mi300x-tp2.helm.yaml | New/updated Qwen3.6-27B decode preset; KV consumer + PD args. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3.6-27b-decode-amd-mi250-tp2.helm.yaml | New/updated Qwen3.6-27B decode preset for MI250. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3.6-27b-amd-mi300x-tp2.helm.yaml | Qwen3.6-27B e2e preset; image bump + prefix caching + speculative config. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3.6-27b-amd-mi250-tp2.helm.yaml | Qwen3.6-27B e2e preset for MI250; image bump + new flags. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3-vl-8b-instruct-prefill-amd-mi300x-tp2.helm.yaml | Image bump + enable prefix caching (prefill). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3-vl-8b-instruct-prefill-amd-mi250-tp2.helm.yaml | Image bump + enable prefix caching (prefill, MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3-vl-8b-instruct-decode-amd-mi300x-tp2.helm.yaml | Image bump + enable prefix caching (decode). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3-vl-8b-instruct-decode-amd-mi250-tp2.helm.yaml | Image bump + enable prefix caching (decode, MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3-vl-8b-instruct-amd-mi300x-tp2.helm.yaml | Image bump + enable prefix caching (e2e). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3-vl-8b-instruct-amd-mi250-tp2.helm.yaml | Image bump + enable prefix caching (e2e, MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3-32b-prefill-amd-mi300x-tp2.helm.yaml | Image bump + enable prefix caching (prefill). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3-32b-prefill-amd-mi250-tp2.helm.yaml | Image bump + enable prefix caching (prefill, MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3-32b-decode-amd-mi300x-tp2.helm.yaml | Image bump + enable prefix caching (decode). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3-32b-decode-amd-mi250-tp2.helm.yaml | Image bump + enable prefix caching (decode, MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3-32b-amd-mi300x-tp2.helm.yaml | Image bump + enable prefix caching (e2e). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3-32b-amd-mi250-tp2.helm.yaml | Image bump + enable prefix caching (e2e, MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3-1.7b-prefill-amd-mi300x-tp2.helm.yaml | Image bump + enable prefix caching (prefill). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3-1.7b-prefill-amd-mi250-tp2.helm.yaml | Image bump + enable prefix caching (prefill, MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3-1.7b-decode-amd-mi300x-tp2.helm.yaml | Image bump + enable prefix caching (decode). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3-1.7b-decode-amd-mi250-tp2.helm.yaml | Image bump + enable prefix caching (decode, MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3-1.7b-amd-mi300x-tp2.helm.yaml | Image bump + enable prefix caching (e2e). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3-1.7b-amd-mi250-tp2.helm.yaml | Image bump + enable prefix caching (e2e, MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen2.5-1.5b-instruct-amd-mi300x-tp2.helm.yaml | Removed unmaintained Qwen2.5-1.5B preset. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen2.5-1.5b-instruct-amd-mi250-tp2.helm.yaml | Removed unmaintained Qwen2.5-1.5B preset (MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen2-0.5b-instruct-prefill-amd-mi300x-tp2.helm.yaml | Removed unmaintained Qwen2-0.5B prefill preset. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen2-0.5b-instruct-prefill-amd-mi250-tp2.helm.yaml | Removed unmaintained Qwen2-0.5B prefill preset (MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen2-0.5b-instruct-decode-amd-mi300x-tp2.helm.yaml | Removed unmaintained Qwen2-0.5B decode preset. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen2-0.5b-instruct-decode-amd-mi250-tp2.helm.yaml | Removed unmaintained Qwen2-0.5B decode preset (MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen2-0.5b-instruct-amd-mi300x-tp2.helm.yaml | Removed unmaintained Qwen2-0.5B e2e preset. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen2-0.5b-instruct-amd-mi250-tp2.helm.yaml | Removed unmaintained Qwen2-0.5B e2e preset (MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-openai-gpt-oss-20b-prefill-amd-mi300x-tp2.helm.yaml | Image bump + enable prefix caching for GPT-OSS-20B prefill. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-openai-gpt-oss-20b-prefill-amd-mi250-tp2.helm.yaml | Image bump + enable prefix caching for GPT-OSS-20B prefill (MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-openai-gpt-oss-20b-decode-amd-mi300x-tp2.helm.yaml | Image bump + enable prefix caching for GPT-OSS-20B decode. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-openai-gpt-oss-20b-decode-amd-mi250-tp2.helm.yaml | Image bump + enable prefix caching for GPT-OSS-20B decode (MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-openai-gpt-oss-20b-amd-mi300x-tp2.helm.yaml | Image bump + enable prefix caching for GPT-OSS-20B e2e. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-openai-gpt-oss-20b-amd-mi250-tp2.helm.yaml | Image bump + enable prefix caching for GPT-OSS-20B e2e (MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-mistralai-ministral-3-8b-reasoning-2512-prefill-amd-mi300x-tp2.helm.yaml | Replaces Mistral-7B preset with Ministral-3-8B; adds Mistral-format args. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-mistralai-ministral-3-8b-reasoning-2512-prefill-amd-mi250-tp2.helm.yaml | Ministral prefill preset for MI250. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-mistralai-ministral-3-8b-reasoning-2512-decode-amd-mi300x-tp2.helm.yaml | Ministral decode preset; adds Mistral-format args + KV consumer. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-mistralai-ministral-3-8b-reasoning-2512-decode-amd-mi250-tp2.helm.yaml | Ministral decode preset for MI250. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-mistralai-ministral-3-8b-reasoning-2512-amd-mi300x-tp2.helm.yaml | Ministral e2e preset; adds Mistral-format args. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-mistralai-ministral-3-8b-reasoning-2512-amd-mi250-tp2.helm.yaml | Ministral e2e preset for MI250. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-microsoft-phi-mini-moe-instruct-prefill-amd-mi250-dp2-moe-tp2.helm.yaml | Image bump + enable prefix caching for Phi Mini MoE prefill. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-microsoft-phi-mini-moe-instruct-decode-amd-mi250-dp2-moe-tp2.helm.yaml | Image bump + enable prefix caching for Phi Mini MoE decode. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-microsoft-phi-mini-moe-instruct-amd-mi250-dp2-moe-tp2.helm.yaml | Image bump + enable prefix caching for Phi Mini MoE e2e. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-meta-llama-llama-3.2-1b-instruct-prefill-amd-mi300x-tp2.helm.yaml | Image bump + enable prefix caching for Llama 3.2 1B prefill. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-meta-llama-llama-3.2-1b-instruct-prefill-amd-mi250-tp2.helm.yaml | Image bump + enable prefix caching for Llama 3.2 1B prefill (MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-meta-llama-llama-3.2-1b-instruct-decode-amd-mi300x-tp2.helm.yaml | Image bump + enable prefix caching for Llama 3.2 1B decode. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-meta-llama-llama-3.2-1b-instruct-decode-amd-mi250-tp2.helm.yaml | Image bump + enable prefix caching for Llama 3.2 1B decode (MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-meta-llama-llama-3.2-1b-instruct-amd-mi300x-tp2.helm.yaml | Image bump + enable prefix caching for Llama 3.2 1B e2e. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-meta-llama-llama-3.2-1b-instruct-amd-mi250-tp2.helm.yaml | Image bump + enable prefix caching for Llama 3.2 1B e2e (MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-ibm-granite-granite-3.3-8b-instruct-prefill-amd-mi300x-tp2.helm.yaml | Removed unmaintained IBM Granite prefill preset. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-ibm-granite-granite-3.3-8b-instruct-prefill-amd-mi250-tp2.helm.yaml | Removed unmaintained IBM Granite prefill preset (MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-ibm-granite-granite-3.3-8b-instruct-decode-amd-mi300x-tp2.helm.yaml | Removed unmaintained IBM Granite decode preset. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-ibm-granite-granite-3.3-8b-instruct-decode-amd-mi250-tp2.helm.yaml | Removed unmaintained IBM Granite decode preset (MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-ibm-granite-granite-3.3-8b-instruct-amd-mi300x-tp2.helm.yaml | Removed unmaintained IBM Granite e2e preset. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-ibm-granite-granite-3.3-8b-instruct-amd-mi250-tp2.helm.yaml | Removed unmaintained IBM Granite e2e preset (MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-google-gemma-4-31b-it-prefill-amd-mi300x-tp2.helm.yaml | Image bump + prefix caching + Gemma4 tool/reasoning parser flags (prefill). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-google-gemma-4-31b-it-prefill-amd-mi250-tp2.helm.yaml | Same Gemma4 updates for MI250 (prefill). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-google-gemma-4-31b-it-decode-amd-mi300x-tp2.helm.yaml | Same Gemma4 updates (decode). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-google-gemma-4-31b-it-decode-amd-mi250-tp2.helm.yaml | Same Gemma4 updates for MI250 (decode). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-google-gemma-4-31b-it-amd-mi300x-tp2.helm.yaml | Same Gemma4 updates (e2e). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-google-gemma-4-31b-it-amd-mi250-tp2.helm.yaml | Same Gemma4 updates for MI250 (e2e). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-deepseek-ai-deepseek-r1-distill-llama-8b-prefill-amd-mi300x-tp2.helm.yaml | Removed unmaintained DeepSeek R1 distill prefill preset. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-deepseek-ai-deepseek-r1-distill-llama-8b-prefill-amd-mi250-tp2.helm.yaml | Removed unmaintained DeepSeek R1 distill prefill preset (MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-deepseek-ai-deepseek-r1-distill-llama-8b-decode-amd-mi300x-tp2.helm.yaml | Removed unmaintained DeepSeek R1 distill decode preset. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-deepseek-ai-deepseek-r1-distill-llama-8b-decode-amd-mi250-tp2.helm.yaml | Removed unmaintained DeepSeek R1 distill decode preset (MI250). |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-deepseek-ai-deepseek-r1-distill-llama-8b-amd-mi300x-tp2.helm.yaml | Removed unmaintained DeepSeek R1 distill e2e preset. |
| deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-deepseek-ai-deepseek-r1-distill-llama-8b-amd-mi250-tp2.helm.yaml | Removed unmaintained DeepSeek R1 distill e2e preset (MI250). |
Comments suppressed due to low confidence (10)
deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-mistralai-ministral-3-8b-reasoning-2512-prefill-amd-mi300x-tp2.helm.yaml:34
`ISVC_EXTRA_ARGS` uses `--tokenizer_mode` / `--config_format` / `--load_format` with underscores. Existing vLLM presets in this repo use hyphenated flags (e.g., `--tokenizer-mode`), and argparse typically will not accept the underscore variants, which would prevent the container from starting. Please switch these flags to the hyphenated form (and verify the exact option names supported by the `moreh-vllm:v0.19.1.1` entrypoint).
deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-mistralai-ministral-3-8b-reasoning-2512-prefill-amd-mi250-tp2.helm.yaml:34
`ISVC_EXTRA_ARGS` uses `--tokenizer_mode` / `--config_format` / `--load_format` with underscores. Existing vLLM presets in this repo use hyphenated flags (e.g., `--tokenizer-mode`), and argparse typically will not accept the underscore variants, which would prevent the container from starting. Please switch these flags to the hyphenated form (and verify the exact option names supported by the `moreh-vllm:v0.19.1.1` entrypoint).
deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-mistralai-ministral-3-8b-reasoning-2512-decode-amd-mi300x-tp2.helm.yaml:34
`ISVC_EXTRA_ARGS` uses `--tokenizer_mode` / `--config_format` / `--load_format` with underscores. Existing vLLM presets in this repo use hyphenated flags (e.g., `--tokenizer-mode`), and argparse typically will not accept the underscore variants, which would prevent the container from starting. Please switch these flags to the hyphenated form (and verify the exact option names supported by the `moreh-vllm:v0.19.1.1` entrypoint).
deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-mistralai-ministral-3-8b-reasoning-2512-decode-amd-mi250-tp2.helm.yaml:34
`ISVC_EXTRA_ARGS` uses `--tokenizer_mode` / `--config_format` / `--load_format` with underscores. Existing vLLM presets in this repo use hyphenated flags (e.g., `--tokenizer-mode`), and argparse typically will not accept the underscore variants, which would prevent the container from starting. Please switch these flags to the hyphenated form (and verify the exact option names supported by the `moreh-vllm:v0.19.1.1` entrypoint).
deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-mistralai-ministral-3-8b-reasoning-2512-amd-mi300x-tp2.helm.yaml:34
`ISVC_EXTRA_ARGS` uses `--tokenizer_mode` / `--config_format` / `--load_format` with underscores. Existing vLLM presets in this repo use hyphenated flags (e.g., `--tokenizer-mode`), and argparse typically will not accept the underscore variants, which would prevent the container from starting. Please switch these flags to the hyphenated form (and verify the exact option names supported by the `moreh-vllm:v0.19.1.1` entrypoint).
deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-mistralai-ministral-3-8b-reasoning-2512-amd-mi250-tp2.helm.yaml:34
`ISVC_EXTRA_ARGS` uses `--tokenizer_mode` / `--config_format` / `--load_format` with underscores. Existing vLLM presets in this repo use hyphenated flags (e.g., `--tokenizer-mode`), and argparse typically will not accept the underscore variants, which would prevent the container from starting. Please switch these flags to the hyphenated form (and verify the exact option names supported by the `moreh-vllm:v0.19.1.1` entrypoint).
deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3.6-27b-prefill-amd-mi300x-tp2.helm.yaml:33
`--trust-remote-code` allows execution of arbitrary Python from the model repository at load time. If this is required for Qwen3.6, please pin the model code to a specific, vetted revision and ensure the runtime environment is appropriately locked down (e.g., restricted egress / hardened pod security), otherwise remove this flag.
deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3.6-27b-prefill-amd-mi250-tp2.helm.yaml:33
`--trust-remote-code` allows execution of arbitrary Python from the model repository at load time. If this is required for Qwen3.6, please pin the model code to a specific, vetted revision and ensure the runtime environment is appropriately locked down (e.g., restricted egress / hardened pod security), otherwise remove this flag.
deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3.6-27b-decode-amd-mi300x-tp2.helm.yaml:33
`--trust-remote-code` allows execution of arbitrary Python from the model repository at load time. If this is required for Qwen3.6, please pin the model code to a specific, vetted revision and ensure the runtime environment is appropriately locked down (e.g., restricted egress / hardened pod security), otherwise remove this flag.
deploy/helm/moai-inference-preset/templates/presets/quickstart/quickstart-vllm-qwen-qwen3.6-27b-decode-amd-mi250-tp2.helm.yaml:33
`--trust-remote-code` allows execution of arbitrary Python from the model repository at load time. If this is required for Qwen3.6, please pin the model code to a specific, vetted revision and ensure the runtime environment is appropriately locked down (e.g., restricted egress / hardened pod security), otherwise remove this flag.
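
The two groups of suppressed comments above could be checked mechanically before merge. The following is only a sketch: `lint_extra_args` and `plain_argparse_rejects` are hypothetical helper names, the argument string is assumed to be split with POSIX shell rules, and `--revision` is assumed to be the flag used to pin model code. It demonstrates stock `argparse` behavior only; whether the `moreh-vllm:v0.19.1.1` entrypoint normalizes underscores itself is not something this sketch can confirm.

```python
import argparse
import shlex


def lint_extra_args(extra_args: str) -> list[str]:
    """Return findings for a space-separated flag string (hypothetical helper)."""
    tokens = shlex.split(extra_args)
    # Flag names only, with any "=value" suffix removed.
    names = [t.split("=", 1)[0] for t in tokens if t.startswith("--")]
    findings = []
    for name in names:
        if "_" in name:
            findings.append(f"{name}: consider {name.replace('_', '-')}")
    # Assumption: --revision is how a vetted model revision would be pinned.
    if "--trust-remote-code" in names and "--revision" not in names:
        findings.append("--trust-remote-code without a pinned --revision")
    return findings


def plain_argparse_rejects(flag: str) -> bool:
    """True if a stock argparse parser with only hyphenated options
    leaves `flag` unrecognized."""
    parser = argparse.ArgumentParser(add_help=False)
    for opt in ("--tokenizer-mode", "--config-format", "--load-format"):
        parser.add_argument(opt)
    _, unknown = parser.parse_known_args([flag, "mistral"])
    return flag in unknown


if __name__ == "__main__":
    args = ("--tokenizer_mode mistral --config_format mistral "
            "--load_format mistral --tool-call-parser mistral")
    for finding in lint_extra_args(args):
        print(finding)
    print(plain_argparse_rejects("--tokenizer_mode"))
```

A check like this only flags spelling and pinning; the authoritative answer for each flag is still the option list of the image's own entrypoint.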
Summary
- Drop unmaintained preset templates (`deepseek-r1-distill-llama-8b`, `ibm-granite-3.3-8b-instruct`, `qwen2-0.5b-instruct`, `qwen2.5-1.5b-instruct`).
- Bump the `moreh-vllm` image to `v0.19.1.1` (from `v0.17.1.1` / `v0.19.1.0`) across the maintained quickstart presets.
- Enable prefix caching by default with `--enable-prefix-caching`.
- gemma-4-31b-it: add `--enable-auto-tool-choice`, `--reasoning-parser gemma4`, `--tool-call-parser gemma4`.
- qwen3.6-27b: add MTP speculative decoding via `--speculative-config '{"method":"mtp","num_speculative_tokens":3}'` and new prefill/decode templates for PD disaggregation.
- Replace `mistralai/Mistral-7B-Instruct-v0.3` with `mistralai/Ministral-3-8B-Reasoning-2512`, configured with `--tokenizer_mode mistral`, `--config_format mistral`, `--load_format mistral`, `--enable-auto-tool-choice`, `--tool-call-parser mistral`, and `--reasoning-parser mistral`.

Test Plan
- `helm lint deploy/helm/moai-inference-preset` passes.
- `helm template deploy/helm/moai-inference-preset` renders the new templates including the `--speculative-config` and `--kv-transfer-config` JSON arguments without escaping issues.
- … `NixlConnector`.
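
The "without escaping issues" check in the test plan can be approximated offline. This is a minimal sketch, assuming the rendered argument string is eventually split with POSIX shell rules (how the container actually tokenizes it is not stated in this PR); `json_flag_value` is a hypothetical helper, and the sample string reuses the `--speculative-config` value from the summary.

```python
import json
import shlex


def json_flag_value(extra_args: str, flag: str) -> dict:
    """Split an argument string shell-style and parse the JSON value of `flag`.

    Raises KeyError if the flag is absent and json.JSONDecodeError if the
    value did not survive quoting as valid JSON.
    """
    tokens = shlex.split(extra_args)
    for i, tok in enumerate(tokens):
        if tok == flag and i + 1 < len(tokens):
            return json.loads(tokens[i + 1])
        if tok.startswith(flag + "="):
            return json.loads(tok.split("=", 1)[1])
    raise KeyError(flag)


# Sample string; the flag and JSON payload are taken from the PR summary.
rendered = (
    "--enable-prefix-caching "
    """--speculative-config '{"method":"mtp","num_speculative_tokens":3}'"""
)
config = json_flag_value(rendered, "--speculative-config")
print(config["method"], config["num_speculative_tokens"])
```

If the single quotes were lost during templating, the shell-style split would strip the inner double quotes and `json.loads` would raise, which is the failure this check is meant to surface. The same helper could be pointed at `--kv-transfer-config` in the prefill/decode templates.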