chore: sync to upstream 985961345a13f3e3bb15d29c94b011ba9a6b858b#1666

Merged
AlpinDale merged 1 commit into main from sync/vllm-0c99629
May 2, 2026

Conversation

@AlpinDale
Collaborator

No description provided.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds support for several new models, including CohereMoe, Laguna, MiMoV2Omni, and Moondream3, along with their respective reasoning and tool parsers. Key performance improvements include multi-stream GEMM overlap for DeepSeek-V4 and the introduction of ROCm AITER fusion passes. Additionally, it enables prompt_embeds in chat completions and introduces a system fingerprinting feature. Feedback highlights two significant concerns: the removal of early quant_dtype validation in all-to-all utilities, which may cause inefficient resource allocation before an error occurs, and the removal of router_logits_dtype validation in the NVFP4 MoE implementation, potentially leading to silent failures or incorrect computations when incompatible types are used.

Comment on lines 223 to +230
max_num_tokens = get_current_aphrodite_config().scheduler_config.max_num_batched_tokens
if quant_config.quant_dtype is None:
    dispatch_dtype_bytes_per_elem = 2
    dispatch_scale_bytes_per_token = 0
elif quant_config.quant_dtype == "nvfp4":
    dispatch_dtype_bytes_per_elem = 0
    dispatch_scale_bytes_per_token = moe.hidden_dim // 16
elif quant_config.quant_dtype == "mxfp8":
Contributor

high

The previous check `if quant_config.quant_dtype != "nvfp4":` was removed, so an unsupported `quant_dtype` passed to `FlashInferNVLinkOneSidedPrepareAndFinalize` is no longer rejected up front and could lead to runtime errors. A `NotImplementedError` is now raised later, but it is better to validate early and avoid unnecessary setup before the failure.
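A minimal sketch of the early validation this comment asks for, not the project's actual code. The helper name `validate_quant_dtype` and the `SUPPORTED_QUANT_DTYPES` tuple are hypothetical; the supported values are assumed from the branches visible in the snippet above, which is truncated and may list more.

```python
# Assumption: these are the only dtypes the branches above handle.
# The quoted snippet is cut off after "mxfp8", so the real list may differ.
SUPPORTED_QUANT_DTYPES = (None, "nvfp4", "mxfp8")


def validate_quant_dtype(quant_dtype):
    """Fail fast, before any buffer sizes are computed or workspaces set up.

    Hypothetical helper; intended to be called at the top of the
    constructor so an unsupported dtype never reaches the setup code.
    """
    if quant_dtype not in SUPPORTED_QUANT_DTYPES:
        raise NotImplementedError(
            f"Unsupported quant_dtype {quant_dtype!r}; "
            f"expected one of {SUPPORTED_QUANT_DTYPES}"
        )
```

Calling such a check first keeps the later `NotImplementedError` as a backstop while making sure no setup work is done for a configuration that cannot run.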

Comment on lines 259 to 263
def _supports_router_logits_dtype(
    router_logits_dtype: torch.dtype | None,
    routing_method: RoutingMethodType,
) -> bool:
    """
    The FlashInfer TRTLLM NvFp4 kernel expects bfloat16 router_logits by default.
    DeepSeekV3 routing supports float32 router_logits (converted internally).
    Simulated routing generates synthetic decisions and is agnostic to dtype.
    """
    if router_logits_dtype == torch.float32:
        # DeepSeekV3 routing handles float32 logits internally.
        # Simulated routing generates synthetic decisions, so the
        # kernel doesn't care about the actual logits dtype.
        # https://github.com/flashinfer-ai/flashinfer/issues/2469
        return routing_method in (
            RoutingMethodType.DeepSeekV3,
            RoutingMethodType.Simulated,
        )
    return True
Contributor

high

The _supports_router_logits_dtype method now unconditionally returns True, removing the validation that previously ensured router_logits_dtype was compatible with the kernel. This could lead to silent failures or incorrect computations if incompatible dtypes are used.
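To make the concern concrete, here is a small sketch of the behavior the quoted validating version encodes, written without `torch` or FlashInfer so it runs standalone. `RoutingMethod` is a hypothetical stand-in for `RoutingMethodType` (its `Other` member is invented for illustration), and dtypes are represented by name strings as an assumption.

```python
from enum import Enum, auto


class RoutingMethod(Enum):
    """Hypothetical stand-in for RoutingMethodType; `Other` is illustrative."""
    DeepSeekV3 = auto()
    Simulated = auto()
    Other = auto()


def supports_router_logits_dtype(dtype_name, routing_method):
    """Mirror of the quoted check, with dtype names instead of torch dtypes."""
    if dtype_name == "float32":
        # float32 logits are only accepted for routing methods that
        # either convert them internally or ignore them entirely.
        return routing_method in (RoutingMethod.DeepSeekV3, RoutingMethod.Simulated)
    return True


# float32 is rejected for any other routing method; an unconditional
# `return True` would let this combination through unchecked.
print(supports_router_logits_dtype("float32", RoutingMethod.Other))
print(supports_router_logits_dtype("float32", RoutingMethod.DeepSeekV3))
print(supports_router_logits_dtype("bfloat16", RoutingMethod.Other))
```

A regression test along these lines would flag the float32 case if the check were ever reduced to an unconditional `True`.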

@AlpinDale AlpinDale merged commit 18f852d into main May 2, 2026
1 check failed
@AlpinDale AlpinDale deleted the sync/vllm-0c99629 branch May 2, 2026 04:46