chore: sync to upstream 985961345a13f3e3bb15d29c94b011ba9a6b858b #1666
Code Review
This pull request adds support for several new models, including CohereMoe, Laguna, MiMoV2Omni, and Moondream3, along with their respective reasoning and tool parsers. Key performance improvements include multi-stream GEMM overlap for DeepSeek-V4 and the introduction of ROCm AITER fusion passes. Additionally, it enables prompt_embeds in chat completions and introduces a system fingerprinting feature. Feedback highlights two significant concerns: the removal of early quant_dtype validation in all-to-all utilities, which may cause inefficient resource allocation before an error occurs, and the removal of router_logits_dtype validation in the NVFP4 MoE implementation, potentially leading to silent failures or incorrect computations when incompatible types are used.
    max_num_tokens = get_current_aphrodite_config().scheduler_config.max_num_batched_tokens
    if quant_config.quant_dtype is None:
        dispatch_dtype_bytes_per_elem = 2
        dispatch_scale_bytes_per_token = 0
    elif quant_config.quant_dtype == "nvfp4":
        dispatch_dtype_bytes_per_elem = 0
        dispatch_scale_bytes_per_token = moe.hidden_dim // 16
    elif quant_config.quant_dtype == "mxfp8":
The previous check `if quant_config.quant_dtype != "nvfp4":` was removed, which could lead to runtime errors if an unsupported `quant_dtype` is passed to `FlashInferNVLinkOneSidedPrepareAndFinalize`. A `NotImplementedError` is raised later, but only after setup work has already run; validating the value early avoids that unnecessary setup.
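A minimal sketch of what early validation could look like. The helper name `validate_quant_dtype` and the `SUPPORTED_QUANT_DTYPES` tuple are hypothetical, and the supported values are inferred only from the branches visible in the snippet above, so the real set may differ.

```python
# Hypothetical: values taken from the branches visible in the diff snippet.
SUPPORTED_QUANT_DTYPES = (None, "nvfp4", "mxfp8")


def validate_quant_dtype(quant_dtype):
    """Reject an unsupported quant_dtype before any buffers are sized.

    Hypothetical helper, not part of the PR. It is meant to be called at
    the top of the constructor so that the error surfaces before any
    workspace or communication setup is attempted.
    """
    if quant_dtype not in SUPPORTED_QUANT_DTYPES:
        raise NotImplementedError(
            f"Unsupported quant_dtype {quant_dtype!r}; "
            f"expected one of {SUPPORTED_QUANT_DTYPES}"
        )
```

Calling such a helper as the first statement of the constructor would keep the later per-dtype branches unchanged while moving the failure ahead of the setup.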
    def _supports_router_logits_dtype(
        router_logits_dtype: torch.dtype | None,
        routing_method: RoutingMethodType,
    ) -> bool:
        """
        The FlashInfer TRTLLM NvFp4 kernel expects bfloat16 router_logits by default.
        DeepSeekV3 routing supports float32 router_logits (converted internally).
        Simulated routing generates synthetic decisions and is agnostic to dtype.
        """
        if router_logits_dtype == torch.float32:
            # DeepSeekV3 routing handles float32 logits internally.
            # Simulated routing generates synthetic decisions, so the
            # kernel doesn't care about the actual logits dtype.
            # https://github.com/flashinfer-ai/flashinfer/issues/2469
            return routing_method in (
                RoutingMethodType.DeepSeekV3,
                RoutingMethodType.Simulated,
            )
        return True
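To illustrate what the removed check guarded against, here is a self-contained sketch of the same decision logic. It deliberately avoids importing `torch` or FlashInfer: `RoutingMethod` is a hypothetical stand-in for `RoutingMethodType`, dtypes are represented as plain strings, and `check_router_logits_dtype` is an assumed caller, not code from the PR.

```python
from enum import Enum, auto


class RoutingMethod(Enum):
    """Hypothetical stand-in for RoutingMethodType."""
    DEEPSEEK_V3 = auto()
    SIMULATED = auto()
    OTHER = auto()  # any routing method not named in the snippet


def supports_router_logits_dtype(dtype_name, routing_method):
    """Mirror the removed check, with dtypes given as strings."""
    if dtype_name == "float32":
        # Only these two routing methods accept float32 logits.
        return routing_method in (
            RoutingMethod.DEEPSEEK_V3,
            RoutingMethod.SIMULATED,
        )
    return True


def check_router_logits_dtype(dtype_name, routing_method):
    """Assumed caller: fail loudly instead of computing on a bad dtype."""
    if not supports_router_logits_dtype(dtype_name, routing_method):
        raise ValueError(
            f"router_logits dtype {dtype_name} is not supported "
            f"with routing method {routing_method.name}"
        )
```

With a guard like this in place, float32 logits combined with any other routing method are rejected up front rather than reaching the kernel.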