[kernel][moe] better splitK for fused moe by AlpinDale · Pull Request #1603 · dphnAI/aphrodite-engine

AlpinDale · 2025-11-05T20:20:24Z

Minor improvement, in the range of ~0.2-0.4% depending on the metric.

2x RTX 6000 Ada, Qwen3-30B-A3B-FP8, 512 tokens I/O

Main:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  148.28    
Total input tokens:                      512000    
Total generated tokens:                  508351    
Request throughput (req/s):              6.74      
Output token throughput (tok/s):         3428.32   
Peak output token throughput (tok/s):    5632.00   
Peak concurrent requests:                1000.00   
Total Token throughput (tok/s):          6881.25   
---------------Time to First Token----------------
Mean TTFT (ms):                          61118.30  
Median TTFT (ms):                        49902.46  
P99 TTFT (ms):                           125346.87 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.72     
Median TPOT (ms):                        72.62     
P99 TPOT (ms):                           74.73     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.56     
Median ITL (ms):                         48.74     
P99 ITL (ms):                            259.87    
==================================================

PR:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  148.00    
Total input tokens:                      512000    
Total generated tokens:                  508677    
Request throughput (req/s):              6.76      
Output token throughput (tok/s):         3436.98   
Peak output token throughput (tok/s):    5600.00   
Peak concurrent requests:                1000.00   
Total Token throughput (tok/s):          6896.42   
---------------Time to First Token----------------
Mean TTFT (ms):                          60915.28  
Median TTFT (ms):                        49695.81  
P99 TTFT (ms):                           125183.94 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.55     
Median TPOT (ms):                        72.54     
P99 TPOT (ms):                           74.78     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.40     
Median ITL (ms):                         48.71     
P99 ITL (ms):                            256.49    
==================================================
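
For reference, the relative change between the two runs can be worked out directly from the figures above. The snippet below is only a small sketch: the dictionary keys and the `pct_change` helper are illustrative names, not part of the benchmark tool.

```python
# Figures copied from the Main and PR benchmark results above.
main = {
    "output_tok_s": 3428.32,
    "total_tok_s": 6881.25,
    "mean_ttft_ms": 61118.30,
    "median_ttft_ms": 49902.46,
    "mean_tpot_ms": 68.72,
}
pr = {
    "output_tok_s": 3436.98,
    "total_tok_s": 6896.42,
    "mean_ttft_ms": 60915.28,
    "median_ttft_ms": 49695.81,
    "mean_tpot_ms": 68.55,
}


def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


deltas = {name: pct_change(main[name], pr[name]) for name in main}
for name, delta in deltas.items():
    print(f"{name:>16}: {delta:+.2f}%")
```

Throughput moves up by roughly a quarter of a percent, and the TTFT figures drop by roughly 0.3 to 0.4%, based on the one run shown for each build.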

Signed-off-by: AlpinDale <alpindale@gmail.com>

gemini-code-assist

Code Review

This pull request introduces a split-K implementation for the fused MoE kernel to improve performance. The changes involve modifying the Triton kernel to handle split-K logic, including a 2D launch grid, atomic adds for reduction, and a pre-run hook to zero the output buffer. The host-side code is also updated to support launching the split-K kernel.

My review found a critical issue in the kernel launch logic that effectively disables the split-K optimization. I've provided a specific comment and suggestion to fix this. Once addressed, this PR should correctly enable the performance benefits of split-K.
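
As a rough illustration of the split-K idea summarized above (this is not the PR's Triton kernel), the following pure-Python CPU sketch splits the reduction axis K into `split_k` slices. Each slice stands in for one program along the extra grid axis and adds its partial sum into a zero-initialized output. The name `matmul_split_k` and its arguments are hypothetical.

```python
def matmul_split_k(a, b, split_k):
    """Compute a @ b by splitting the K (reduction) axis into split_k slices.

    a is M x K and b is K x N, both given as lists of lists. Each slice plays
    the role of one program along the K-split grid axis; here the slices run
    sequentially instead of in parallel.
    """
    m, k, n = len(a), len(b), len(b[0])
    # Counterpart of the pre-run hook: the output must start at zero because
    # every slice accumulates into the same buffer.
    c = [[0.0] * n for _ in range(m)]
    chunk = (k + split_k - 1) // split_k  # ceil(K / split_k)
    for pid_k in range(split_k):
        k_start = pid_k * chunk
        k_end = min(k_start + chunk, k)
        for i in range(m):
            for j in range(n):
                partial = 0.0
                for kk in range(k_start, k_end):
                    partial += a[i][kk] * b[kk][j]
                # Stands in for the atomic add used for the reduction on GPU.
                c[i][j] += partial
    return c


a = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
result = matmul_split_k(a, b, split_k=2)
print(result)
```

With `split_k=1` the loop degenerates to an ordinary matmul, which is why forcing `SPLIT_K` to 1 on the host side turns the optimization off. On a GPU the atomic adds can land in any order, so floating-point results may differ in the low bits between runs.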

gemini-code-assist · 2025-11-05T20:23:45Z

        config["SPLIT_K"] = 1
        BLOCK_SIZE_K = config.pop("BLOCK_SIZE_K")
        if block_shape is not None:
            BLOCK_SIZE_K = min(BLOCK_SIZE_K, min(block_shape[0], block_shape[1]))
+        if not do_split_k:
+            config["SPLIT_K"] = 1


The first line, config["SPLIT_K"] = 1, unconditionally sets SPLIT_K to 1. This overrides any tuned value for SPLIT_K from the configuration and effectively disables the split-K optimization, as the kernel will always be launched with a grid dimension of 1 for the K-split axis. The added if not do_split_k: check then has no effect, since SPLIT_K is already 1 whether or not do_split_k is set.

To fix this, the unconditional assignment should be removed, and SPLIT_K should only be set to 1 if do_split_k is False.

Suggested change

-        config["SPLIT_K"] = 1
        BLOCK_SIZE_K = config.pop("BLOCK_SIZE_K")
        if block_shape is not None:
            BLOCK_SIZE_K = min(BLOCK_SIZE_K, min(block_shape[0], block_shape[1]))
        if not do_split_k:
            config["SPLIT_K"] = 1
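
To make the effect of the suggestion concrete, here is a minimal standalone sketch of the host-side logic with the unconditional assignment removed. The helper name `resolve_launch_config` and the example config values are hypothetical. Only the `BLOCK_SIZE_K`, `SPLIT_K`, `block_shape`, and `do_split_k` names come from the diff.

```python
def resolve_launch_config(config, do_split_k, block_shape=None):
    """Hypothetical helper mirroring the suggested host-side logic.

    Returns (BLOCK_SIZE_K, remaining config) without mutating the input dict.
    """
    config = dict(config)
    block_size_k = config.pop("BLOCK_SIZE_K")
    if block_shape is not None:
        block_size_k = min(block_size_k, min(block_shape[0], block_shape[1]))
    if not do_split_k:
        # Only force a single K slice when split-K is disabled.
        config["SPLIT_K"] = 1
    return block_size_k, config


# Made-up tuned values, for illustration only.
tuned = {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_K": 128, "SPLIT_K": 4}

enabled = resolve_launch_config(tuned, do_split_k=True)
disabled = resolve_launch_config(tuned, do_split_k=False, block_shape=[64, 64])
print(enabled)
print(disabled)
```

With `do_split_k=True` the tuned `SPLIT_K` of 4 survives, and with `do_split_k=False` it is forced to 1, which is the behavior the review asks for.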

[kernel][moe] better splitK for fused moe

4b1a350

Signed-off-by: AlpinDale <alpindale@gmail.com>

gemini-code-assist Bot reviewed Nov 5, 2025

AlpinDale wants to merge 1 commit into main from moe-splitk
