[kernel][moe] better splitK for fused moe#1603

Open
AlpinDale wants to merge 1 commit into main from moe-splitk


@AlpinDale
Collaborator

Minor improvement, in the range of ~0.4%

2x RTX 6000 Ada, Qwen3-30B-A3B-FP8, 512 tokens I/O

Main:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  148.28    
Total input tokens:                      512000    
Total generated tokens:                  508351    
Request throughput (req/s):              6.74      
Output token throughput (tok/s):         3428.32   
Peak output token throughput (tok/s):    5632.00   
Peak concurrent requests:                1000.00   
Total Token throughput (tok/s):          6881.25   
---------------Time to First Token----------------
Mean TTFT (ms):                          61118.30  
Median TTFT (ms):                        49902.46  
P99 TTFT (ms):                           125346.87 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.72     
Median TPOT (ms):                        72.62     
P99 TPOT (ms):                           74.73     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.56     
Median ITL (ms):                         48.74     
P99 ITL (ms):                            259.87    
==================================================

PR:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  148.00    
Total input tokens:                      512000    
Total generated tokens:                  508677    
Request throughput (req/s):              6.76      
Output token throughput (tok/s):         3436.98   
Peak output token throughput (tok/s):    5600.00   
Peak concurrent requests:                1000.00   
Total Token throughput (tok/s):          6896.42   
---------------Time to First Token----------------
Mean TTFT (ms):                          60915.28  
Median TTFT (ms):                        49695.81  
P99 TTFT (ms):                           125183.94 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.55     
Median TPOT (ms):                        72.54     
P99 TPOT (ms):                           74.78     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.40     
Median ITL (ms):                         48.71     
P99 ITL (ms):                            256.49    
==================================================
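As a quick cross-check of the headline figure, the relative change can be computed directly from the two tables above. This is only a sketch: the values are copied by hand from the single run shown for each branch, and the names `METRICS`, `percent_change`, and `summarize` are made up for illustration. On these numbers the deltas land between roughly 0.2% and 0.4% in magnitude.

```python
# Values copied from the two benchmark tables above, as (main, pr) pairs.
METRICS = {
    "Output token throughput (tok/s)": (3428.32, 3436.98),
    "Total token throughput (tok/s)": (6881.25, 6896.42),
    "Mean TTFT (ms)": (61118.30, 60915.28),
    "Median TTFT (ms)": (49902.46, 49695.81),
    "Mean TPOT (ms)": (68.72, 68.55),
    "Mean ITL (ms)": (68.56, 68.40),
}


def percent_change(main, pr):
    """Relative change of the PR value against main, in percent."""
    return (pr - main) / main * 100.0


def summarize(metrics):
    """Map each metric name to its percent change (positive means PR is larger)."""
    return {name: percent_change(main, pr) for name, (main, pr) in metrics.items()}


if __name__ == "__main__":
    for name, delta in summarize(METRICS).items():
        print(f"{name:<34} {delta:+.2f}%")
```

Throughput metrics are better when the delta is positive, latency metrics when it is negative.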

Signed-off-by: AlpinDale <alpindale@gmail.com>
Contributor

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a split-K implementation for the fused MoE kernel to improve performance. The changes involve modifying the Triton kernel to handle split-K logic, including a 2D launch grid, atomic adds for reduction, and a pre-run hook to zero the output buffer. The host-side code is also updated to support launching the split-K kernel.
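For readers unfamiliar with split-K, the idea in the summary above can be sketched in plain Python. This is not the Triton kernel from this PR; `matmul_split_k` is a hypothetical helper, the loop over `pid_k` stands in for the extra grid axis, and the `+=` into `c` stands in for the atomic add. It does show why the output buffer has to be zeroed before launch: every K-slice accumulates into the same location.

```python
def matmul_split_k(a, b, split_k):
    """Compute C = A @ B with the K dimension partitioned into `split_k` slices.

    `a` is an M x K list of lists, `b` is a K x N list of lists.
    """
    m, k, n = len(a), len(b), len(b[0])

    # The output must start at zero because each K-slice adds its partial
    # result into it (the role of the pre-run zeroing hook in the summary).
    c = [[0.0] * n for _ in range(m)]

    chunk = -(-k // split_k)  # ceil(k / split_k)

    # In a real kernel each pid_k would be an independent program on a second
    # grid axis; here the slices simply run one after another.
    for pid_k in range(split_k):
        k_start = pid_k * chunk
        k_end = min(k_start + chunk, k)
        for i in range(m):
            for j in range(n):
                partial = 0.0
                for kk in range(k_start, k_end):
                    partial += a[i][kk] * b[kk][j]
                # Stand-in for an atomic add into the shared output tile.
                c[i][j] += partial
    return c


if __name__ == "__main__":
    a = [[1, 2, 3, 4], [5, 6, 7, 8]]
    b = [[1, 0], [0, 1], [1, 0], [0, 1]]
    print(matmul_split_k(a, b, split_k=1))
    print(matmul_split_k(a, b, split_k=2))
```

With `split_k=1` the loop degenerates to an ordinary matmul, which is the behavior the review comment below says the host code currently forces.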

My review found a critical issue in the kernel launch logic that effectively disables the split-K optimization. I've provided a specific comment and suggestion to fix this. Once addressed, this PR should correctly enable the performance benefits of split-K.

Comment on lines 655 to +660
config["SPLIT_K"] = 1
BLOCK_SIZE_K = config.pop("BLOCK_SIZE_K")
if block_shape is not None:
    BLOCK_SIZE_K = min(BLOCK_SIZE_K, min(block_shape[0], block_shape[1]))
if not do_split_k:
    config["SPLIT_K"] = 1

critical

The line config["SPLIT_K"] = 1 unconditionally sets SPLIT_K to 1. This overrides any tuned value for SPLIT_K from the configuration and effectively disables the split-K optimization, because the kernel is always launched with a grid dimension of 1 along the K-split axis. It also makes the subsequent check if not do_split_k: redundant, since SPLIT_K is already 1 whether do_split_k is True or False.

To fix this, the unconditional assignment should be removed, and SPLIT_K should only be set to 1 if do_split_k is False.

Suggested change
- config["SPLIT_K"] = 1
- BLOCK_SIZE_K = config.pop("BLOCK_SIZE_K")
- if block_shape is not None:
-     BLOCK_SIZE_K = min(BLOCK_SIZE_K, min(block_shape[0], block_shape[1]))
- if not do_split_k:
-     config["SPLIT_K"] = 1
+ BLOCK_SIZE_K = config.pop("BLOCK_SIZE_K")
+ if block_shape is not None:
+     BLOCK_SIZE_K = min(BLOCK_SIZE_K, min(block_shape[0], block_shape[1]))
+ if not do_split_k:
+     config["SPLIT_K"] = 1
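The effect of the suggestion can be checked in isolation with a small standalone sketch. The function names `resolve_config_buggy` and `resolve_config_fixed` are hypothetical wrappers that exist only for this example, and copying `config` before mutating it is an added assumption so the two variants can be compared on the same input; the bodies otherwise mirror the lines quoted in the review.

```python
def resolve_config_buggy(config, block_shape, do_split_k):
    """Mirrors the reviewed lines: SPLIT_K is reset before the flag is checked."""
    config = dict(config)  # assumption: copy so the caller's dict is untouched
    config["SPLIT_K"] = 1  # unconditional override, discards any tuned value
    block_size_k = config.pop("BLOCK_SIZE_K")
    if block_shape is not None:
        block_size_k = min(block_size_k, min(block_shape[0], block_shape[1]))
    if not do_split_k:
        config["SPLIT_K"] = 1
    return config, block_size_k


def resolve_config_fixed(config, block_shape, do_split_k):
    """Mirrors the suggested change: SPLIT_K is forced to 1 only when disabled."""
    config = dict(config)  # assumption: copy so the caller's dict is untouched
    block_size_k = config.pop("BLOCK_SIZE_K")
    if block_shape is not None:
        block_size_k = min(block_size_k, min(block_shape[0], block_shape[1]))
    if not do_split_k:
        config["SPLIT_K"] = 1
    return config, block_size_k


if __name__ == "__main__":
    # Hypothetical tuned values, chosen only to make the difference visible.
    tuned = {"BLOCK_SIZE_K": 128, "SPLIT_K": 4}
    print(resolve_config_buggy(tuned, [64, 64], do_split_k=True))
    print(resolve_config_fixed(tuned, [64, 64], do_split_k=True))
```

Neither variant handles a config with no SPLIT_K key when do_split_k is True; whether the real config always carries that key is not stated in this thread.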
