Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request refactors the kernel configuration management, removing several dedicated configuration classes and tuning scripts in favor of hardcoded run_config dictionaries.
Code Review
This pull request refactors kernel configuration management. For bmm_scaled_fp8, the tuning logic has been successfully integrated using the @autotune decorator, which is a positive architectural improvement. However, for other performance-critical kernels such as grouped_matmul, silu_and_mul_fwd, moe_sum_reduce, and rotary_emb_fwd, the dynamic configuration loading and autotuning mechanisms have been replaced with hardcoded run_config dictionaries. While this simplifies the code by removing external configuration files and tuning scripts, it introduces a significant risk of performance regression. The hardcoded configurations might not be optimal across different hardware or input shapes, and the ability to dynamically tune and cache optimal configurations, which is crucial for these kernels, has been removed. This could lead to suboptimal performance in various deployment scenarios.
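To make the trade-off concrete, here is a minimal sketch of the tuned-config cache pattern the review argues was lost: look up a previously tuned config keyed by input shape, and fall back to a static default only when no tuned entry exists. All names and values here are illustrative assumptions, not lightllm's actual API.

```python
import json
import os

# Illustrative static default, playing the role of a hardcoded run_config.
_DEFAULT_CONFIG = {"BLOCK_M": 64, "BLOCK_N": 128, "num_warps": 4}


def load_best_config(cache_path: str, shape_key: str) -> dict:
    """Return the tuned config cached for shape_key, else the static default."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
        if shape_key in cache:
            return cache[shape_key]
    return dict(_DEFAULT_CONFIG)


def save_best_config(cache_path: str, shape_key: str, config: dict) -> None:
    """Persist a tuned config so later runs on the same shape can reuse it."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    cache[shape_key] = config
    with open(cache_path, "w") as f:
        json.dump(cache, f)
```

Under this scheme a tuning script can populate the cache offline per GPU, while hardcoded dictionaries freeze one choice for all hardware and shapes.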
I was unable to create individual review comments, so all of my feedback is consolidated below.
lightllm/common/basemodel/triton_kernel/fused_moe/moe_sum_recude_config.py (1-59)
The removal of moe_sum_recude_config.py eliminates the dedicated configuration class for the sum-reduce kernel. This means the kernel loses its ability to dynamically load and save optimal configurations, potentially leading to performance degradation.
lightllm/models/qwen2_vl/triton_kernel/mrope.py (16-67)
The removal of MropeTritonFusedKernelConfig eliminates the dedicated configuration class for the MROPE kernel. This means the kernel loses its ability to dynamically load and save optimal configurations, potentially leading to performance degradation.
lightllm/common/basemodel/triton_kernel/fused_moe/moe_kernel_configs.py (1-88)
The removal of moe_kernel_configs.py eliminates the dedicated class for managing and autotuning kernel configurations. This loss of a structured configuration mechanism can negatively impact performance and maintainability, as optimal settings are now hardcoded or absent.
lightllm/common/basemodel/triton_kernel/fused_moe/moe_silu_and_mul.py (5)
Removing the import for MoeSiluAndMulKernelConfig indicates the removal of dynamic configuration for this kernel. This could lead to performance issues if the hardcoded configurations are not optimal for all scenarios.
lightllm/common/basemodel/triton_kernel/fused_moe/moe_silu_and_mul.py (123-126)
Hardcoding the run_config based on size_m removes the ability to dynamically tune and load optimal configurations. This might lead to performance degradation for inputs where these specific hardcoded values are not the most efficient.
lightllm/common/basemodel/triton_kernel/fused_moe/moe_silu_and_mul_config.py (1-53)
The removal of moe_silu_and_mul_config.py means that the kernel no longer benefits from dynamic configuration and autotuning. This could lead to suboptimal performance as the kernel cannot adapt to different execution environments or input characteristics.
lightllm/common/basemodel/triton_kernel/fused_moe/moe_sum_reduce.py (5)
The removal of MoeSumReduceKernelConfig import suggests that the dynamic configuration for this kernel has been removed. This can impact performance by preventing the kernel from using optimized settings for different inputs or hardware.
lightllm/common/basemodel/triton_kernel/fused_moe/moe_sum_reduce.py (78-83)
Hardcoding the run_config for moe_sum_reduce removes the flexibility to use dynamically tuned configurations. This could lead to performance bottlenecks, especially for varying token_num, topk_num, or hidden_dim values.
lightllm/common/basemodel/triton_kernel/fused_moe/grouped_fused_moe.py (728-747)
Replacing the dynamic try_to_get_best_config with hardcoded run_config values removes the flexibility to adapt to different hardware and input shapes. This could result in suboptimal performance, especially for varying token_inputs.shape[0] values, as the hardcoded values might not be universally optimal.
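For contrast, the hardcoded pattern being flagged looks roughly like the sketch below: a launch config picked from a fixed size threshold rather than a tuned cache. The thresholds and block sizes are invented for illustration; a single cutoff cannot capture per-GPU or per-shape differences.

```python
def pick_run_config(token_num: int) -> dict:
    """Illustrative hardcoded run_config selection keyed only on input size.

    Whatever values are baked in here are fixed at authoring time, so they
    cannot adapt to a different GPU or an unusual token_num distribution.
    """
    if token_num <= 256:
        return {"BLOCK_M": 32, "num_warps": 4}
    return {"BLOCK_M": 128, "num_warps": 8}
```

A tuned-cache lookup would instead key on (hardware, shape) and only use such values as a last-resort default.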
test/kernel/moe_silu_and_mul_tuning_bf16.py (1-217)
The removal of moe_silu_and_mul_tuning_bf16.py means that the tuning process for the moe_silu_and_mul kernel is no longer available. This directly correlates with the hardcoding of run_config in moe_silu_and_mul.py and removes the ability to find and apply optimal configurations, which is a high risk for performance.
lightllm/common/basemodel/triton_kernel/fused_moe/grouped_fused_moe.py (28)
The removal of MoeGroupedGemmKernelConfig means that the dynamic loading and caching of optimal kernel configurations are no longer available. This could lead to performance regressions as the system will not be able to adapt to different hardware or input characteristics.
lightllm/models/deepseek2/triton_kernel/rotary_emb.py (138-141)
Hardcoding the run_config based on total_len removes the ability to dynamically tune and load optimal configurations for the rotary embedding. This might lead to performance degradation for varying sequence lengths or hardware.
lightllm/models/deepseek2/triton_kernel/rotary_emb_config.py (1-61)
The removal of rotary_emb_config.py eliminates the dedicated configuration class for the rotary embedding kernel. This means the kernel loses its ability to dynamically load and save optimal configurations, potentially leading to performance degradation.
lightllm/models/deepseek2/triton_kernel/rotary_emb.py (137)
The removal of DeepseekV3RotaryKernelConfig import means that the rotary embedding kernel will no longer use dynamic configuration loading. This could lead to performance issues if the hardcoded configurations are not optimal for all use cases.
lightllm/models/qwen2_vl/triton_kernel/mrope.py (144-145)
Hardcoding the run_config for mrope_triton_fused removes the ability to dynamically tune and load optimal configurations. This might lead to performance degradation, especially for varying input characteristics.
test/kernel/fuse_moe_tuning.py (1-501)
The removal of fuse_moe_tuning.py indicates that the dedicated tuning process for fused MoE kernels is no longer available. This directly correlates with the hardcoding of run_config in grouped_fused_moe.py and removes the ability to find and apply optimal configurations, which is a high risk for performance.
lightllm/common/basemodel/triton_kernel/quantization/bmm_scaled_fp8.py (155-162)
While the @autotune decorator is a good addition, this hardcoded run_config acts as a fixed fallback. If the autotuner is disabled or fails to find a configuration, this static configuration will be used. This is less flexible than loading from a cached configuration, potentially leading to suboptimal performance in certain scenarios.
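To illustrate the concern, here is a minimal sketch of an autotune-with-static-fallback decorator. It is purely illustrative and not the project's actual @autotune: when tuning is disabled (or could be made to fail), every call degenerates to the one fixed fallback config instead of a previously cached best config.

```python
import functools
import time


def autotune(configs, fallback, enabled=True):
    """Time each candidate config once per key and cache the fastest.

    If tuning is disabled, the static `fallback` config is always used,
    which is the behavior the review flags as potentially suboptimal.
    """
    def deco(kernel):
        cache = {}

        @functools.wraps(kernel)
        def wrapper(key, *args, **kwargs):
            if not enabled:
                return kernel(*args, config=fallback, **kwargs)
            if key not in cache:
                best, best_t = fallback, float("inf")
                for cfg in configs:
                    t0 = time.perf_counter()
                    kernel(*args, config=cfg, **kwargs)
                    dt = time.perf_counter() - t0
                    if dt < best_t:
                        best, best_t = cfg, dt
                cache[key] = best  # remember the fastest config for this key
            return kernel(*args, config=cache[key], **kwargs)

        return wrapper
    return deco
```

Persisting `cache` to disk, as the removed config classes reportedly did, would let the fallback path load a tuned config instead of a fixed one.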