
fix kernel configs #1240

Merged
hiworldwzj merged 6 commits into main from wzj_fix on Mar 19, 2026

Conversation

@hiworldwzj
Collaborator

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the kernel configuration management by removing several dedicated *KernelConfig classes and their associated dynamic lookup mechanisms. The change simplifies kernel initialization by either embedding static default configurations directly into the kernel functions or, in the case of bmm_scaled_fp8, by adopting a decorator-based autotuning approach. This streamlines the codebase and removes dependencies on the previous configuration system.

Highlights

  • Kernel Configuration Refactoring: Removed several *KernelConfig classes (e.g., MoeGroupedGemmKernelConfig, MoeSiluAndMulKernelConfig, MoeSumReduceKernelConfig, DeepseekV3RotaryKernelConfig, MropeTritonFusedKernelConfig) that were previously used for dynamic kernel configuration lookup.
  • Static Kernel Configurations: Replaced dynamic configuration lookups in grouped_fused_moe.py, moe_silu_and_mul.py, moe_sum_reduce.py, rotary_emb.py, and mrope.py with hardcoded default run_config dictionaries, simplifying the kernel setup.
  • Autotuning Integration for FP8 BMM: Integrated the autotune decorator directly into the bmm_scaled_fp8 function, replacing its dedicated BmmScaledFp8KernelConfig class and streamlining its performance tuning.
  • Removal of Tuning Test Files: Deleted obsolete tuning test scripts (deepseekv2_bmm_scaled_fp8_tuning.py, fuse_moe_tuning.py, moe_silu_and_mul_tuning_bf16.py) that were associated with the removed kernel configuration classes.
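The refactoring pattern described in the highlights can be sketched as follows. The threshold values and the helper name below are hypothetical; only the general shape (a static default `run_config` dict chosen inline, replacing a `*KernelConfig` lookup class) comes from the PR description:

```python
# Hypothetical sketch of the change described above: instead of asking a
# *KernelConfig class to look up a tuned configuration, the kernel picks a
# hardcoded default run_config based on the problem size. The dict keys
# mirror common Triton launch parameters; the values are illustrative only.

def get_default_run_config(size_m: int) -> dict:
    """Return a static launch config keyed on the problem size."""
    if size_m <= 256:
        # Small problems: smaller tiles, fewer warps.
        return {"BLOCK_M": 64, "BLOCK_N": 64, "num_warps": 4, "num_stages": 2}
    # Larger problems: bigger tiles and more parallelism.
    return {"BLOCK_M": 128, "BLOCK_N": 128, "num_warps": 8, "num_stages": 3}
```

The trade-off the reviewer raises below follows directly from this shape: the branch on `size_m` is fixed at authoring time, so it cannot adapt to a GPU where different tile sizes would win.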


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request refactors kernel configuration management. For bmm_scaled_fp8, the tuning logic has been successfully integrated using the @autotune decorator, which is a positive architectural improvement. However, for other performance-critical kernels such as grouped_matmul, silu_and_mul_fwd, moe_sum_reduce, and rotary_emb_fwd, the dynamic configuration loading and autotuning mechanisms have been replaced with hardcoded run_config dictionaries. While this simplifies the code by removing external configuration files and tuning scripts, it introduces a significant risk of performance regression. The hardcoded configurations might not be optimal across different hardware or input shapes, and the ability to dynamically tune and cache optimal configurations, which is crucial for these kernels, has been removed. This could lead to suboptimal performance in various deployment scenarios.

I am having trouble creating individual review comments; my feedback is listed below.

lightllm/common/basemodel/triton_kernel/fused_moe/moe_sum_recude_config.py (1-59)

high

The removal of moe_sum_recude_config.py eliminates the dedicated configuration class for the sum-reduce kernel. This means the kernel loses its ability to dynamically load and save optimal configurations, potentially leading to performance degradation.

lightllm/models/qwen2_vl/triton_kernel/mrope.py (16-67)

high

The removal of MropeTritonFusedKernelConfig eliminates the dedicated configuration class for the MROPE kernel. This means the kernel loses its ability to dynamically load and save optimal configurations, potentially leading to performance degradation.

lightllm/common/basemodel/triton_kernel/fused_moe/moe_kernel_configs.py (1-88)

high

The removal of moe_kernel_configs.py eliminates the dedicated class for managing and autotuning kernel configurations. This loss of a structured configuration mechanism can negatively impact performance and maintainability, as optimal settings are now hardcoded or absent.

lightllm/common/basemodel/triton_kernel/fused_moe/moe_silu_and_mul.py (5)

high

Removing the import for MoeSiluAndMulKernelConfig indicates the removal of dynamic configuration for this kernel. This could lead to performance issues if the hardcoded configurations are not optimal for all scenarios.

lightllm/common/basemodel/triton_kernel/fused_moe/moe_silu_and_mul.py (123-126)

high

Hardcoding the run_config based on size_m removes the ability to dynamically tune and load optimal configurations. This might lead to performance degradation for inputs where these specific hardcoded values are not the most efficient.

lightllm/common/basemodel/triton_kernel/fused_moe/moe_silu_and_mul_config.py (1-53)

high

The removal of moe_silu_and_mul_config.py means that the kernel no longer benefits from dynamic configuration and autotuning. This could lead to suboptimal performance as the kernel cannot adapt to different execution environments or input characteristics.

lightllm/common/basemodel/triton_kernel/fused_moe/moe_sum_reduce.py (5)

high

The removal of MoeSumReduceKernelConfig import suggests that the dynamic configuration for this kernel has been removed. This can impact performance by preventing the kernel from using optimized settings for different inputs or hardware.

lightllm/common/basemodel/triton_kernel/fused_moe/moe_sum_reduce.py (78-83)

high

Hardcoding the run_config for moe_sum_reduce removes the flexibility to use dynamically tuned configurations. This could lead to performance bottlenecks, especially for varying token_num, topk_num, or hidden_dim values.

lightllm/common/basemodel/triton_kernel/fused_moe/grouped_fused_moe.py (728-747)

high

Replacing the dynamic try_to_get_best_config with hardcoded run_config values removes the flexibility to adapt to different hardware and input shapes. This could result in suboptimal performance, especially for varying token_inputs.shape[0] values, as the hardcoded values might not be universally optimal.
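The removed `try_to_get_best_config` mechanism this comment refers to presumably worked along these lines: look up a tuned configuration for the current shape from a cache, and fall back to a static default when none exists. The cache file layout and key format below are assumptions for illustration, not lightllm's actual implementation:

```python
import json
import os


def try_to_get_best_config(cache_path: str, shape_key: str, default: dict) -> dict:
    """Sketch of a dynamic config lookup with a static fallback.

    If a tuned-config cache file exists and contains an entry for this
    shape, return the cached launch parameters; otherwise return the
    hardcoded default. The PR removes this lookup step, keeping only
    the equivalent of `default`.
    """
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cached = json.load(f)
        if shape_key in cached:
            return cached[shape_key]
    return default
```

Under this reading, the reviewer's concern is that the first branch (per-hardware, per-shape tuned entries) is gone, so every deployment now runs on the equivalent of the `default` dict.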

test/kernel/moe_silu_and_mul_tuning_bf16.py (1-217)

high

The removal of moe_silu_and_mul_tuning_bf16.py means that the tuning process for the moe_silu_and_mul kernel is no longer available. This directly correlates with the hardcoding of run_config in moe_silu_and_mul.py and removes the ability to find and apply optimal configurations, which is a high risk for performance.

lightllm/common/basemodel/triton_kernel/fused_moe/grouped_fused_moe.py (28)

high

The removal of MoeGroupedGemmKernelConfig means that the dynamic loading and caching of optimal kernel configurations are no longer available. This could lead to performance regressions as the system will not be able to adapt to different hardware or input characteristics.

lightllm/models/deepseek2/triton_kernel/rotary_emb.py (138-141)

high

Hardcoding the run_config based on total_len removes the ability to dynamically tune and load optimal configurations for the rotary embedding. This might lead to performance degradation for varying sequence lengths or hardware.

lightllm/models/deepseek2/triton_kernel/rotary_emb_config.py (1-61)

high

The removal of rotary_emb_config.py eliminates the dedicated configuration class for the rotary embedding kernel. This means the kernel loses its ability to dynamically load and save optimal configurations, potentially leading to performance degradation.

lightllm/models/deepseek2/triton_kernel/rotary_emb.py (137)

high

The removal of DeepseekV3RotaryKernelConfig import means that the rotary embedding kernel will no longer use dynamic configuration loading. This could lead to performance issues if the hardcoded configurations are not optimal for all use cases.

lightllm/models/qwen2_vl/triton_kernel/mrope.py (144-145)

high

Hardcoding the run_config for mrope_triton_fused removes the ability to dynamically tune and load optimal configurations. This might lead to performance degradation, especially for varying input characteristics.

test/kernel/fuse_moe_tuning.py (1-501)

high

The removal of fuse_moe_tuning.py indicates that the dedicated tuning process for fused MoE kernels is no longer available. This directly correlates with the hardcoding of run_config in grouped_fused_moe.py and removes the ability to find and apply optimal configurations, which is a high risk for performance.

lightllm/common/basemodel/triton_kernel/quantization/bmm_scaled_fp8.py (155-162)

medium

While the @autotune decorator is a good addition, this hardcoded run_config acts as a fixed fallback. If the autotuner is disabled or fails to find a configuration, this static configuration will be used. This is less flexible than loading from a cached configuration, potentially leading to suboptimal performance in certain scenarios.
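The behavior this comment describes (an autotuner that picks a config per call site, with a static `run_config` used as fallback when tuning is bypassed) can be illustrated roughly as below. This is a minimal sketch, not lightllm's actual `autotune` API; the `configs` and `key_fn` parameters, the caching scheme, and the stubbed benchmarking are all assumptions:

```python
import functools


def autotune(configs, key_fn):
    """Illustrative autotune decorator with a static-fallback escape hatch.

    On first call for a given key, a real autotuner would benchmark each
    candidate config and cache the fastest; here the benchmarking is
    stubbed out and the first candidate is taken. If the caller passes an
    explicit run_config, tuning is skipped entirely (the fallback path
    the review comment is concerned about).
    """
    def decorator(fn):
        cache = {}

        @functools.wraps(fn)
        def wrapper(*args, run_config=None, **kwargs):
            if run_config is not None:
                # Autotuner bypassed: use the caller's static config as-is.
                return fn(*args, run_config=run_config, **kwargs)
            key = key_fn(*args, **kwargs)
            if key not in cache:
                # Stub: a real implementation would time each config here.
                cache[key] = configs[0]
            return fn(*args, run_config=cache[key], **kwargs)

        return wrapper

    return decorator
```

This makes the reviewer's point concrete: whenever the `run_config is not None` branch is taken, performance depends entirely on how good that fixed dict happens to be on the current hardware.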

@hiworldwzj hiworldwzj merged commit c42f9a5 into main Mar 19, 2026
1 check passed
@hiworldwzj hiworldwzj deleted the wzj_fix branch March 19, 2026 08:55
