
Sync MOE layer input quantizer only #903

Open

jenchen13 wants to merge 4 commits into main from jennifchen/moe_sync_input

Conversation

Contributor

@jenchen13 jenchen13 commented Feb 18, 2026

What does this PR do?

Type of change: Bug fix

Overview: In the MoE layer we currently sync both the weight and input quantizers so that all experts share the same weight amaxes and activation amaxes.

VLLM/TRTLLM actually support non-uniform weight amaxes in MoE, so we only need to sync the activation amaxes.

Usage

# Add a code snippet demonstrating how to use this
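The snippet above was left unfilled; as a purely illustrative sketch, the intended flow looks roughly like the standard modelopt.torch.quantization calibration path, during which the MoE input-amax sync is applied. Model, dataloader, and config names below are assumptions, not code from this PR:

import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Run a few calibration batches so every expert's input_quantizer
    # records an amax value (calib_dataloader is a hypothetical name).
    for batch in calib_dataloader:
        model(batch)

# With this change, calibration synchronizes only the input_quantizer amaxes
# across local experts; weight amaxes remain per-expert, which VLLM/TRTLLM
# can consume directly. Config name assumed from modelopt's published examples.
model = mtq.quantize(moe_model, mtq.FP8_DEFAULT_CFG, forward_loop)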

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • Bug Fixes
    • Improved input quantizer synchronization for Mixture of Experts models to ensure correct amax value handling across local experts.
  • Documentation
    • Fixed typos and clarified wording in quantization documentation.
  • Tests
    • Added test coverage for Mixture of Experts quantizer synchronization functionality.

@jenchen13 jenchen13 requested a review from a team as a code owner February 18, 2026 17:32
@jenchen13 jenchen13 requested a review from realAsma February 18, 2026 17:32
Contributor

coderabbitai bot commented Feb 18, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This change restricts layer_sync_moe_local_experts_amax to exclusively process input_quantizer amax values from TensorQuantizer instances, preventing unintended synchronization of other quantizers. Documentation is updated for clarity, and a new test validates the synchronization behavior across local MoE experts.
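A simplified sketch of the collect-and-apply pattern described above (not the exact implementation; the TensorQuantizer import path is an assumption):

import torch
from modelopt.torch.quantization.nn import TensorQuantizer  # import path assumed

def sync_local_experts_input_amax(local_experts):
    # 1) Gather the element-wise max of every expert's input_quantizer amax.
    amax_dict = {}
    for expert in local_experts:
        for name, module in expert.named_modules():
            if (
                isinstance(module, TensorQuantizer)
                and name == "input_quantizer"
                and module.amax is not None
            ):
                prev = amax_dict.get(name)
                amax_dict[name] = module.amax if prev is None else torch.maximum(prev, module.amax)

    # 2) Write the shared amax back to each expert's input_quantizer only;
    #    weight_quantizer amaxes are intentionally left per-expert.
    for expert in local_experts:
        for name, module in expert.named_modules():
            if isinstance(module, TensorQuantizer) and name in amax_dict:
                module.amax = amax_dict[name].detach().clone()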

Changes

  • MoE Quantizer Synchronization Logic (modelopt/torch/quantization/plugins/megatron.py):
    Modified layer_sync_moe_local_experts_amax to filter and sync only input_quantizer amax values instead of all TensorQuantizer instances. Fixed documentation typos (garuantee, synchonizing, lyaer).
  • Test Coverage for MoE Quantizer Sync (tests/gpu_megatron/torch/quantization/plugins/test_megatron.py):
    Added a test_layer_sync_moe_local_experts_amax test that verifies amax synchronization across local MoE experts with grouped and non-grouped GEMM configurations.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 warning

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 66.67%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title Check ✅ Passed: The title "Sync MOE layer input quantizer only" directly and clearly describes the main change: restricting synchronization to input quantizers only, not weight quantizers.
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
modelopt/torch/quantization/plugins/megatron.py (1)

593-618: Core logic change LGTM; minor redundancy in the apply loop.

Restricting synchronization to name == "input_quantizer" is correct: input activations must be uniform across all experts to avoid EP-distributed-sync deadlocks, while weight amaxes are intentionally left non-uniform for VLLM/TRTLLM.

One nit: in the apply loop (lines 613–617), name in amax_dict and name == "input_quantizer" are redundant — amax_dict can only ever hold "input_quantizer" as a key. The name in amax_dict check implicitly handles the empty-dict (no calibrated amax) case. Consider keeping only the more general and intention-revealing form:

♻️ Optional simplification
         for expert in self.local_experts:
             for name, module in expert.named_modules():
                 if (
                     isinstance(module, TensorQuantizer)
                     and name in amax_dict
-                    and name == "input_quantizer"
                 ):
                     module.amax = amax_dict[name].detach().clone()
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `modelopt/torch/quantization/plugins/megatron.py`:
- Around lines 578-591: Fix two minor nits in the docstring of the function that syncs input quantizer amax across local experts in a SequentialMLP: insert the missing space after "amax." so it reads "amax. This", and change "there are logic" to "there is logic" in the Note paragraph.

In `tests/gpu_megatron/torch/quantization/plugins/test_megatron.py`:
- Around lines 738-768: Parametrize this test and run it as a multiprocess job with EP=2. Move the body into a helper (e.g. _test_layer_sync_moe_local_experts_amax_helper(rank, size)) and invoke it via spawn_multiprocess_job from a top-level test decorated with @pytest.mark.parametrize("moe_grouped_gemm", [True, False]) that uses the need_2_gpus fixture. Inside the helper, call initialize_for_megatron with expert_model_parallel_size=2, obtain the forward closure by calling forward(model) (not forward()), and uncomment and run mtq.quantize(...) so calibration actually sets expert.input_quantizer.amax. Then iterate model.named_modules(), call module.layer_sync_moe_local_experts_amax() on each _MegatronSequentialMLP instance, and check module.local_experts for matching amax values. Remove the stray print(model) and call destroy_model_parallel() at the end of the helper. A sketch of this structure follows below.
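A sketch of the suggested structure, assembled from the review comment above. Fixture names and helper signatures are taken from that comment where possible; the model builder, the quantizer attribute path, and the spawn_multiprocess_job signature are assumptions:

from functools import partial

import pytest
import torch

import modelopt.torch.quantization as mtq

@pytest.mark.parametrize("moe_grouped_gemm", [True, False])
def test_layer_sync_moe_local_experts_amax(need_2_gpus, moe_grouped_gemm):
    # spawn_multiprocess_job signature assumed from the existing test utilities.
    spawn_multiprocess_job(
        size=2,
        job=partial(_test_layer_sync_moe_local_experts_amax_helper, moe_grouped_gemm=moe_grouped_gemm),
        backend="nccl",
    )

def _test_layer_sync_moe_local_experts_amax_helper(rank, size, moe_grouped_gemm):
    initialize_for_megatron(expert_model_parallel_size=2)
    model = _build_moe_model(moe_grouped_gemm)  # hypothetical model builder
    # Run real calibration so each expert's input_quantizer.amax is populated;
    # forward(model) is the existing closure builder referenced in the comment.
    mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward(model))
    for _, module in model.named_modules():
        if isinstance(module, _MegatronSequentialMLP):
            module.layer_sync_moe_local_experts_amax()
            # Attribute path to the input quantizer is assumed for illustration.
            amaxes = [e.linear_fc1.input_quantizer.amax for e in module.local_experts]
            assert all(torch.equal(a, amaxes[0]) for a in amaxes)
    destroy_model_parallel()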

---

Nitpick comments:
In `modelopt/torch/quantization/plugins/megatron.py`:
- Around lines 593-618: The apply loop redundantly checks both name == "input_quantizer" and name in amax_dict. One simplification is to drop the membership check and look up safely instead: inside the second loop over self.local_experts and expert.named_modules(), when isinstance(module, TensorQuantizer) and name == "input_quantizer", set amax = amax_dict.get(name) and, if amax is not None, assign module.amax = amax.detach().clone().

@codecov

codecov bot commented Feb 18, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 73.54%. Comparing base (9e38041) to head (023b0a3).

Files with missing lines | Patch % | Lines
modelopt/torch/quantization/model_calib.py | 50.00% | 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #903   +/-   ##
=======================================
  Coverage   73.54%   73.54%           
=======================================
  Files         205      205           
  Lines       22000    22001    +1     
=======================================
+ Hits        16179    16180    +1     
  Misses       5821     5821           

☔ View full report in Codecov by Sentry.

@jenchen13 jenchen13 requested a review from ChenhanYu February 18, 2026 19:29
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
@jenchen13 jenchen13 force-pushed the jennifchen/moe_sync_input branch from a290003 to 28d5686 on February 18, 2026 20:06
Contributor

Can we add an else-condition test as well (i.e. when this is False, the weight_quantizer amax should differ between local_experts)?
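A minimal sketch of the requested negative assertion (the attribute path to the per-expert weight quantizer is an assumption, not code from this PR):

# After calibration without weight-amax syncing, weight amaxes are expected to
# diverge across local experts, so at least one should differ from the first.
weight_amaxes = [e.linear_fc1.weight_quantizer.amax for e in module.local_experts]
assert any(not torch.equal(a, weight_amaxes[0]) for a in weight_amaxes[1:])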

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
