Conversation
Important: Review skipped. Auto incremental reviews are disabled on this repository; check the settings in the CodeRabbit UI to re-enable them.
📝 Walkthrough: This change restricts the MoE local-expert amax synchronization to the input quantizer, leaving per-expert weight amaxes unsynchronized.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 2 passed
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 2
🧹 Nitpick comments (1)
modelopt/torch/quantization/plugins/megatron.py (1)
593-618: Core logic change LGTM; minor redundancy in the apply loop.

Restricting synchronization to `name == "input_quantizer"` is correct: input activations must be uniform across all experts to avoid EP-distributed-sync deadlocks, while weight amaxes are intentionally left non-uniform for VLLM/TRTLLM.

One nit: in the apply loop (lines 613–617), `name in amax_dict` and `name == "input_quantizer"` are redundant, since `amax_dict` can only ever hold `"input_quantizer"` as a key. The `name in amax_dict` check implicitly handles the empty-dict (no calibrated amax) case. Consider keeping only the more general and intention-revealing form:

♻️ Optional simplification
```diff
 for expert in self.local_experts:
     for name, module in expert.named_modules():
         if (
             isinstance(module, TensorQuantizer)
             and name in amax_dict
-            and name == "input_quantizer"
         ):
             module.amax = amax_dict[name].detach().clone()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/megatron.py` around lines 593 - 618, The apply loop redundantly checks both name == "input_quantizer" and name in amax_dict; instead simplify by removing the membership check and safely lookup from amax_dict: inside the second loop over self.local_experts and expert.named_modules() when isinstance(module, TensorQuantizer) and name == "input_quantizer", do amax = amax_dict.get(name) and if amax is not None set module.amax = amax.detach().clone(); reference symbols: amax_dict, self.local_experts, TensorQuantizer, name == "input_quantizer", module.amax.
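For reference, a minimal sketch of the loop shape that prompt describes, assuming the surrounding method context (`self.local_experts` and a previously built `amax_dict`); this is an illustration of the suggestion, not the actual implementation:

```python
# Sketch only: assumes `amax_dict` was filled by an earlier pass over the
# local experts and that `TensorQuantizer` is already imported in this module.
for expert in self.local_experts:
    for name, module in expert.named_modules():
        if isinstance(module, TensorQuantizer) and name == "input_quantizer":
            amax = amax_dict.get(name)  # None when no amax was calibrated
            if amax is not None:
                module.amax = amax.detach().clone()
```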
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/quantization/plugins/megatron.py`:
- Around line 578-591: Fix two minor docstring nits in the megatron.py docstring
that describes the sync of input quantizer amax across local experts in a
SequentialMLP: insert a missing space after "amax." so it reads "amax. This",
and change "there are logic" to "there is logic" in the Note paragraph; update
the docstring associated with the sync function in
modelopt/torch/quantization/plugins/megatron.py (the function that syncs input
quantizer amax across local experts in a SequentialMLP) accordingly.
In `@tests/gpu_megatron/torch/quantization/plugins/test_megatron.py`:
- Around line 738-768: Add a pytest parametrize decorator and convert this test
into a MP-spawned helper to run with EP=2: create a helper (e.g.
_test_layer_sync_moe_local_experts_amax_helper(rank, size)) and call it via
spawn_multiprocess_job from the top-level test decorated with
`@pytest.mark.parametrize`("moe_grouped_gemm", [True, False]) and using the
need_2_gpus fixture; inside the helper call initialize_for_megatron with
expert_model_parallel_size=2, obtain the forward closure correctly by calling
forward(model) (not forward()), uncomment and run mtq.quantize(...) to perform
calibration so expert.input_quantizer.amax is set, iterate model.named_modules()
and for each module instance of _MegatronSequentialMLP call
module.layer_sync_moe_local_experts_amax() and then inspect module.local_experts
for matching amax values, remove the stray print(model) and ensure
destroy_model_parallel() is called at the end of the helper.
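A rough sketch of the test structure that prompt asks for; the helper name, model builder, `forward` factory, and the exact `spawn_multiprocess_job` / `initialize_for_megatron` signatures are assumptions modeled on the other tests in this file, not verified API:

```python
# Hypothetical sketch, not the verified test. _gpt_model_provider, forward,
# spawn_multiprocess_job, initialize_for_megatron, destroy_model_parallel and
# _MegatronSequentialMLP are assumed to come from this test module's imports.
from functools import partial

import pytest
import torch

import modelopt.torch.quantization as mtq


def _test_layer_sync_moe_local_experts_amax_helper(moe_grouped_gemm, rank, size):
    # EP=2 so the sync actually crosses expert-parallel ranks (assumed kwarg).
    initialize_for_megatron(expert_model_parallel_size=2, seed=1234)
    model = _gpt_model_provider(moe_grouped_gemm=moe_grouped_gemm)  # hypothetical builder
    calib_fn = forward(model)  # forward(model) returns the calibration closure
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, calib_fn)

    for _, module in model.named_modules():
        if isinstance(module, _MegatronSequentialMLP):
            module.layer_sync_moe_local_experts_amax()
            # Assumed local-expert layout: linear_fc1.input_quantizer holds the amax.
            amaxes = [e.linear_fc1.input_quantizer.amax for e in module.local_experts]
            assert all(torch.equal(a, amaxes[0]) for a in amaxes)

    destroy_model_parallel()


@pytest.mark.parametrize("moe_grouped_gemm", [True, False])
def test_layer_sync_moe_local_experts_amax(need_2_gpus, moe_grouped_gemm):
    spawn_multiprocess_job(
        size=2,
        job=partial(_test_layer_sync_moe_local_experts_amax_helper, moe_grouped_gemm),
        backend="nccl",
    )
```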
---
Nitpick comments:
In `@modelopt/torch/quantization/plugins/megatron.py`:
- Around line 593-618: The apply loop redundantly checks both name ==
"input_quantizer" and name in amax_dict; instead simplify by removing the
membership check and safely lookup from amax_dict: inside the second loop over
self.local_experts and expert.named_modules() when isinstance(module,
TensorQuantizer) and name == "input_quantizer", do amax = amax_dict.get(name)
and if amax is not None set module.amax = amax.detach().clone(); reference
symbols: amax_dict, self.local_experts, TensorQuantizer, name ==
"input_quantizer", module.amax.
Codecov Report ❌

Additional details and impacted files

```
@@           Coverage Diff            @@
##             main     #903    +/-   ##
========================================
  Coverage   73.54%   73.54%
========================================
  Files         205      205
  Lines       22000    22001     +1
========================================
+ Hits        16179    16180     +1
  Misses       5821     5821
```

☔ View full report in Codecov by Sentry.
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Force-pushed from a290003 to 28d5686
Can we add an else-condition test as well (i.e., when this is False, the weight_quantizer amax should differ between local_experts)?
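A minimal sketch of the negative check being requested; the attribute path (`linear_fc1.weight_quantizer`) is an assumption about the local-expert layout, not taken from this PR:

```python
# Sketch: after calibration without a weight-amax sync, per-expert weight
# amaxes are expected to differ. `module` is an _MegatronSequentialMLP instance.
weight_amaxes = [
    expert.linear_fc1.weight_quantizer.amax for expert in module.local_experts
]
assert not all(torch.equal(a, weight_amaxes[0]) for a in weight_amaxes)
```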
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
What does this PR do?
Type of change: Bug fix
Overview: in the MoE layer we currently sync both the weight and input quantizers so that all experts share the same weight amaxes and activation amaxes.
VLLM/TRTLLM actually support non-uniform weight amaxes in MoE, so we only need to sync the activation amaxes.
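To make the intended end state concrete, here is a minimal sketch of the post-fix invariant (attribute names such as `linear_fc1`, `input_quantizer`, and `weight_quantizer` are assumed from the Megatron SequentialMLP layout, not taken from this PR):

```python
import torch

def check_expert_amax_invariant(sequential_mlp):
    """Sketch of the expected state after calibration + sync (assumed layout)."""
    experts = sequential_mlp.local_experts
    # Activation (input) amaxes must be identical across local experts.
    input_amaxes = [e.linear_fc1.input_quantizer.amax for e in experts]
    assert all(torch.equal(a, input_amaxes[0]) for a in input_amaxes)
    # Weight amaxes may legitimately differ per expert; VLLM/TRTLLM accept
    # non-uniform weight scales, so no uniformity is enforced here.
```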
Usage
# Add a code snippet demonstrating how to use this
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
Bug Fixes
Documentation
Tests