Description
When attempting to quantize Qwen3-Next-80B-A3B-Instruct using the HF PTQ example with INT4 AWQ quantization, the calibration process appears to complete successfully, but the post-quantization generation step fails with a RuntimeError indicating that the probability tensor contains invalid values (inf, nan, or negative elements).
The error occurs during the post_quantize() function's preview generation call, specifically when torch.multinomial() attempts to sample from the model's output probabilities.
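For clarity, the message in the traceback is the generic sanity check that torch.multinomial performs on its probability input. A tiny standalone illustration of the failure mode (unrelated to ModelOpt; the exact error text can vary between CPU and CUDA builds):

import torch

# Once any logit is non-finite, softmax yields nan probabilities and
# torch.multinomial refuses to sample from them.
logits = torch.tensor([[1.0, float("nan"), 0.5]])
probs = torch.softmax(logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)  # raises RuntimeError

In other words, the sampler is just the messenger; the underlying problem is that the quantized model's logits contain inf/nan.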
Environment
- nvidia-modelopt: 0.41.0
- PyTorch: 2.9.0a0+145a3a7bda.nv25.10
- CUDA: 13.0
- cuDNN: 91400
- transformers: 4.57.6
- tensorrt: 10.13.3.9
- tensorrt_llm: 1.2.0rc6
- accelerate: 1.12.0
- Python: 3.12.3
- GPU: NVIDIA A100 80GB PCIe (compute capability 8.0)
- OS: Ubuntu 24.04.3 LTS (running in a Docker container)
- ModelOpt Git commit: d39cf45
Model Details
- Model: Qwen3-Next-80B-A3B-Instruct
- Architecture: sparse MoE with 512 experts, 10 experts per token
- Layers: 48
- Special features: sparse attention with decoder_sparse_step: 1 and full_attention_interval: 4
- Source: local model directory (originally downloaded from Hugging Face)
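For reference, these architecture details were read from the model's config.json; a quick check we can rerun on request (attribute names are taken from our local config and may differ across transformers versions):

from transformers import AutoConfig

# Sanity-check the fields cited above; attribute names assumed from our local
# config.json for Qwen3-Next and may not match other transformers versions.
cfg = AutoConfig.from_pretrained("path/to/my/model/Qwen3-Next-80B-A3B-Instruct/")
print(cfg.num_hidden_layers)        # 48
print(cfg.num_experts)              # 512
print(cfg.num_experts_per_tok)      # 10
print(cfg.decoder_sparse_step)      # 1
print(cfg.full_attention_interval)  # 4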
Reproduction Steps
- Clone/install NVIDIA Model Optimizer (commit d39cf45)
- Download the Qwen3-Next-80B-A3B-Instruct model weights
- Run the following command:

cd TensorRT-Model-Optimizer/examples/llm_ptq/scripts
./huggingface_example.sh \
    --model path/to/my/model/Qwen3-Next-80B-A3B-Instruct/ \
    --quant int4_awq \
    --calib 32 \
    --calib_batch_size 8 \
    --batch 8 \
    --calib_dataset cnn_dailymail

Expected Behavior
The post-quantization generation should complete successfully, producing text output (even if quality is degraded relative to the full-precision model), and the quantized model should not produce inf or nan values in its logits/probabilities. Note that, due to hardware constraints, we have modified huggingface_example.sh to pass the --device=cpu argument to hf_ptq.py: the server we are using has only 80 GB of VRAM on a single NVIDIA A100, but the CPU has access to 216 GB of system RAM.
Actual Behavior
The quantization process appears to complete calibration, but fails during the post-quantization preview generation step with the following traceback:
Traceback (most recent call last):
File "/storage/TensorRT-Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 1025, in <module>
main(args)
File "/storage/TensorRT-Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 1004, in main
quantize_main(
File "/storage/TensorRT-Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 797, in quantize_main
post_quantize(
File "/storage/TensorRT-Model-Optimizer/examples/llm_ptq/hf_ptq.py", line 623, in post_quantize
generated_ids_after_ptq = full_model.generate(preview_input_ids, max_new_tokens=100)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py", line 2566, in generate
result = decoding_method(
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/transformers/generation/utils.py", line 2831, in _sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
######## GPU 0: Peak memory usage = 1.17 GB for all processes on the GPU ########

Additional Context
- The relatively low peak GPU memory usage (1.17 GB) suggests the model loaded and ran through calibration, but the quantized inference is producing numerically invalid outputs.
- This is a production use case where INT4 quantization is needed to fit the 80B MoE model efficiently on the available hardware.
- I'm happy to provide additional debugging information, test alternative configurations, or run instrumented versions of the code (for example, something along the lines of the hook sketch below) if that would help diagnose the root cause.
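As one example of the kind of instrumentation we could run, here is a rough sketch of our own debugging code (not part of ModelOpt) that registers forward hooks to flag modules whose outputs go non-finite during the preview generation; full_model is assumed to be the quantized model object used in post_quantize():

import torch

def register_nonfinite_hooks(model):
    # Debugging sketch only: print a line for every module whose output
    # contains inf/nan. Hooks fire in forward-execution order, so the first
    # line printed points at the earliest offending module.
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, (tuple, list)) else output
            if torch.is_tensor(out) and not torch.isfinite(out).all():
                print(f"non-finite output in: {name} ({type(module).__name__})")
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call handle.remove() on each when done

# Usage (hypothetical): handles = register_nonfinite_hooks(full_model), then
# rerun full_model.generate(preview_input_ids, max_new_tokens=100).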
Workarounds Attempted
As noted above, we are quantizing on the CPU to work around the VRAM limitation.
We arrived at this configuration after a number of attempted workarounds that involved modifying code in both ModelOpt and TensorRT-LLM in a handful of places in order to force the quantization to complete. We are not sure where to go from here.
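One further check we can run on request is to bypass sampling entirely and inspect the logits directly, to confirm that the non-finite values originate in the model output rather than in the sampler. A rough sketch, assuming the same full_model and preview_input_ids objects used in post_quantize():

import torch

# Debugging sketch (our own, not from hf_ptq.py): check the raw logits for
# inf/nan, then retry generation with greedy decoding, which sidesteps the
# torch.multinomial call that currently raises.
with torch.inference_mode():
    logits = full_model(preview_input_ids).logits
    print("any nan:", torch.isnan(logits).any().item())
    print("any inf:", torch.isinf(logits).any().item())

    ids = full_model.generate(preview_input_ids, max_new_tokens=20, do_sample=False)

If the logits themselves are non-finite, greedy decoding will still produce garbage output, but the check above would at least confirm where the invalid values first appear.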