Unable to export FP8 Qwen2.5-1.5B model to ONNX with per-channel quantization. #848

@jianlany

Description

Make sure you already checked the examples and documentation before submitting an issue.

How would you like to use ModelOpt

I want to export a QAT FP8 Qwen2.5-1.5B classification model with per-channel weight quantization and dynamic input quantization to ONNX, to ultimately run it with TensorRT. Currently the exporter get_onnx_bytes_and_metadata complains that per-channel quantization is not supported.

It also complains about an empty amax for the input quantizer, which suggests it is not aware of the dynamic input quantization.

Can you clarify whether these features are indeed unsupported? If so, any pointers to other paths for running the model with TensorRT would be appreciated!

I have created a minimal reproduction case here; please give it a try.
https://onedrive.cloud.microsoft/:f:/a@ye2tf6tq/S/IgBdqhatiA5zQKVrh9cwJxb-AWgG9GXSxK8FBlhKGIoxNf8?e=bg2OpW
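For context, the core of the flow looks roughly like the sketch below. It is a simplified sketch, not the full repro script: the per-channel and dynamic-input config overrides and the exporter import path are my assumptions and may not match the exact ModelOpt API.

```python
# Simplified sketch. The per-channel ("axis": 0) and dynamic-input
# ("type": "dynamic") overrides, as well as the exporter import path,
# are assumptions and may differ from the actual ModelOpt interface.
import copy

from transformers import AutoModelForSequenceClassification, AutoTokenizer

import modelopt.torch.quantization as mtq
from modelopt.torch._deploy.utils import get_onnx_bytes_and_metadata

model_name = "Qwen/Qwen2.5-1.5B"
model = AutoModelForSequenceClassification.from_pretrained(model_name).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Start from the default FP8 config, then switch weights to per-channel
# scaling and inputs to dynamic quantization (no calibrated amax).
cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)
cfg["quant_cfg"]["*weight_quantizer"] = {"num_bits": (4, 3), "axis": 0}
cfg["quant_cfg"]["*input_quantizer"] = {"num_bits": (4, 3), "type": "dynamic"}

def forward_loop(m):
    # Minimal calibration pass; the actual QAT happens in the training script.
    batch = tokenizer("hello world", return_tensors="pt").to("cuda")
    m(**batch)

model = mtq.quantize(model, cfg, forward_loop)

# Export to ONNX -- this is where the per-channel / empty-amax errors show up.
dummy_input = dict(tokenizer("hello world", return_tensors="pt").to("cuda"))
onnx_bytes, metadata = get_onnx_bytes_and_metadata(model, dummy_input)
```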

Who can help?

  • ModelOpt Team

System information

  • Container used (if applicable): Not used
  • OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 22.04
  • CPU architecture (x86_64, aarch64): x86_64
  • GPU name (e.g. H100, A100, L40S): H100
  • GPU memory size: 94GB
  • Number of GPUs: 1
  • Library versions (if applicable):
    • Python: 3.11.14
    • ModelOpt version or commit hash: 0.41.0
    • CUDA: ?
    • PyTorch: 2.9.1
    • Transformers: 4.52.4
    • TensorRT-LLM: Not installed
    • ONNXRuntime: 1.22.0
    • TensorRT: Not related
