Description
Make sure you have already checked the examples and documentation before submitting an issue.
How would you like to use ModelOpt
Export a QAT FP8 Qwen2.5-1.5B classification model, with per-channel weight quantization and dynamic input quantization, to ONNX format so that it can ultimately be run with TensorRT. Currently the exporter get_onnx_bytes_and_metadata complains that per-channel quantization is not supported.
It also complains about an empty amax for the input quantizer, which suggests it is not aware of the dynamic quantization applied to the inputs.
Can you clarify whether these features are indeed unsupported? If so, any pointers to other paths for running this model with TensorRT would be appreciated!
I have created a minimal reproduction case here; please give it a try.
https://onedrive.cloud.microsoft/:f:/a@ye2tf6tq/S/IgBdqhatiA5zQKVrh9cwJxb-AWgG9GXSxK8FBlhKGIoxNf8?e=bg2OpW
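For quick reference, below is a condensed sketch of the setup that triggers both errors. It assumes the standard mtq.quantize flow; the FP8 config keys (axis for per-channel weights, type: "dynamic" for dynamic activation quantization), the exporter import path, the dummy-input format, and the model/label choices are taken from my local repro and may need adjusting for other ModelOpt versions.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import modelopt.torch.quantization as mtq

MODEL_ID = "Qwen/Qwen2.5-1.5B"  # classification head added on top of the base model
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=2, torch_dtype=torch.float16
).cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 (E4M3) QAT config: per-channel weight quantization (axis=0) and
# dynamic input quantization (no calibrated amax on the input quantizer).
# The "axis" and "type" keys are how I specify this locally; they may differ
# across ModelOpt versions.
quant_cfg = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": (4, 3), "axis": 0},
        "*input_quantizer": {"num_bits": (4, 3), "type": "dynamic"},
    },
    "algorithm": "max",
}

def forward_loop(m):
    # Minimal calibration pass; the real repro uses a small labeled dataset.
    batch = tokenizer("calibration sample", return_tensors="pt").to(m.device)
    m(**batch)

model = mtq.quantize(model, quant_cfg, forward_loop)

# Export to ONNX -- this is where the per-channel and empty-amax errors appear.
# Import path and dummy-input format are assumptions based on my environment.
from modelopt.torch._deploy.utils import get_onnx_bytes_and_metadata

dummy_input = tokenizer("export sample", return_tensors="pt").to(model.device)
onnx_bytes, metadata = get_onnx_bytes_and_metadata(model, dummy_input)
```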
Who can help?
- ModelOpt Team
System information
- Container used (if applicable): Not used
- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 22.04
- CPU architecture (x86_64, aarch64): x86_64
- GPU name (e.g. H100, A100, L40S): H100
- GPU memory size: 94GB
- Number of GPUs: 1
- Library versions (if applicable):
- Python: 3.11.14
- ModelOpt version or commit hash: 0.41.0
- CUDA: ?
- PyTorch: 2.9.1
- Transformers: 4.52.4
- TensorRT-LLM: Not installed
- ONNXRuntime: 1.22.0
- TensorRT: Not related