INT8 quantized Flux.2 Klein 9B shows no significant speedup compared to BF16 on RTX 5060 Ti #12

@matalama80td3l

Description

My Environment

  • GPU: NVIDIA RTX 5060 Ti (Blackwell architecture)
  • OS: Ubuntu 24.04
  • PyTorch: 2.10
  • CUDA: 13.0
  • Model: Flux.2 Klein 9B
  • Quantization config: batch size 128, INT8

Issue Description

After quantizing Flux.2 Klein 9B to INT8 using the convert_to_quant tools, I observed that inference is not as fast as expected: I anticipated roughly a 2x speedup compared to BF16, but I see no significant improvement.
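For context, this is roughly how I timed both variants. The sketch below uses CUDA events for wall-clock timing; `load_model` stands in for the actual BF16/INT8 loading code (which goes through convert_to_quant and is omitted here):

```python
import torch

def benchmark(model, inputs, warmup=5, iters=20):
    """Return mean forward-pass latency in ms for a CUDA model."""
    # Warm up so kernel compilation/autotuning doesn't skew results
    with torch.no_grad():
        for _ in range(warmup):
            model(**inputs)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    with torch.no_grad():
        for _ in range(iters):
            model(**inputs)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# load_model(...) is a placeholder for the real loading code
bf16_ms = benchmark(load_model(dtype=torch.bfloat16), inputs)
int8_ms = benchmark(load_model(quantized=True), inputs)
print(f"BF16: {bf16_ms:.1f} ms, INT8: {int8_ms:.1f} ms")
```

Both runs measure the same inputs on the same GPU, and the INT8 numbers come out essentially on par with BF16.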

