Refactor quantization parameters and methods for text encoder and DiT models #1128
Conversation
Summary of Changes (Gemini Code Assist): This pull request significantly overhauls the FP8 quantization framework by adopting a modular, vLLM-inspired architecture. The primary goal is to enhance the maintainability and extensibility of quantization methods, making it easier to introduce new features such as static activation quantization or different kernel backends. The changes involve a clear separation of configuration, method implementation, and kernel dispatch, alongside standardizing naming conventions and improving integration with model loading processes, particularly for DiT models.
Code Review
This pull request is a significant and well-executed refactoring of the FP8 quantization logic, aligning it with vLLM's Config -> Method -> Kernel architecture. This greatly improves modularity, maintainability, and extensibility for future quantization work. The changes are comprehensive, including new abstractions for configuration, methods, and kernels, as well as thorough test coverage for the new components. The backward compatibility for existing flags and functions is also a nice touch. I have one suggestion to refactor a small piece of duplicated code to further improve maintainability.
… models

- Renamed `override_text_encoder_quant` to `text_encoder_quantization` for consistency.
- Introduced `dit_quantization` parameter for DiT model quantization.
- Updated argument parsing to reflect the new parameter names.
- Enhanced quantization methods to support FP8 and added new utility functions for quantization.
- Implemented a bridge for injecting FP8 parameters into `nn.Linear` modules for DiT models.
- Added tests for the new quantization methods and configurations.

This refactor aims to streamline the quantization process and improve compatibility with FP8 compute.
- Introduced new parameters for FP8 activation granularity and ignored layers in `FastVideoArgs`.
- Added functionality to detect FP8 weights from safetensors and configure models accordingly.
- Enhanced the quantization methods to support both offline and online FP8 processing.
- Updated `AbsMaxFP8Config` to delegate quantization methods to the new `Fp8LinearMethod`.
- Added tests to validate the new FP8 configurations and ensure correct behavior.

This update aims to improve the flexibility and performance of quantization in models using FP8.
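The commit above mentions detecting FP8 weights from safetensors. One common way to do this, sketched below, is to look for the per-layer scale tensors that offline-quantized checkpoints ship alongside their weights; the function name and heuristic here are illustrative assumptions, not the PR's actual code (which may instead inspect tensor dtypes directly):

```python
# Hypothetical sketch: detect an offline-FP8 checkpoint from its tensor
# names alone. Offline-quantized FP8 checkpoints typically ship per-layer
# scale tensors (weight_scale / input_scale) next to the quantized weights.

def looks_like_offline_fp8(tensor_names):
    """Return True if any tensor name carries the standard FP8 scale suffix."""
    return any(
        name.endswith(("weight_scale", "input_scale"))
        for name in tensor_names
    )

# Example key lists, as returned by safetensors' safe_open(...).keys():
fp8_keys = ["blocks.0.attn.qkv.weight", "blocks.0.attn.qkv.weight_scale"]
bf16_keys = ["blocks.0.attn.qkv.weight", "blocks.0.attn.qkv.bias"]
```

A dtype-based check (e.g. looking for `torch.float8_e4m3fn` tensors) would catch checkpoints that omit scale tensors, at the cost of reading tensor metadata.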
Force-pushed from d3c9487 to a54bb22
Motivation
FastVideo's existing FP8 path (`absmax_fp8.py`) works but is monolithic: quantization config, weight handling, and kernel dispatch are all tangled together. This makes it hard to extend (e.g., adding static activation quantization, new kernel backends, or per-layer granularity).
This PR refactors the FP8 quantization layer to follow vLLM's `Config → Method → Kernel` architecture, which cleanly separates concerns and makes future extensions straightforward.
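The layering can be sketched in a few lines of plain Python. All class and method names below are illustrative stand-ins (the real implementations live in `fp8.py` and `kernels/scaled_mm/`, and the real kernel wraps `torch._scaled_mm`):

```python
# Minimal, torch-free sketch of the Config -> Method -> Kernel layering.

class Fp8Config:
    """Declares *what* to quantize: dtypes, granularity, ignored layers."""
    def __init__(self, ignored_layers=()):
        self.ignored_layers = set(ignored_layers)

    def get_quant_method(self, layer_name):
        # The Config decides which Method (if any) applies to a layer.
        if layer_name in self.ignored_layers:
            return None
        return Fp8LinearMethod(self)

class Fp8LinearMethod:
    """Owns weight creation/loading and dispatches forward to a Kernel."""
    def __init__(self, config):
        self.config = config
        self.kernel = ScaledMMKernel()

    def apply(self, x, weight, weight_scale):
        return self.kernel.apply(x, weight, weight_scale)

class ScaledMMKernel:
    """Backend-specific GEMM; a dequantize-and-matmul stand-in for the
    fused FP8 GEMM that the real PyTorch backend performs."""
    def apply(self, x, weight, weight_scale):
        return [[sum(a * w * weight_scale for a, w in zip(row, col))
                 for col in zip(*weight)] for row in x]
```

With this split, adding a new backend means implementing only the Kernel interface, and adding a new scheme (e.g. static activation scales) means adding a Method without touching dispatch.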
What this PR does
New architecture (vLLM-aligned)
- `Fp8Config`: declares quantization settings (activation type, weight dtype, ignored layers)
- `Fp8LinearMethod` / `Fp8OnlineLinearMethod`: handle weight creation, loading, and forward dispatch
- `FP8ScaledMMLinearKernel`: abstract kernel interface with a PyTorch (`torch._scaled_mm`) backend
- `QuantFP8`: activation quantization (dynamic per-tensor scaling)
- `QuantKey` / `GroupShape` / `ScaleDesc`: granularity descriptors for future extensibility

Refactoring of existing code
- Moved DiT bridge logic (`scan_fp8_modules`, `prepare_model_for_fp8`) from `absmax_fp8.py` into `dit_fp8_bridge.py`; these inject FP8 into plain `nn.Linear` modules that don't use `LinearBase`
- Moved shared helpers (`supports_fp8_compute`, `quantize_input_dynamic`, etc.) into `fp8_utils.py`
- `absmax_fp8.py` re-exports the bridge functions for backward compatibility

Naming standardization
- Renamed `scale_weight` / `scale_input` to `weight_scale` / `input_scale` to match the checkpoint format and vLLM convention
- Cleaned up code in `ltx2.py` that was papering over the naming mismatch

CLI & loader integration
- `--override-dit-quant` → `--dit-quantization`, `--override-text-encoder-quant` → `--text-encoder-quantization` (old flags kept as backward-compatible aliases)
- `component_loader.py` now uses the quantization registry to instantiate any registered method

New files
- `fp8.py`: `Fp8Config`, `Fp8LinearMethod`, `Fp8OnlineLinearMethod`
- `fp8_utils.py`: shared FP8 helpers
- `input_quant_fp8.py`: `QuantFP8` activation quantization
- `dit_fp8_bridge.py`: FP8 injection for plain `nn.Linear` (DiT models)
- `kernels/scaled_mm/`: scaled-mm kernel interface and backend
- `utils/quant_utils.py`: `QuantKey`, `GroupShape`, `ScaleDesc` descriptors
- `tests/ops/quantization/test_fp8.py`: tests for the new components

Test plan
- `test_fp8.py`: kernel correctness, config/method integration, online quantization vs. bf16 reference
- `test_absmax_fp8.py`: existing tests pass with the renamed scale parameters

Next Steps
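For intuition on what the online-quantization tests compare against, here is a plain-Python sketch of dynamic per-tensor absmax scaling of the kind `QuantFP8` performs. This models only the scaling math, not FP8 E4M3's nonuniform rounding; the constant 448 is the real maximum representable magnitude of the E4M3 format, but the function names are illustrative:

```python
# Dynamic per-tensor activation quantization, absmax-scaled to the
# FP8 E4M3 range. Integer round() stands in for the actual FP8 cast.
FP8_E4M3_MAX = 448.0

def quantize_per_tensor(x):
    """Scale so the tensor's max |value| maps to FP8_E4M3_MAX, then
    round and clamp into the representable range."""
    amax = max(abs(v) for v in x) or 1.0
    scale = amax / FP8_E4M3_MAX
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, round(v / scale))) for v in x]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]
```

A correctness test of the kind listed above quantizes an activation, runs the quantized matmul, and checks the result against a bf16 (here: unquantized) reference within a tolerance.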
- Integrate `Fp8LinearMethod` into `LinearBase.forward` for end-to-end inference
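The planned integration point could look roughly like the following torch-free sketch: a linear layer that delegates to an attached quant method when one is present and otherwise falls back to the plain matmul. All names here are illustrative, not FastVideo's actual `LinearBase` API:

```python
# Hypothetical sketch of LinearBase.forward delegating to a quant method.

class LinearBase:
    def __init__(self, weight, quant_method=None):
        self.weight = weight          # [out, in] rows in this toy example
        self.quant_method = quant_method

    def forward(self, x):
        if self.quant_method is not None:
            # Quantized path: the method owns activation quantization
            # and kernel dispatch (torch._scaled_mm in the real code).
            return self.quant_method.apply(self, x)
        # Unquantized reference path: plain x @ W^T.
        return [[sum(a * w for a, w in zip(row, wrow))
                 for wrow in self.weight] for row in x]

class PassthroughQuantMethod:
    """Stand-in for Fp8LinearMethod: a correct quantized path should
    numerically approximate the reference matmul below."""
    def apply(self, layer, x):
        return LinearBase(layer.weight).forward(x)
```

Keeping the dispatch inside `forward` means model code never needs to know which quantization scheme (if any) is active.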