Update dependency cache_dit to v1.3.9#33
Open
renovate[bot] wants to merge 1 commit into
This PR contains the following updates:
cache_dit: ==1.1.10 → ==1.3.9
Release Notes
vipshop/cache-dit (cache_dit)
v1.3.9
What's Changed
Full Changelog: vipshop/cache-dit@v1.3.8...v1.3.9
v1.3.8
What's Changed
Full Changelog: vipshop/cache-dit@v1.3.7...v1.3.8
v1.3.7
What's Changed
Full Changelog: vipshop/cache-dit@v1.3.6...v1.3.7
v1.3.6
What's Changed
Full Changelog: vipshop/cache-dit@v1.3.5...v1.3.6
v1.3.5: Quantization
Low-bits Quantization
Overview
Quantization is a powerful technique to reduce the memory footprint and computational cost of deep learning models by representing weights and activations with lower precision data types. Cache-DiT supports various quantization methods, including FP8, INT8, and INT4 quantization, to help users achieve faster inference and lower memory usage while maintaining acceptable model performance.
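The memory saving described above can be estimated with simple arithmetic. The sketch below is not part of Cache-DiT; the function name and the 9B parameter count are hypothetical, and the estimate covers weight storage only (it ignores scales, activations, and any layers left unquantized).

```python
# Rough weight-storage estimate per precision (illustrative only).
# Ignores quantization scales, activations, and unquantized layers.
BITS_PER_WEIGHT = {"bf16": 16, "fp8": 8, "int8": 8, "int4": 4}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Approximate storage in GiB for num_params weights stored as dtype."""
    return num_params * BITS_PER_WEIGHT[dtype] / 8 / 2**30


num_params = 9_000_000_000  # hypothetical 9B-parameter transformer
for dtype in BITS_PER_WEIGHT:
    print(f"{dtype:>5}: {weight_memory_gib(num_params, dtype):.2f} GiB")
```

Under these assumptions, 8-bit formats halve the weight storage of a 16-bit model and INT4 quarters it.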
FP8 Quantization
TorchAO is now fully integrated into Cache-DiT as the backend for online quantization. You can quantize a model either by calling quantize directly or by passing a QuantizeConfig to the enable_cache API (recommended).
For GPUs with low memory capacity, we recommend using float8_per_row or float8_per_block, as these methods cause almost no loss in precision. Supported quantization types include:
Here are some examples of how to use quantization with cache-dit. You can directly specify the quantization config in the enable_cache API.
Users can also specify different quantization configs for different components. For example, quantize the transformer to float8_per_row and the text encoder to float8_weight_only.
Or, directly call the quantize API for more fine-grained control.
Please also enable torch.compile for better performance with quantization.
Users can set exclude_layers in QuantizeConfig to exclude some sensitive layers that are not robust to quantization, e.g., embedding layers. Layers that contain any of the keywords in the exclude_layers list will be excluded from quantization. For example:
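The keyword rule stated above ("layers that contain any of the keywords ... will be excluded") can be sketched in plain Python. This is a conceptual illustration, not Cache-DiT's implementation: the helper name should_quantize and the layer names are hypothetical, and plain substring matching is assumed.

```python
def should_quantize(layer_name: str, exclude_layers: list[str]) -> bool:
    """Skip a layer when its name contains any excluded keyword (assumed rule)."""
    return not any(keyword in layer_name for keyword in exclude_layers)


# Hypothetical layer names, for illustration only.
layer_names = [
    "x_embedder",
    "time_text_embed.timestep_embedder.linear_1",
    "transformer_blocks.0.attn.to_q",
    "transformer_blocks.0.ff.net.0.proj",
]
exclude_layers = ["embedder", "embed"]

selected = [name for name in layer_names if should_quantize(name, exclude_layers)]
print(selected)
```

Because matching is by substring in this sketch, a short keyword such as "embed" also excludes any layer whose name merely contains it, so keywords are best kept specific.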
By default, quant_type="float8_per_row" for better precision. Users can set it to "float8_per_tensor" to use per-tensor quantization for better performance on some hardware.
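To see why per-row scaling tends to preserve precision better than per-tensor scaling, consider how the scales are derived. The sketch below is a simplified illustration, not TorchAO's or Cache-DiT's code: it assumes symmetric absmax scaling into the float8 e4m3fn range (largest finite value 448), and the function names and weight values are made up.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def per_tensor_scale(weight):
    """One scale shared by every element of the weight matrix."""
    amax = max(abs(v) for row in weight for v in row)
    return amax / FP8_E4M3_MAX


def per_row_scales(weight):
    """One scale per output row, so each row uses the full FP8 range."""
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in weight]


# Hypothetical weights: one small-magnitude row and one large-magnitude row.
weight = [
    [0.01, -0.02, 0.015],
    [4.0, -8.0, 2.0],
]
print("per-tensor scale:", per_tensor_scale(weight))
print("per-row scales:  ", per_row_scales(weight))
```

With a single shared scale, the small-magnitude row is squeezed into a tiny part of the representable range. Per-row scaling gives that row its own, much smaller scale, at the cost of storing and applying one scale per row.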
Regional Quantization
Cache-DiT also supports regional quantization, which lets users quantize only the repeated blocks in a transformer. This can help balance precision and efficiency. Users can specify the blocks to be quantized via the regional_quantize and repeated_blocks arguments in QuantizeConfig. For example, to quantize the repeated blocks of the Flux2 transformer:
FP8 Per-Tensor Fallback
The per_tensor_fallback option in Cache-DiT's quantization configuration allows users to enable a fallback mechanism for layers that do not support float8 per-row or per-block quantization. This is particularly useful in scenarios where tensor parallelism is applied, and certain layers (e.g., those applied with RowwiseParallel) may encounter memory layout mismatch errors when quantized to float8 per-row.
When per_tensor_fallback is set to True, if a layer cannot be quantized to float8 per-row or per-block, it will automatically fall back to float8 per-tensor quantization instead of raising an error. This ensures that the quantization process can continue smoothly without interruption, while still providing the benefits of reduced precision for supported layers.
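The fallback decision described above can be sketched as a small planning function. This is a conceptual model only, not Cache-DiT's implementation: plan_quantization and count_quantized are hypothetical helpers, and the toy model of 144 layers (56 of which cannot use per-row quantization) simply mirrors the layer counts quoted in these release notes.

```python
def plan_quantization(layers, per_tensor_fallback=True):
    """Map each layer name to a decision.

    layers: list of (name, supports_per_row) tuples.
    """
    plan = {}
    for name, supports_per_row in layers:
        if supports_per_row:
            plan[name] = "float8_per_row"
        elif per_tensor_fallback:
            plan[name] = "float8_per_tensor"  # fall back instead of skipping
        else:
            plan[name] = "skipped"
    return plan


def count_quantized(plan):
    """Number of layers that end up quantized in any float8 mode."""
    return sum(1 for decision in plan.values() if decision != "skipped")


# Toy model: 144 layers, the first 56 do not support per-row quantization.
layers = [(f"layer_{i}", i >= 56) for i in range(144)]

with_fallback = plan_quantization(layers, per_tensor_fallback=True)
without_fallback = plan_quantization(layers, per_tensor_fallback=False)
print("with fallback:   ", count_quantized(with_fallback), "quantized")
print("without fallback:", count_quantized(without_fallback), "quantized")
```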
To enable this feature, set the per_tensor_fallback flag to True (the default) in the QuantizeConfig when calling the enable_cache API. It is currently supported only for float8 quantization. For example:
For example, without the FP8 per-tensor fallback, Cache-DiT automatically skips the layers that do not support float8 per-row quantization and raises a warning for each of them. Performance is worse because fewer layers are quantized (88 layers quantized, 56 skipped):

```bash
# w/o fp8 per-tensor fallback: quantize 88 layers, skip 56 layers, performance downgrade.
torchrun --nproc_per_node=2 -m cache_dit.generate flux2_klein_9b_kv_edit \
  --parallel tp --compile --float8-per-row --q-verbose \
  --disable-per-tensor-fallback
```

With the FP8 per-tensor fallback enabled, the layers that do not support float8 per-row quantization are quantized to float8 per-tensor instead, and performance is better because more layers are quantized (144 layers quantized, 0 skipped):

```bash
# w/ fp8 per-tensor fallback enabled: quantize 144 layers, skip 0 layers, better performance.
torchrun --nproc_per_node=2 -m cache_dit.generate flux2_klein_9b_kv_edit \
  --parallel tp --compile --float8-per-row --q-verbose
```

(Hybrid) Precision Plan
The precision_plan option in QuantizeConfig allows users to specify different quantization types for matched layer-name patterns. It is useful when you want better control of the accuracy and performance trade-off for attention sub-layers (for example, keep to_k/to_v in float8_per_row while using float8_per_tensor for to_q/to_out). Please note:
For example: (FLUX.2-Klein-9b-kv)
Then, the output summary will show the quantization type for each layer, and users can verify the quantization plan is applied correctly.
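As a conceptual illustration of a precision plan, the sketch below resolves a quantization type for each layer name from a pattern table. It is not Cache-DiT's implementation: resolve_quant_type and the layer names are hypothetical, and substring matching with first-match-wins is an assumption, since the exact pattern semantics are not given here. The plan itself follows the to_k/to_v versus to_q/to_out split mentioned above.

```python
def resolve_quant_type(layer_name, precision_plan, default="float8_per_row"):
    """Return the type of the first pattern found in layer_name, else default.

    Assumption: patterns are plain substrings and the first match wins.
    """
    for pattern, quant_type in precision_plan.items():
        if pattern in layer_name:
            return quant_type
    return default


precision_plan = {
    "to_q": "float8_per_tensor",
    "to_out": "float8_per_tensor",
    "to_k": "float8_per_row",
    "to_v": "float8_per_row",
}

# Hypothetical layer names, for illustration only.
layer_names = [
    "transformer_blocks.0.attn.to_q",
    "transformer_blocks.0.attn.to_k",
    "transformer_blocks.0.attn.to_v",
    "transformer_blocks.0.attn.to_out.0",
    "transformer_blocks.0.ff.linear_in",
]

resolved = {name: resolve_quant_type(name, precision_plan) for name in layer_names}
for name, quant_type in resolved.items():
    print(f"{name:<40} {quant_type}")
```

Printing the resolved table plays the same role as the output summary: it makes it easy to check that each layer received the intended type.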
INT8/INT4 Quantization
In addition to FP8 quantization, Cache-DiT also supports INT8 and INT4 quantization for weights, which can further reduce the memory footprint of the model. Users can specify int8_per_row, int8_per_tensor, int8_weight_only, or int4_weight_only as the quantization type in the QuantizeConfig when calling the enable_cache API. For example:
INT4 quantization can reduce memory further than FP8 or INT8, but it may cause more precision loss. We recommend that users try different quantization types and choose the one that best fits their trade-off between performance and precision. In most cases, float8 per-row is a good choice that reduces memory while maintaining acceptable precision.
Please note that users must also install the mslk kernel library to enable the INT8/INT4 quantization features. The int4_weight_only w4a16 compute kernel requires architectures >= sm90 (Hopper or newer, TMA required). On older architectures, users can use int8_weight_only quantization for better compatibility.
For distributed inference (context parallelism or tensor parallelism), we recommend float8 quantization to avoid potential compatibility issues.
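The guidance above can be condensed into a small selection helper. This is only a sketch of the stated recommendations, not a Cache-DiT API: suggest_quant_type is a hypothetical name, it assumes the user wants the lowest-bit weight-only option the hardware supports, and it takes the CUDA compute capability as a (major, minor) tuple.

```python
def suggest_quant_type(compute_capability: tuple[int, int], distributed: bool) -> str:
    """Suggest a quant type from the recommendations in these notes.

    compute_capability: CUDA (major, minor), e.g. (9, 0) for sm90.
    distributed: True when context or tensor parallelism is used.
    """
    if distributed:
        # float8 is recommended for distributed inference.
        return "float8_per_row"
    if compute_capability >= (9, 0):
        # The int4_weight_only kernel requires sm90 or newer.
        return "int4_weight_only"
    # Older architectures: int8 weight-only for compatibility.
    return "int8_weight_only"


print(suggest_quant_type((9, 0), distributed=False))
print(suggest_quant_type((8, 6), distributed=False))
print(suggest_quant_type((9, 0), distributed=True))
```

With PyTorch installed, the capability tuple can be obtained from torch.cuda.get_device_capability(); the returned string would then be used as the quantization type in QuantizeConfig.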
Nunchaku (W4A4)
Cache-DiT natively supports the Hybrid Cache + Nunchaku + Context Parallelism scheme. Users can combine caching and context parallelism to speed up Nunchaku 4-bit W4A4 models.
v1.3.4
hotfix
v1.3.3
hotfix
v1.3.2
hotfix release for fp8 per-row quantization w/ tensor parallel
Full Changelog: vipshop/cache-dit@v1.3.1...v1.3.2
v1.3.1
What's Changed
Full Changelog: vipshop/cache-dit@v1.3.0...v1.3.1
v1.3.0: USP, 2D/3D Parallel, FP8 Blockwise, ...
v1.3.0 Major Release: USP, 2D/3D Parallel, FP8 Blockwise, ...
Cache-DiT v1.3.0 is a major release after v1.2.0. The major changes include:
enable_cache API
Full Changelog: vipshop/cache-dit@v1.2.0...v1.3.0
v1.2.3
What's Changed
Full Changelog: vipshop/cache-dit@v1.2.2...v1.2.3
v1.2.2
What's Changed
Full Changelog: vipshop/cache-dit@v1.2.1...v1.2.2
v1.2.1: USP, 2D/3D Parallel
🎉 The v1.2.1 release is ready. The major updates include: Ring Attention w/ batched P2P, USP (Hybrid Ring and Ulysses), Hybrid 2D and 3D Parallelism (💥USP + TP), and reduced VAE-P communication overhead.
What's Changed
Configuration
📅 Schedule: (UTC)
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.