When training with large max_token_length values (e.g., 225, 375, 525, or even 750), the script currently pads all captions to the maximum defined length. For example, if I set max_token_length=750 but a batch only contains short captions (~100 tokens), padding them all to 750 (10 chunks) is extremely inefficient and causes OOM errors on limited VRAM.
I request a generalized dynamic padding feature that pads the batch to the nearest multiple of 75 based on the longest prompt in the current batch.
Logic: Let $L$ be the max token length in the current batch. The target padding length should be: $\lceil L / 75 \rceil \times 75$.
Examples:
Max length $\le$ 75 $\rightarrow$ Pad to 75 (1 chunk).
Max length 100 $\rightarrow$ Pad to 150 (2 chunks).
Max length 300 $\rightarrow$ Pad to 300 (4 chunks).
Max length 700 $\rightarrow$ Pad to 750 (10 chunks).
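The rounding rule above can be sketched in a few lines. This is a minimal illustration, not the script's actual API; the function name, parameters, and defaults here are hypothetical:

```python
import math

def dynamic_pad_length(batch_token_lengths, max_token_length=750, chunk=75):
    """Round the longest prompt in the batch up to the next multiple of
    the CLIP chunk size (75 tokens), capped at max_token_length.

    batch_token_lengths: token counts of each caption in the batch.
    Hypothetical helper for illustration only.
    """
    longest = max(batch_token_lengths)
    # Ceiling-divide to the next 75-token boundary, but never below one chunk.
    target = max(math.ceil(longest / chunk) * chunk, chunk)
    # Never exceed the user-configured maximum.
    return min(target, max_token_length)
```

For example, a batch whose longest caption is 100 tokens would be padded to 150 (2 chunks) instead of the full 750, while a batch with a 700-token caption would still get the full 750 (10 chunks). The `min` cap ensures a caption longer than `max_token_length` is still truncated to the configured limit.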
This logic scales correctly for any max_token_length (e.g., 225, 375, 750) and ensures we always respect the CLIP model structure while maximizing VRAM efficiency.
I tried --no_token_padding, but it breaks the 75-token chunk structure required by CLIP (e.g., it passes 80 tokens directly instead of padding to 150), which causes quality degradation. Hardcoding a fixed length like 225 is not flexible enough for longer-context training.