When training with large max_token_length values (e.g., 225, 375, 525, or even 750), the script currently pads all captions to the maximum defined length. For example, if I set max_token_length=750 but a batch only contains short captions (~100 tokens), padding them all to 750 (10 chunks) is extremely inefficient and causes OOM errors on limited VRAM.
I request a generalized dynamic padding feature that pads the batch to the nearest multiple of 75 based on the longest prompt in the current batch.
Logic: Let $L$ be the max token length in the current batch. The target padding length should be: $\lceil L / 75 \rceil \times 75$.
Examples:
Max length $\le$ 75 $\rightarrow$ Pad to 75 (1 chunk).
Max length 100 $\rightarrow$ Pad to 150 (2 chunks).
Max length 300 $\rightarrow$ Pad to 300 (4 chunks).
Max length 700 $\rightarrow$ Pad to 750 (10 chunks).
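The rounding rule above can be sketched in a few lines. This is a minimal illustration, not the script's actual API; the function name, parameters, and defaults here are hypothetical:

```python
import math

def dynamic_pad_length(batch_token_lengths, max_token_length=750, chunk=75):
    """Round the longest prompt in the batch up to the next multiple of
    the CLIP chunk size (75 tokens), capped at max_token_length.

    batch_token_lengths: token counts of each caption in the batch.
    Hypothetical helper for illustration only.
    """
    longest = max(batch_token_lengths)
    # Ceiling-divide to the next 75-token boundary, but never below one chunk.
    target = max(math.ceil(longest / chunk) * chunk, chunk)
    # Never exceed the user-configured maximum.
    return min(target, max_token_length)
```

For example, a batch whose longest caption is 100 tokens would be padded to 150 (2 chunks) instead of the full 750, while a batch with a 700-token caption would still get the full 750 (10 chunks). The `min` cap ensures a caption longer than `max_token_length` is still truncated to the configured limit.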
This logic scales correctly for any max_token_length (e.g., 225, 375, 750) and ensures we always respect the CLIP model structure while maximizing VRAM efficiency.
I tried --no_token_padding, but it breaks the 75-token chunk structure required by CLIP (e.g., it passes 80 tokens directly instead of padding to 150), which causes quality degradation. Hardcoding a fixed length like 225 is not flexible enough for longer-context training.