[Feature Request] Dynamic padding to nearest multiple of 75 (e.g., 75/150.../750) for large token limits #2256

@TLFZ1

Description

When training with large max_token_length values (e.g., 225, 375, 525, or even 750), the script currently pads all captions to the maximum configured length. For example, if I set max_token_length=750 but a batch contains only short captions (~100 tokens), padding everything to 750 (10 chunks) is extremely inefficient and causes OOM errors on limited VRAM.

I request a generalized dynamic padding feature that pads the batch to the nearest multiple of 75 based on the longest prompt in the current batch.

Logic: Let $L$ be the maximum token length in the current batch. The target padding length should be $\lceil L / 75 \rceil \times 75$.

Examples:
Max length $\le$ 75 $\rightarrow$ Pad to 75.
Max length 100 $\rightarrow$ Pad to 150 (2 chunks).
Max length 300 $\rightarrow$ Pad to 300 (4 chunks).
Max length 700 $\rightarrow$ Pad to 750 (10 chunks).

This logic scales correctly for any max_token_length (e.g., 225, 375, 750) and ensures we always respect the CLIP model structure while maximizing VRAM efficiency.
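As a minimal sketch of the round-up logic (the function names and the collate-style helper below are hypothetical, not part of the existing sd-scripts API):

```python
import math

def round_up_to_chunk(max_len: int, chunk: int = 75) -> int:
    """Round a token length up to the next multiple of the 75-token CLIP chunk size."""
    # ceil(L / 75) * 75, with a floor of one chunk
    return max(chunk, math.ceil(max_len / chunk) * chunk)

def target_padding_length(token_lengths, max_token_length: int, chunk: int = 75) -> int:
    """Pad the batch to the nearest chunk multiple covering its longest caption,
    never exceeding the user-configured max_token_length."""
    longest = max(token_lengths)
    return min(round_up_to_chunk(longest, chunk), max_token_length)

# Examples from above:
print(round_up_to_chunk(100))  # 150 (2 chunks)
print(round_up_to_chunk(300))  # 300 (4 chunks)
print(round_up_to_chunk(700))  # 750 (10 chunks)
```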

I tried --no_token_padding, but it breaks the 75-token chunk structure that CLIP requires (e.g., passing 80 tokens directly instead of padding to 150), which causes quality degradation. Hardcoding a fixed length such as 225 is not flexible enough for longer-context training.
