AUDIO OPTIMIZATIONS AMONGST OTHER THINGS by ArtDesignAwesome · Pull Request #720 · ostris/ai-toolkit

ArtDesignAwesome · 2026-02-21T01:52:46Z

I WAS GETTING CHEWED OUT BY THE COMMUNITY FOR CREATING MY OWN FORK, NO DISRESPECT MEANT. HERE ARE MY CHANGES:

LTX-2 Voice Training Fix for Ostris AI-Toolkit

Voice training for LTX-2 character LoRAs is broken out of the box. This patch fixes it.

If you've tried training an LTX-2 character LoRA and your output has garbled, silent, or completely wrong audio — this is why, and this is the fix.

The Problem

LTX-2 is a joint audio+video diffusion transformer. When you train a character LoRA, the model should learn both the person's appearance AND their voice. In practice, every single person training LTX-2 character LoRAs in ostris/ai-toolkit gets broken audio. The LoRA produces correct visuals but the voice is destroyed.

This isn't a settings issue. There are 25 bugs and design flaws in the training pipeline that collectively make voice training impossible.

What Was Wrong (The Big Ones)

1. Audio and video shared the same timestep during training

LTX-2's transformer has completely separate timestep processing for audio (audio_adaln_single) and video (adaln_single). The training code sampled ONE random timestep and fed it to BOTH. This means audio never explored its own noise schedule — voice learning was fundamentally broken at the architecture level.

Fix: Independent audio timestep sampling. Each training step now samples a separate random timestep for audio. Zero computational cost. Massive quality impact.

2. Audio was never actually extracted from your videos

On Windows and Pinokio installs, torchaudio.load() is completely broken. The newer versions use torchcodec as a backend, which requires FFmpeg shared libraries that Pinokio doesn't install. The error was silently caught, so training appeared to work but every single video had its audio silently dropped.

Fix: Multi-fallback audio extraction that tries torchaudio first, then PyAV (which bundles its own FFmpeg and is already in requirements.txt), then ffmpeg CLI as a last resort. Audio extraction now works on every platform.

3. Stale latent cache had no audio in it

If you ever ran training before this fix, your latent cache files don't contain audio_latent. The cache loader only checked if the file existed, not whether it contained audio. So even after fixing audio extraction, the old cache was still being used — with no audio.

Fix: Cache validation now checks for the audio_latent key inside safetensors files. Missing audio = cache invalidated and re-encoded.

4. Video loss drowned out audio loss

Video loss magnitude is much larger than audio loss. Without balancing, the optimizer effectively ignores audio gradients entirely.

Fix: EMA-based dynamic audio loss balancing that targets audio at ~33% of total loss. Adapts automatically over training.

5. Audio-free batches caused voice forgetting

When a batch had no audio (e.g., image-only batches in mixed datasets), zero audio loss was computed. This let the LoRA drift away from the base model's voice generation ability — catastrophic forgetting.

Fix: Synthetic silence regularizer for audio-free batches. Forces the LoRA to preserve base audio behavior even on batches without audio data.

6. DoRA and quantized models crashed immediately

Using DoRA (weight-decomposed LoRA) with quantized models (qfloat8) caused crashes from AffineQuantizedTensor dispatch errors, dtype mismatches in SDPA attention, and derivative for dequantize is not implemented errors in the backward pass. The root causes were spread across 6 different files.

Fix: Precise quantization type detection, safe forward wrappers, dtype enforcement in the memory manager, and SDPA attention mask safety nets. DoRA + quantization + layer offloading now works end to end.

7. Auto-balance audio loss was broken — could only boost, never dampen

The dynamic audio loss balancer was designed to keep audio and video loss in proportion. But the multiplier was clamped at minimum 1.0, meaning it could only INCREASE audio loss. In practice, LTX-2's raw audio loss is naturally ~2x larger than video loss. The computed multiplier (~0.18) was always clamped back to 1.0. The feature was dead code — dyn_mult showed 1.00 for the entire training run.

Fix: Changed the clamp floor from 1.0 to 0.05. The multiplier now works bidirectionally — dampening audio when it's already dominant (common case), boosting when it's too small. dyn_mult actively adjusts throughout training.

8. Multiple config/runtime crashes on LTX-2

self.train_config accessed on the model object instead of the trainer — crash on step 0
min_snr_gamma incompatible with flow-matching schedulers — crash on loss calculation
print_and_status_update called on wrong object — crash on audio logging
noise_offset rejected 5D video tensors — crash when enabled
torch.compile pointed at unet instead of transformer — no effect on DiT models

All fixed.

What We Added

Core Audio Fixes

Independent audio timestep sampling (the single biggest voice quality improvement)
Bidirectional automatic audio loss balancing (EMA-based, dampens when audio > video, boosts when video > audio)
Robust multi-fallback audio extraction (torchaudio -> PyAV -> ffmpeg CLI)
Latent cache audio validation and automatic invalidation
Voice preservation regularizer for audio-free batches
Connector gradient unfreezing for text-to-audio adaptation
Scalar audio loss (saves VRAM vs storing full prediction tensors)
Audio loss logging every 10 steps

Quality Improvements

Rank/module dropout support (prevents overfitting on small datasets)
Higher default rank (32 instead of 4 — needed for joint audio+video identity)
DoRA and LoKr network type support
Gradient checkpointing wired for LTX-2 (30-50% VRAM savings)
Flash/memory-efficient attention enforcement
Video-safe noise offset
Caption dropout support
Cosine annealing / cosine with restarts / linear decay LR schedulers
Prodigy and DAdaptation optimizer support

Compatibility Fixes

DoRA + TorchAO quantization (qfloat8) fully working
Layer offloading + quantization dtype enforcement
SDPA attention mask dtype safety for mixed-precision paths
Min-SNR guard for flow-matching schedulers
torch.compile targeting transformer instead of unet
All train_config access made safe for model vs trainer context
Precise quantization type detection (fixes PyTorch 2.9+ dequantize issue)

Installation

Option 1: Copy Modified Files (Recommended)

Copy these files from the release into your ai-toolkit installation, replacing the originals:

toolkit/config_modules.py
toolkit/train_tools.py
toolkit/dataloader_mixins.py
toolkit/data_transfer_object/data_loader.py
toolkit/models/DoRA.py
toolkit/network_mixins.py
toolkit/memory_management/manager_modules.py
extensions_built_in/diffusion_models/ltx2/ltx2.py
extensions_built_in/sd_trainer/SDTrainer.py
jobs/process/BaseSDTrainProcess.py
ui/src/types.ts
ui/src/app/jobs/new/options.ts
ui/src/app/jobs/new/SimpleJob.tsx
ui/src/docs.tsx

Option 2: Apply Patches

Patch files are included in the release zip for those who prefer git apply.

Important: Delete Your Latent Cache

If you've trained before, your cached latents don't have audio in them. Delete your latent cache folder and let it re-encode. The new code will extract audio properly via PyAV and include audio_latent in the cache files.

Recommended Config

LoRA (Recommended — Fast + High Quality)

job: extension
config:
  name: "LTX2_character_lora"
  process:
    - type: "sd_trainer"
      training_folder: "output/my_character"
      device: cuda:0
      trigger_word: "ohwx"
      network:
        type: "lora"
        linear: 32
        linear_alpha: 32
        rank_dropout: 0.1
      save:
        dtype: bf16
        save_every: 250
      datasets:
        - folder_path: "/path/to/your/video/clips"
          num_frames: 81
          do_audio: true
          caption_ext: "txt"
          caption_dropout_rate: 0.05
      train:
        batch_size: 1
        steps: 3000
        gradient_accumulation_steps: 1
        optimizer: "adamw8bit"
        lr: 1e-4
        dtype: bf16
        noise_scheduler: "custom_flowmatch"
        timestep_type: "sigmoid"
        auto_balance_audio_loss: true
        independent_audio_timestep: true
        noise_offset: 0.05
        min_snr_gamma: 0
      model:
        name_or_path: "Lightricks/LTX-2"
        low_vram: true
        quantize: "qfloat8"
      sample:
        sample_every: 250
        width: 512
        height: 320
        num_frames: 81
        sample_steps: 25
        guidance_scale: 3.0
        prompts:
          - "ohwx speaking to the camera about artificial intelligence"

Layer Offloading (If You Need It)

If you're running out of VRAM, add to the model section:

      model:
        layer_offloading: true
        layer_offloading_transformer_percent: 0.56

This offloads 56% of transformer layers to CPU. Expect ~20s/it on RTX 5090 with LoRA. Increase the percentage if you still OOM, decrease for speed.

Note: torch.compile is incompatible with layer_offloading. Don't use both.

DoRA Variant (Higher Quality, Slower, More VRAM)

Replace the network section:

      network:
        type: "dora"
        linear: 32
        linear_alpha: 32
        rank_dropout: 0.1

DoRA decomposes weight updates into magnitude and direction components, which can produce higher quality results. However, it requires significantly more VRAM (you'll need higher layer offloading %, which slows training). For most users, LoRA rank 32 is the sweet spot.

How to Verify Audio Is Training

During training, look for this line in your console output:

[audio] raw=0.01234, scaled=0.05678, video=1.23456

If you see it, audio loss is actively being computed and your character's voice is being learned. If you don't see it, something is wrong with your audio pipeline — check that:

Your video files actually contain audio tracks
do_audio: true is set in your dataset config
You deleted your old latent cache and let it re-encode

After caching, you should see:

Audio latent caching: 24 encoded, 0 failed

(Where 24 is your number of video clips.)

Performance Expectations

GPU	Layer Offload	Network	Expected Speed
RTX 5090 (32GB)	None	LoRA rank 32	~2-5 it/s
RTX 5090 (32GB)	40%	LoRA rank 32	~20s/it
RTX 5090 (32GB)	56%	LoRA rank 32	~30s/it
RTX 5090 (32GB)	56%	DoRA rank 32	~60s/it
RTX 4090 (24GB)	50-60%	LoRA rank 32	~30-45s/it

Training speed is dominated by layer offloading (CPU-GPU memory transfer over PCIe). Reduce offloading percentage for speed, increase for VRAM savings.

Dataset Tips

20-50 video clips of your character speaking, 4-8 seconds each
Clips should have clear audio — the character's voice should be the dominant sound
Variety in lighting, angles, and expressions helps generalization
Caption each clip accurately, describing the visual content
Use a distinctive trigger word (e.g., ohwx) that doesn't conflict with the base model vocabulary
flip_x: true doubles your effective dataset size (don't use for text-heavy content)

FAQ

Q: Do I need to do anything special for audio?
A: Just set do_audio: true in your dataset config and make sure your video files have audio tracks. Everything else is automatic.

Q: Can I use my existing video dataset?
A: Yes, as long as the videos have audio. Delete your old latent cache first so it re-encodes with audio.

Q: LoRA or DoRA?
A: LoRA rank 32 for most users. It's 3x faster and uses significantly less VRAM. DoRA may produce marginally higher quality but requires much more memory and time.

Q: What about LoKr?
A: Supported but less tested with the audio fixes. LoRA is recommended.

Q: My training shows 0 audio loss / no audio line in logs?
A: Your audio isn't being extracted. Delete latent cache, confirm videos have audio, confirm do_audio: true.

Q: Can I use torch.compile?
A: Only if you're NOT using layer_offloading. They're mutually exclusive due to how layer offloading mutates GPU buffers during forward passes.

Q: What's independent_audio_timestep?
A: The single most important fix. LTX-2's transformer processes audio and video noise schedules independently, but the training code was feeding the same random timestep to both. This decouples them so audio can learn at its own optimal noise level. Always leave this true.

Q: Why is min_snr_gamma set to 0?
A: Min-SNR loss weighting requires alphas_cumprod from DDPM-style schedulers. LTX-2 uses a flow-matching scheduler that doesn't have this. Setting it to anything > 0 would crash. The timestep_type: sigmoid or weighted setting provides equivalent loss balancing for flow-matching models.

Files Modified (16 files)

File	What Changed
`toolkit/config_modules.py`	New config fields for audio, dropout, timestep, schedulers
`toolkit/train_tools.py`	Video-safe noise offset (4D + 5D tensor support)
`toolkit/dataloader_mixins.py`	Robust audio extraction, cache validation, PyAV fallback
`toolkit/data_transfer_object/data_loader.py`	Mixed-batch audio handling, audio_loss field
`toolkit/models/DoRA.py`	TorchAO quantization support, precise type checks
`toolkit/network_mixins.py`	Safe forward wrapper, dtype preservation, quantization fixes
`toolkit/memory_management/manager_modules.py`	Layer offloading dtype enforcement
`extensions_built_in/diffusion_models/ltx2/ltx2.py`	Independent audio timestep, gradient checkpointing, SDPA fixes, attention mask safety
`extensions_built_in/sd_trainer/SDTrainer.py`	Audio loss pipeline, auto-balancing, flow-matching guard
`jobs/process/BaseSDTrainProcess.py`	Dropout passthrough, compile fix, dynamic noise offset
`ui/src/types.ts`	TypeScript types for new config fields
`ui/src/app/jobs/new/options.ts`	LTX-2 defaults (rank 32, auto-balance, etc.)
`ui/src/app/jobs/new/SimpleJob.tsx`	UI controls for all new features
`ui/src/docs.tsx`	Documentation for all new fields

Credits

Built on top of Ostris AI-Toolkit. All changes are backward compatible — old configs without new keys work identically to before.

25 bugs identified and fixed. Zero new dependencies added. All features use existing PyTorch, torchaudio, diffusers, and PyAV APIs.

Voice training for LTX-2 character LoRAs was completely broken. Audio was silently dropped during extraction (torchaudio/torchcodec failure on Windows/Pinokio), cached latents had no audio data, video and audio shared a single timestep despite the transformer having independent processing paths, and video loss drowned out audio gradients entirely. Core fixes: - Independent audio timestep sampling (decoupled audio/video noise schedules) - Multi-fallback audio extraction (torchaudio → PyAV → ffmpeg CLI) - Latent cache audio validation and automatic invalidation - EMA-based dynamic audio loss balancing - Voice preservation regularizer for audio-free batches - Connector gradient unfreezing for text-to-audio adaptation Compatibility fixes: - DoRA + TorchAO quantization (qfloat8) fully working - Layer offloading dtype enforcement (eliminates mat dtype crashes) - SDPA attention mask dtype safety net (patches torch.nn.functional.scal_product_attention) - Safe train_config access (model vs trainer context) - Min-SNR guard for flow-matching schedulers - Precise quantization type checks (fixes PyTorch 2.9+ dequantize backward crash) - Video-safe noise offset (5D tensor support) - torch.compile targeting transformer instead of unet Quality/UX improvements: - Rank/module dropout, gradient checkpointing for LTX-2 - DoRA and LoKr network type support - Higher default rank (32), LR schedulers, caption dropout - Audio loss logging, strict audio mode - Full UI integration for all new fields 16 files changed across toolkit core, LTX-2 model, SDTrainer, and UI. Zero new dependencies. Fully backward compatible.

dyn_mult was stuck at 1.00 because audio loss naturally exceeds video loss on LTX-2. The old clamp max(1.0, ...) prevented dampening. Now max(0.05, ...) allows the multiplier to scale audio down when it dominates.

This major commit completely overhauls two critical components of the repository: 1. AUDIO TRAINING EXCELLENCE: - Integrated full VAE encoding/decoding and log-mel spectrogram conversion for LTX-2. - Accurately processing temporal, audio, and spatial latents natively through the DiT architecture. - Introduced `ComboVae` and `AudioProcessor` for end-to-end audio embedding. - Exposed `audio_a2v_cross_attn` blocks for precise sound/voice fine-tuning. 2. OMNI-MERGE (DO-MERGE 2026 FRAMEWORK): - Eliminated catastrophic character/concept bleeding with Bilateral Subspace Orthogonalization (BSO). - Dynamically isolates prompt triggers, video (motion) pathways, and audio signatures completely mathematically. - Deployed Magnitude/Direction Decoupling (DO-Merge) for structural MLP layers, preventing "loud" LoRAs from crushing weaker ones. - Completely exact rank retention (`Rank A + Rank B` concatenation) without lossy SVD. - Overhauled the Next.js Merge UI to natively trigger the DO-Merge backend with live, real-time `.status` polling, bypassing the legacy Prisma queue entirely. Added comprehensive `RELEASE_NOTES_v1.0_LTX2_OMNI_AUDIO.md` documenting the new architecture. Co-authored-by: Cursor <cursoragent@cursor.com>

…phy and Omni-Merge features Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

…ion (Exact Rank Concatenation) Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

…e them correctly Co-authored-by: Cursor <cursoragent@cursor.com>

…erge) Co-authored-by: Cursor <cursoragent@cursor.com>

…0 SVD approximations, instant execution) Co-authored-by: Cursor <cursoragent@cursor.com>

…s serialization crash with transposed matrices Co-authored-by: Cursor <cursoragent@cursor.com>

…. Eliminates massive matrix allocations while keeping 100% precision Co-authored-by: Cursor <cursoragent@cursor.com>

…arsing Co-authored-by: Cursor <cursoragent@cursor.com>

sintspiden · 2026-03-03T13:30:39Z

I can confirm this does work for training audio: https://streamable.com/8nh2cw

Jonathan and others added 14 commits February 20, 2026 15:01

Merge branch 'ostris:main' into main

878d484

fix: bidirectional auto-balance audio loss — clamp floor 1.0 → 0.05

949a8b8

dyn_mult was stuck at 1.00 because audio loss naturally exceeds video loss on LTX-2. The old clamp max(1.0, ...) prevented dampening. Now max(0.05, ...) allows the multiplier to scale audio down when it dominates.

docs: Rewrite README to reflect BIG DADDY VERSION open source philoso…

21e4e49

…phy and Omni-Merge features Co-authored-by: Cursor <cursoragent@cursor.com>

fix: Remove rogue JSX syntax error in merge page

c31c1fc

Co-authored-by: Cursor <cursoragent@cursor.com>

fix: Remove dynamic rank compression to guarantee 100% data preservat…

bc6f644

…ion (Exact Rank Concatenation) Co-authored-by: Cursor <cursoragent@cursor.com>

fix: Remove dead ZipLoRA import that crashes MergeJob on startup

0a21dca

Co-authored-by: Cursor <cursoragent@cursor.com>

fix: Properly parse PEFT lora_A/lora_B keys to enable merging and sav…

e6069b8

…e them correctly Co-authored-by: Cursor <cursoragent@cursor.com>

perf: Implement randomized low-rank SVD (100x speedup for structure m…

4fd30c8

…erge) Co-authored-by: Cursor <cursoragent@cursor.com>

perf: Mathematical exact bypass for DO-Merge (100% preservation with …

a81f0b7

…0 SVD approximations, instant execution) Co-authored-by: Cursor <cursoragent@cursor.com>

fix: Add .contiguous() to BaseMergeProcess save to prevent safetensor…

890bd9f

…s serialization crash with transposed matrices Co-authored-by: Cursor <cursoragent@cursor.com>

perf: O(R^2*D) mathematical optimization for DO-Merge (3000x speedup)…

b5aec50

…. Eliminates massive matrix allocations while keeping 100% precision Co-authored-by: Cursor <cursoragent@cursor.com>

docs: Update Release Notes with O(R^2*D) Exact Math bypass and PEFT p…

a4616cf

…arsing Co-authored-by: Cursor <cursoragent@cursor.com>

Jonathan and others added 5 commits March 6, 2026 17:56

LTX-2.3 throughput profiles, LTX-only soft lock, and promotion harness

1eb2b80

Add RunPod-ready template for BIG DADDY fork

a251dfe

Fix LTX-2.3 official backend loading and sampling

22adefa

Fix missing audio_loss field on training batches

5068dcb

Fix fresh install path for LTX-2.3 backend

8980686

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

AUDIO OPTIMIZATIONS AMONGST OTHER THINGS#720

AUDIO OPTIMIZATIONS AMONGST OTHER THINGS#720
ArtDesignAwesome wants to merge 19 commits intoostris:mainfrom
ArtDesignAwesome:main

ArtDesignAwesome commented Feb 21, 2026 •

edited

Loading

Uh oh!

sintspiden commented Mar 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

ArtDesignAwesome commented Feb 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

LTX-2 Voice Training Fix for Ostris AI-Toolkit

The Problem

What Was Wrong (The Big Ones)

1. Audio and video shared the same timestep during training

2. Audio was never actually extracted from your videos

3. Stale latent cache had no audio in it

4. Video loss drowned out audio loss

5. Audio-free batches caused voice forgetting

6. DoRA and quantized models crashed immediately

7. Auto-balance audio loss was broken — could only boost, never dampen

8. Multiple config/runtime crashes on LTX-2

What We Added

Core Audio Fixes

Quality Improvements

Compatibility Fixes

Installation

Option 1: Copy Modified Files (Recommended)

Option 2: Apply Patches

Important: Delete Your Latent Cache

Recommended Config

LoRA (Recommended — Fast + High Quality)

Layer Offloading (If You Need It)

DoRA Variant (Higher Quality, Slower, More VRAM)

How to Verify Audio Is Training

Performance Expectations

Dataset Tips

FAQ

Files Modified (16 files)

Credits

Uh oh!

sintspiden commented Mar 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ArtDesignAwesome commented Feb 21, 2026 •

edited

Loading