AUDIO OPTIMIZATIONS AMONGST OTHER THINGS#720
Open
ArtDesignAwesome wants to merge 19 commits into ostris:main from
Conversation
Voice training for LTX-2 character LoRAs was completely broken. Audio was silently dropped during extraction (torchaudio/torchcodec failure on Windows/Pinokio), cached latents had no audio data, video and audio shared a single timestep despite the transformer having independent processing paths, and video loss drowned out audio gradients entirely.

Core fixes:
- Independent audio timestep sampling (decoupled audio/video noise schedules)
- Multi-fallback audio extraction (torchaudio → PyAV → ffmpeg CLI)
- Latent cache audio validation and automatic invalidation
- EMA-based dynamic audio loss balancing
- Voice preservation regularizer for audio-free batches
- Connector gradient unfreezing for text-to-audio adaptation

Compatibility fixes:
- DoRA + TorchAO quantization (qfloat8) fully working
- Layer offloading dtype enforcement (eliminates mat dtype crashes)
- SDPA attention mask dtype safety net (patches torch.nn.functional.scaled_dot_product_attention)
- Safe train_config access (model vs trainer context)
- Min-SNR guard for flow-matching schedulers
- Precise quantization type checks (fixes PyTorch 2.9+ dequantize backward crash)
- Video-safe noise offset (5D tensor support)
- torch.compile targeting the transformer instead of unet

Quality/UX improvements:
- Rank/module dropout and gradient checkpointing for LTX-2
- DoRA and LoKr network type support
- Higher default rank (32), LR schedulers, caption dropout
- Audio loss logging, strict audio mode
- Full UI integration for all new fields

16 files changed across toolkit core, LTX-2 model, SDTrainer, and UI. Zero new dependencies. Fully backward compatible.
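The multi-fallback extraction can be sketched roughly as below. This is an illustrative chain, not the PR's actual code: the loader names and error handling here are hypothetical, and only the backend order (torchaudio, then PyAV, then the ffmpeg CLI) comes from the description above.

```python
# Illustrative torchaudio -> PyAV -> ffmpeg-CLI fallback chain (hypothetical names).
import os
import subprocess
import tempfile


def _try_torchaudio(path):
    import torchaudio  # may fail on Windows/Pinokio (missing torchcodec/FFmpeg libs)
    wav, sr = torchaudio.load(path)
    return wav.numpy(), sr


def _try_pyav(path):
    import av  # PyAV bundles its own FFmpeg
    import numpy as np
    with av.open(path) as container:
        stream = container.streams.audio[0]
        frames = [f.to_ndarray() for f in container.decode(stream)]
        return np.concatenate(frames, axis=-1), stream.rate


def _try_ffmpeg_cli(path):
    import wave
    import numpy as np
    with tempfile.TemporaryDirectory() as td:
        out = os.path.join(td, "audio.wav")
        subprocess.run(
            ["ffmpeg", "-y", "-i", path, "-vn", "-acodec", "pcm_s16le", out],
            check=True, capture_output=True,
        )
        with wave.open(out, "rb") as w:
            pcm = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
            return pcm.astype(np.float32) / 32768.0, w.getframerate()


def load_audio(path, loaders=(_try_torchaudio, _try_pyav, _try_ffmpeg_cli)):
    """Return (samples, sample_rate) from the first backend that succeeds."""
    errors = []
    for loader in loaders:
        try:
            return loader(path)
        except Exception as exc:  # record the failure, fall through to next backend
            errors.append(f"{loader.__name__}: {exc}")
    raise RuntimeError("all audio backends failed:\n" + "\n".join(errors))
```

The key property is that a backend failure is recorded rather than silently swallowed, and only the final all-backends-failed case raises.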
dyn_mult was stuck at 1.00 because audio loss naturally exceeds video loss on LTX-2. The old clamp max(1.0, ...) prevented dampening. Now max(0.05, ...) allows the multiplier to scale audio down when it dominates.
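A minimal sketch of that clamp change (function name and target ratio are illustrative; the real balancer also applies EMA smoothing to the losses):

```python
def audio_loss_multiplier(video_loss, audio_loss, target_ratio=0.5,
                          floor=0.05, ceil=10.0):
    """Scale audio loss toward `target_ratio` of video loss.

    Illustrative sketch: the old clamp used floor=1.0, so the multiplier
    could only boost audio. With LTX-2's audio loss roughly 2x the video
    loss, it was pinned at 1.00. A floor of 0.05 lets it dampen as well.
    """
    raw = target_ratio * video_loss / max(audio_loss, 1e-8)
    return min(ceil, max(floor, raw))


# Audio loss dominant (the common LTX-2 case): the multiplier now dampens it,
# where the old floor of 1.0 would have clamped it straight back to 1.0.
print(audio_loss_multiplier(1.0, 2.0))
```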
This major commit completely overhauls two critical components of the repository:

1. AUDIO TRAINING EXCELLENCE:
- Integrated full VAE encoding/decoding and log-mel spectrogram conversion for LTX-2.
- Processes temporal, audio, and spatial latents natively through the DiT architecture.
- Introduced `ComboVae` and `AudioProcessor` for end-to-end audio embedding.
- Exposed `audio_a2v_cross_attn` blocks for precise sound/voice fine-tuning.

2. OMNI-MERGE (DO-MERGE 2026 FRAMEWORK):
- Eliminated catastrophic character/concept bleeding with Bilateral Subspace Orthogonalization (BSO).
- Dynamically isolates prompt triggers, video (motion) pathways, and audio signatures mathematically.
- Deployed Magnitude/Direction Decoupling (DO-Merge) for structural MLP layers, preventing "loud" LoRAs from crushing weaker ones.
- Exact rank retention (`Rank A + Rank B` concatenation) without lossy SVD.
- Overhauled the Next.js Merge UI to natively trigger the DO-Merge backend with live `.status` polling, bypassing the legacy Prisma queue entirely.

Added a comprehensive `RELEASE_NOTES_v1.0_LTX2_OMNI_AUDIO.md` documenting the new architecture.

Co-authored-by: Cursor <cursoragent@cursor.com>
I can confirm this does work for training audio: https://streamable.com/8nh2cw
I was getting chewed out by the community for creating my own fork, no disrespect meant. Here are my changes:
LTX-2 Voice Training Fix for Ostris AI-Toolkit
Voice training for LTX-2 character LoRAs is broken out of the box. This patch fixes it.
If you've tried training an LTX-2 character LoRA and your output has garbled, silent, or completely wrong audio — this is why, and this is the fix.
The Problem
LTX-2 is a joint audio+video diffusion transformer. When you train a character LoRA, the model should learn both the person's appearance AND their voice. In practice, every single person training LTX-2 character LoRAs in ostris/ai-toolkit gets broken audio. The LoRA produces correct visuals but the voice is destroyed.

This isn't a settings issue. There are 25 bugs and design flaws in the training pipeline that collectively make voice training impossible.
What Was Wrong (The Big Ones)
1. Audio and video shared the same timestep during training
LTX-2's transformer has completely separate timestep processing for audio (`audio_adaln_single`) and video (`adaln_single`). The training code sampled ONE random timestep and fed it to BOTH. This means audio never explored its own noise schedule — voice learning was fundamentally broken at the architecture level.

Fix: Independent audio timestep sampling. Each training step now samples a separate random timestep for audio. Zero computational cost. Massive quality impact.
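The fix boils down to one extra draw per step. A minimal sketch (illustrative function, not the PR's exact code):

```python
import torch


def sample_timesteps(batch_size, num_train_timesteps=1000,
                     independent_audio_timestep=True, generator=None):
    """Sample per-stream training timesteps (illustrative sketch)."""
    t_video = torch.randint(0, num_train_timesteps, (batch_size,),
                            generator=generator)
    if independent_audio_timestep:
        # New behavior: audio explores its own noise schedule.
        t_audio = torch.randint(0, num_train_timesteps, (batch_size,),
                                generator=generator)
    else:
        # Old behavior: audio locked to the video timestep.
        t_audio = t_video
    return t_video, t_audio
```

Each timestep then feeds its own adaln path (`audio_adaln_single` for `t_audio`, `adaln_single` for `t_video`), which is why the change costs nothing at runtime.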
2. Audio was never actually extracted from your videos
On Windows and Pinokio installs, `torchaudio.load()` is completely broken. The newer versions use `torchcodec` as a backend, which requires FFmpeg shared libraries that Pinokio doesn't install. The error was silently caught, so training appeared to work but every single video had its audio silently dropped.

Fix: Multi-fallback audio extraction that tries torchaudio first, then PyAV (which bundles its own FFmpeg and is already in `requirements.txt`), then ffmpeg CLI as a last resort. Audio extraction now works on every platform.

3. Stale latent cache had no audio in it

If you ever ran training before this fix, your latent cache files don't contain `audio_latent`. The cache loader only checked if the file existed, not whether it contained audio. So even after fixing audio extraction, the old cache was still being used — with no audio.

Fix: Cache validation now checks for the `audio_latent` key inside safetensors files. Missing audio = cache invalidated and re-encoded.

4. Video loss drowned out audio loss
Video loss magnitude is much larger than audio loss. Without balancing, the optimizer effectively ignores audio gradients entirely.
Fix: EMA-based dynamic audio loss balancing that targets audio at ~33% of total loss. Adapts automatically over training.
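An EMA-based balancer of this kind can be sketched as follows. The class name, decay constant, and clamp bounds here are illustrative assumptions, not the toolkit's actual values; the 0.33 target comes from the description above.

```python
class AudioLossBalancer:
    """Track EMAs of video/audio loss and return a multiplier that keeps
    audio at roughly `target` of the total loss. Illustrative sketch."""

    def __init__(self, target=0.33, decay=0.99, floor=0.05, ceil=10.0):
        self.target, self.decay = target, decay
        self.floor, self.ceil = floor, ceil
        self.video_ema = None
        self.audio_ema = None

    def update(self, video_loss, audio_loss):
        if self.video_ema is None:
            self.video_ema, self.audio_ema = video_loss, audio_loss
        else:
            self.video_ema = self.decay * self.video_ema + (1 - self.decay) * video_loss
            self.audio_ema = self.decay * self.audio_ema + (1 - self.decay) * audio_loss
        # Want: mult * audio_ema == target * (video_ema + mult * audio_ema)
        # =>    mult = target / (1 - target) * video_ema / audio_ema
        raw = self.target / (1 - self.target) * self.video_ema / max(self.audio_ema, 1e-8)
        return min(self.ceil, max(self.floor, raw))
```

Because the EMAs adapt, the multiplier keeps tracking the target ratio as both losses drift over training.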
5. Audio-free batches caused voice forgetting
When a batch had no audio (e.g., image-only batches in mixed datasets), zero audio loss was computed. This let the LoRA drift away from the base model's voice generation ability — catastrophic forgetting.
Fix: Synthetic silence regularizer for audio-free batches. Forces the LoRA to preserve base audio behavior even on batches without audio data.
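The regularizer idea can be sketched like this. Note that `set_lora_enabled` and `predict_audio` are hypothetical hooks used for illustration, not the toolkit's real API:

```python
import torch
import torch.nn.functional as F


def voice_preservation_loss(model, audio_latent_shape, cond):
    """Regularizer for audio-free batches (illustrative sketch).

    Push the adapted model's audio prediction on a synthetic silence latent
    toward the frozen base model's prediction, so the LoRA cannot drift away
    from the base voice on batches that carry no audio.
    """
    silence = torch.zeros(audio_latent_shape)
    with torch.no_grad():
        model.set_lora_enabled(False)       # base model's audio behavior
        base_pred = model.predict_audio(silence, cond)
        model.set_lora_enabled(True)
    lora_pred = model.predict_audio(silence, cond)  # gradients flow here
    return F.mse_loss(lora_pred, base_pred)
```

Only the LoRA-enabled forward pass carries gradients, so minimizing this loss pulls the adapter toward reproducing the base model's audio output rather than toward zero.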
6. DoRA and quantized models crashed immediately
Using DoRA (weight-decomposed LoRA) with quantized models (
qfloat8) caused crashes fromAffineQuantizedTensordispatch errors, dtype mismatches in SDPA attention, andderivative for dequantize is not implementederrors in the backward pass. The root causes were spread across 6 different files.Fix: Precise quantization type detection, safe forward wrappers, dtype enforcement in the memory manager, and SDPA attention mask safety nets. DoRA + quantization + layer offloading now works end to end.
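The "precise quantization type detection" idea can be illustrated with a minimal predicate; the real toolkit check may match additional torchao tensor subclasses:

```python
def is_affine_quantized(tensor) -> bool:
    """Exact type-name check (illustrative sketch). Broad duck-typing
    probes also matched plain tensors on PyTorch 2.9+, routing them
    through dequantize(), whose backward is not implemented, which is
    what crashed the backward pass."""
    return type(tensor).__name__ == "AffineQuantizedTensor"
```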
7. Auto-balance audio loss was broken — could only boost, never dampen
The dynamic audio loss balancer was designed to keep audio and video loss in proportion. But the multiplier was clamped at a minimum of 1.0, meaning it could only INCREASE audio loss. In practice, LTX-2's raw audio loss is naturally ~2x larger than video loss. The computed multiplier (~0.18) was always clamped back to 1.0. The feature was dead code — `dyn_mult` showed 1.00 for the entire training run.

Fix: Changed the clamp floor from 1.0 to 0.05. The multiplier now works bidirectionally — dampening audio when it's already dominant (common case), boosting when it's too small. `dyn_mult` actively adjusts throughout training.

8. Multiple config/runtime crashes on LTX-2
- `self.train_config` accessed on the model object instead of the trainer — crash on step 0
- `min_snr_gamma` incompatible with flow-matching schedulers — crash on loss calculation
- `print_and_status_update` called on wrong object — crash on audio logging
- `noise_offset` rejected 5D video tensors — crash when enabled
- `torch.compile` pointed at `unet` instead of the transformer — no effect on DiT models

All fixed.
What We Added
Core Audio Fixes
Quality Improvements
Compatibility Fixes
- DoRA + TorchAO quantization (`qfloat8`) fully working
- `train_config` access made safe for model vs trainer context
- Precise quantization type checks (fixes the PyTorch 2.9+ `dequantize` issue)

Installation
Option 1: Copy Modified Files (Recommended)
Copy these files from the release into your ai-toolkit installation, replacing the originals:

Option 2: Apply Patches
Patch files are included in the release zip for those who prefer `git apply`.

Important: Delete Your Latent Cache

If you've trained before, your cached latents don't have audio in them. Delete your latent cache folder and let it re-encode. The new code will extract audio properly via PyAV and include `audio_latent` in the cache files.

Recommended Config
LoRA (Recommended — Fast + High Quality)
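The config block itself did not survive the page export. As a rough illustration only, using ai-toolkit's YAML config style with key placement that is partly guessed (the files in the release are authoritative), a setup matching the recommendations in this post might look like:

```yaml
network:
  type: "lora"
  linear: 32          # higher default rank recommended by this patch
  linear_alpha: 32
train:
  min_snr_gamma: 0          # must stay 0 for LTX-2's flow-matching scheduler
  timestep_type: "sigmoid"  # provides equivalent loss balancing instead
  independent_audio_timestep: true  # the key audio fix; leave enabled
datasets:
  - folder_path: "/path/to/videos"
    do_audio: true          # required for voice training
```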
Layer Offloading (If You Need It)
If you're running out of VRAM, add to the model section:
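The original snippet was stripped from this page. An illustrative sketch with hypothetical key names (check the release files for the real ones):

```yaml
model:
  layer_offloading: true
  layer_offloading_percent: 0.56  # hypothetical key: fraction of transformer layers offloaded to CPU
```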
This offloads 56% of transformer layers to CPU. Expect ~20s/it on RTX 5090 with LoRA. Increase the percentage if you still OOM, decrease for speed.
Note: `torch.compile` is incompatible with `layer_offloading`. Don't use both.

DoRA Variant (Higher Quality, Slower, More VRAM)
Replace the network section:
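The snippet was stripped from this page. An illustrative sketch only; whether DoRA is selected via the network `type` or a separate flag is an assumption here, so defer to the release files:

```yaml
network:
  type: "dora"      # hypothetical type name; the patch adds DoRA network support
  linear: 32
  linear_alpha: 32
```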
DoRA decomposes weight updates into magnitude and direction components, which can produce higher quality results. However, it requires significantly more VRAM (you'll need higher layer offloading %, which slows training). For most users, LoRA rank 32 is the sweet spot.
How to Verify Audio Is Training
During training, look for this line in your console output:
If you see it, audio loss is actively being computed and your character's voice is being learned. If you don't see it, something is wrong with your audio pipeline — check that:
- `do_audio: true` is set in your dataset config

After caching, you should see:
(Where 24 is your number of video clips.)
Performance Expectations
Training speed is dominated by layer offloading (CPU-GPU memory transfer over PCIe). Reduce offloading percentage for speed, increase for VRAM savings.
Dataset Tips
- Use a rare trigger token (e.g. `ohwx`) that doesn't conflict with the base model vocabulary
- `flip_x: true` doubles your effective dataset size (don't use for text-heavy content)

FAQ
Q: Do I need to do anything special for audio?
A: Just set `do_audio: true` in your dataset config and make sure your video files have audio tracks. Everything else is automatic.

Q: Can I use my existing video dataset?
A: Yes, as long as the videos have audio. Delete your old latent cache first so it re-encodes with audio.
Q: LoRA or DoRA?
A: LoRA rank 32 for most users. It's 3x faster and uses significantly less VRAM. DoRA may produce marginally higher quality but requires much more memory and time.
Q: What about LoKr?
A: Supported but less tested with the audio fixes. LoRA is recommended.
Q: My training shows 0 audio loss / no audio line in logs?
A: Your audio isn't being extracted. Delete your latent cache, confirm your videos have audio, and confirm `do_audio: true`.

Q: Can I use torch.compile?

A: Only if you're NOT using `layer_offloading`. They're mutually exclusive due to how layer offloading mutates GPU buffers during forward passes.

Q: What's `independent_audio_timestep`?

A: The single most important fix. LTX-2's transformer processes audio and video noise schedules independently, but the training code was feeding the same random timestep to both. This decouples them so audio can learn at its own optimal noise level. Always leave this `true`.

Q: Why is `min_snr_gamma` set to 0?

A: Min-SNR loss weighting requires `alphas_cumprod` from DDPM-style schedulers. LTX-2 uses a flow-matching scheduler that doesn't have this. Setting it to anything > 0 would crash. The `timestep_type: sigmoid` or `weighted` setting provides equivalent loss balancing for flow-matching models.

Files Modified (16 files)
- toolkit/config_modules.py
- toolkit/train_tools.py
- toolkit/dataloader_mixins.py
- toolkit/data_transfer_object/data_loader.py
- toolkit/models/DoRA.py
- toolkit/network_mixins.py
- toolkit/memory_management/manager_modules.py
- extensions_built_in/diffusion_models/ltx2/ltx2.py
- extensions_built_in/sd_trainer/SDTrainer.py
- jobs/process/BaseSDTrainProcess.py
- ui/src/types.ts
- ui/src/app/jobs/new/options.ts
- ui/src/app/jobs/new/SimpleJob.tsx
- ui/src/docs.tsx

Credits
Built on top of Ostris AI-Toolkit. All changes are backward compatible — old configs without new keys work identically to before.
25 bugs identified and fixed. Zero new dependencies added. All features use existing PyTorch, torchaudio, diffusers, and PyAV APIs.