Add WAV/MP3 input with automatic 48 kHz resampling and stereo upmix #15

Closed
Copilot wants to merge 8 commits into master from copilot/add-wav-mp3-conversion

Conversation

Copilot AI commented Mar 7, 2026

The --src-audio (cover mode) and neural-codec --encode paths only accepted WAV at exactly 48 kHz. This adds transparent WAV + MP3 support at any sample rate, auto-resampled to 48 kHz and always delivered as stereo — exactly what the VAE encoder requires — with no ffmpeg pre-conversion needed.

New: src/audio.h

Single header providing read_audio(path, T_audio, n_channels):

  • Format detected by extension: .mp3 → dr_mp3, anything else → dr_wav
  • Linear resampler (audio_resample_linear) is channel-agnostic; only runs when sr ≠ 48000
  • Always returns interleaved stereo [T × 2] — mono input is upmixed (L = R), N-channel input uses the first two channels; *n_channels is always 2 on success
  • Uses free() to release dr_libs buffers (both dr_wav and dr_mp3 use the system allocator by default)
  • Returns malloc'd buffer; caller frees
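
The channel-agnostic linear resampling described above can be sketched as follows. This is a minimal illustration, not the code from src/audio.h: the function name resample_linear_sketch, its signature, and the use of std::vector are assumptions made for the example, since the PR text does not show how audio_resample_linear is declared.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the PR's audio_resample_linear (real signature not shown).
// Input:  interleaved frames [n_in x channels] at src_rate.
// Output: interleaved frames [n_out x channels] at dst_rate.
std::vector<float> resample_linear_sketch(const std::vector<float>& in, int channels,
                                          int src_rate, int dst_rate) {
    if (channels <= 0 || src_rate <= 0 || dst_rate <= 0) return {};
    const std::size_t n_in = in.size() / (std::size_t) channels;
    if (n_in == 0) return {};
    if (src_rate == dst_rate) return in;  // nothing to do, mirrors "only runs when sr != 48000"

    const std::size_t n_out = (std::size_t) ((double) n_in * dst_rate / src_rate);
    const double step = (double) src_rate / dst_rate;  // source frames per output frame
    std::vector<float> out(n_out * (std::size_t) channels);

    for (std::size_t t = 0; t < n_out; t++) {
        const double pos = (double) t * step;
        std::size_t i0 = (std::size_t) pos;
        if (i0 > n_in - 1) i0 = n_in - 1;                          // clamp at the last frame
        const std::size_t i1 = (i0 + 1 < n_in) ? i0 + 1 : i0;      // hold the last frame at the end
        const float frac = (float) (pos - (double) i0);
        // Each channel is interpolated independently, so the channel count does not matter.
        for (int c = 0; c < channels; c++) {
            const float a = in[i0 * channels + c];
            const float b = in[i1 * channels + c];
            out[t * channels + c] = a + (b - a) * frac;
        }
    }
    return out;
}
```

Linear interpolation is cheap and has no dependencies, but it does not low-pass filter, so downsampling content with energy above the new Nyquist frequency can alias.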

New: thirdparty/

  • dr_wav.h v0.14.5 — WAV decode (public domain / MIT-0, mackron/dr_libs)
  • dr_mp3.h v0.7.3 — MP3 decode via minimp3 (public domain / MIT-0)

Zero new link-time dependencies — both are single-header, included once per translation unit via #define DR_*_IMPLEMENTATION inside audio.h.

Tool changes

  • neural-codec.cpp: encode path switches read_wav() → read_audio(); help text updated
  • dit-vae.cpp: --src-audio switches to read_audio(); help text updated
  • CMakeLists.txt: thirdparty/ added as SYSTEM include in the shared link_ggml_backends macro (vendor warnings suppressed)
  • No existing source files (vae-enc.h etc.) were modified

Example

# Before: required exact 48 kHz stereo WAV, manual ffmpeg conversion otherwise
# After: any of these just work
./dit-vae --src-audio reference.mp3   ...
./dit-vae --src-audio reference.wav   ...  # any sample rate, mono or stereo
./neural-codec --vae vae.gguf --encode -i song.mp3 -o song.latent
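
How paths like the ones above would be routed by extension can be sketched as follows. The helper name pick_decoder and the AudioDecoder enum are hypothetical, and the case-insensitive match is an assumption; the PR only states that .mp3 goes to dr_mp3 and anything else to dr_wav.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

enum class AudioDecoder { Wav, Mp3 };  // hypothetical names for this sketch

// ".mp3" selects the MP3 decoder; any other extension, or none, falls back to WAV.
AudioDecoder pick_decoder(const std::string& path) {
    const std::string::size_type dot   = path.find_last_of('.');
    const std::string::size_type slash = path.find_last_of("/\\");
    // A dot that belongs to a directory name is not a file extension.
    if (dot == std::string::npos || (slash != std::string::npos && dot < slash)) {
        return AudioDecoder::Wav;
    }
    std::string ext = path.substr(dot);
    // Assumption: treat ".MP3" like ".mp3".
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char ch) { return (char) std::tolower(ch); });
    return ext == ".mp3" ? AudioDecoder::Mp3 : AudioDecoder::Wav;
}
```

Detecting by extension keeps the header small, at the cost of misrouting a file whose extension does not match its contents.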

New example

examples/cover.sh + examples/cover.json — demonstrates cover-mode generation from a WAV or MP3 reference track with inline usage notes.


Copilot AI and others added 2 commits March 7, 2026 19:02
…ols and cover example

Co-authored-by: lmangani <1423657+lmangani@users.noreply.github.com>
Co-authored-by: lmangani <1423657+lmangani@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add support for WAV and MP3 input conversion to 48kHz" to "Add WAV/MP3 input with automatic 48 kHz resampling" Mar 7, 2026
lmangani added 3 commits March 7, 2026 20:08
Updated the cover script to remove LLM step and clarify audio processing.
Updated the cover JSON to reflect new attributes and changes.
lmangani commented Mar 8, 2026

@copilot review the following report. remember we should convert stereo->stereo and mono->stereo and NOT touch the acestep-cpp existing files which work when invoked manually.

Known issues in src/audio.h (pending upstream fix)

Bug: mono audio is not upmixed to stereo before encoding

read_audio() returns a native-channel-count buffer ([T x n_channels] floats),
but vae_enc_compute() in vae-enc.h always reads two channels:

// vae-enc.h (hardcodes stereo access — UB when n_channels == 1)
for (int c = 0; c < 2; c++) {
    for (int t = 0; t < T_audio; t++) {
        m->scratch_in[c * T_audio + t] = audio[t * 2 + c];
    }
}

For stereo inputs (most user uploads) this works correctly.
For mono inputs the second channel index reads out-of-bounds memory.
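
The out-of-bounds read can be checked with simple index arithmetic. The sketch below uses hypothetical helper names and only models the index expression audio[t * 2 + c] from the loop above; it does not touch vae-enc.h.

```cpp
#include <cstddef>

// Highest index read by the stereo deinterleave loop: t = T_audio - 1, c = 1.
constexpr std::size_t max_index_read(std::size_t T_audio) {
    return T_audio == 0 ? 0 : (T_audio - 1) * 2 + 1;
}

// Number of floats in an interleaved buffer [T_audio x n_channels].
constexpr std::size_t buffer_len(std::size_t T_audio, int n_channels) {
    return T_audio * (std::size_t) n_channels;
}

// True when every read of the stereo loop stays inside the buffer.
constexpr bool stereo_read_in_bounds(std::size_t T_audio, int n_channels) {
    return T_audio == 0 || max_index_read(T_audio) < buffer_len(T_audio, n_channels);
}
```

For one second of 48 kHz audio the loop reads up to index 95999, which fits a stereo buffer of 96000 floats but runs past a mono buffer of 48000 floats.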

Required fix in src/audio.h — always return interleaved stereo [T x 2].
Add this block after resampling completes (before the final return out):

// Upmix mono -> stereo, or use first two channels of N-ch audio.
if ((int) channels != 2) {
    int    n_ch_src = (int) channels;
    float *stereo   = (float *) malloc((size_t) T_raw * 2 * sizeof(float));
    if (!stereo) {
        fprintf(stderr, "[Audio] Out of memory converting to stereo\n");
        free(out);
        return NULL;
    }
    for (int t = 0; t < T_raw; t++) {
        float L = out[(size_t) t * n_ch_src + 0];
        float R = (n_ch_src > 1) ? out[(size_t) t * n_ch_src + 1] : L;
        stereo[t * 2 + 0] = L;
        stereo[t * 2 + 1] = R;
    }
    free(out);
    out = stereo;
    fprintf(stderr, "[Audio] Converted %dch -> stereo\n", n_ch_src);
}
*n_channels = 2;
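
The same upmix logic can be exercised in isolation. The sketch below is an assumption-labelled variant that uses std::vector instead of malloc/free so it can be tested without the decoder; the name to_stereo_sketch is hypothetical and does not appear in the PR.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper mirroring the report's upmix block.
// Input:  interleaved frames [T x channels].
// Output: interleaved stereo [T x 2]; mono is duplicated (L = R),
//         inputs with more than two channels keep only the first two.
std::vector<float> to_stereo_sketch(const std::vector<float>& in, int channels) {
    if (channels <= 0) return {};
    if (channels == 2) return in;  // already stereo, nothing to convert
    const std::size_t frames = in.size() / (std::size_t) channels;
    std::vector<float> out(frames * 2);
    for (std::size_t t = 0; t < frames; t++) {
        const float L = in[t * channels + 0];
        const float R = (channels > 1) ? in[t * channels + 1] : L;
        out[t * 2 + 0] = L;
        out[t * 2 + 1] = R;
    }
    return out;
}
```

Running the conversion after resampling, as the report suggests, means a mono file is interpolated once per frame rather than twice.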

Also: replace drwav_free(raw, NULL) with free(raw) in the resampling
branch — both dr_wav and dr_mp3 use the system allocator by default, so
free() is always safe regardless of which decoder produced the buffer.

Co-authored-by: lmangani <1423657+lmangani@users.noreply.github.com>
Author

Copilot AI commented Mar 8, 2026

@copilot review the following report. remember we should convert stereo->stereo and mono->stereo and NOT touch the acestep-cpp existing files which work when invoked manually.

##...

Fixed in 7ea3602.

read_audio() now always returns interleaved stereo [T × 2] and sets *n_channels = 2:

  • Mono → stereo: L = R = channel[0] (upmix after resampling, so both resampling and upmix work for mono at any rate)
  • N > 2 channels → stereo: uses first two channels only
  • drwav_free(raw, NULL) → free(raw): both dr_libs decoders use the system allocator by default, so free() is safe for a buffer from either one

No existing files (vae-enc.h etc.) were touched.

Copilot AI changed the title from "Add WAV/MP3 input with automatic 48 kHz resampling" to "Add WAV/MP3 input with automatic 48 kHz resampling and stereo upmix" Mar 8, 2026
lmangani closed this Mar 8, 2026
2 participants