MB-71397: Fix CPU OOM during GPU Train by CascadingRadium · Pull Request #76 · blevesearch/faiss

CascadingRadium · 2026-05-13T14:19:37Z

Training large datasets with high-dimensional vectors on GPU caused an out-of-memory
crash due to allocating residuals for the full training set.
The GPU scalar quantizer encoder training now subsamples the input to a bounded number
of vectors, mirroring the existing CPU behaviour.
The encoder training vector limit is propagated correctly when cloning a CPU index to GPU.

Co-authored-by: Copilot <copilot@github.com>

Copilot

Pull request overview

This PR addresses CPU out-of-memory crashes during GPU training of IVF+ScalarQuantizer indexes by introducing training-time subsampling (to avoid allocating full-size residual buffers) and by propagating the encoder-training vector limit when cloning a CPU IVF index to GPU.

Changes:

Add subsampling in GpuIndexIVFScalarQuantizer::trainResiduals_ using fvecs_maybe_subsample.
Introduce GpuIndexIVF::train_encoder_num_vectors() and store an encoder-training vector limit in GpuIndexIVF.
Propagate IndexIVF::train_encoder_num_vectors() from CPU to GPU in GpuIndexIVF::copyFrom.
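
The propagation described in the changes above could look roughly like the following. This is a minimal standalone sketch, not the PR's code: the `CpuIvfSketch`, `CpuIvfSqSketch` and `GpuIvfSketch` types, the `trainEncoderNumVectors_` member name, the `effective_limit` helper and the limit value 100000 are all hypothetical stand-ins chosen for illustration. Only the accessor name `train_encoder_num_vectors()` and the `<= 0` fallback to `1 << 35` come from this page.

```cpp
#include <cassert>
#include <cstdint>

using idx_t = std::int64_t; // 64-bit signed index type, as in faiss

// Hypothetical stand-in for a CPU IVF index.
struct CpuIvfSketch {
    virtual ~CpuIvfSketch() = default;
    // 0 means "no explicit limit" in this sketch.
    virtual idx_t train_encoder_num_vectors() const { return 0; }
};

// Hypothetical CPU index type that bounds its encoder training set.
struct CpuIvfSqSketch : CpuIvfSketch {
    // Illustrative value only.
    idx_t train_encoder_num_vectors() const override { return 100000; }
};

// Hypothetical stand-in for the GPU IVF base class.
struct GpuIvfSketch {
    virtual ~GpuIvfSketch() = default;

    // Copy the CPU-side limit so GPU training can apply the same bound.
    void copyFrom(const CpuIvfSketch& cpu) {
        idx_t limit = cpu.train_encoder_num_vectors();
        assert(limit >= 0);
        trainEncoderNumVectors_ = limit;
    }

    virtual idx_t train_encoder_num_vectors() const {
        return trainEncoderNumVectors_;
    }

protected:
    idx_t trainEncoderNumVectors_ = 0; // hypothetical member name
};

// Same fallback as the CPU training path: a non-positive limit
// is treated as a very large bound.
inline idx_t effective_limit(idx_t max_nt) {
    return max_nt <= 0 ? (idx_t(1) << 35) : max_nt;
}
```

With this shape, a GPU index that was never cloned from a CPU index keeps the default of 0 and falls back to the large bound, while a cloned index inherits whatever the CPU index reports.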

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
`faiss/gpu/GpuIndexIVFScalarQuantizer.cu`	Subsamples training vectors before residual computation to reduce CPU memory usage during GPU training.
`faiss/gpu/GpuIndexIVF.h`	Adds a virtual encoder-training vector limit accessor and stores the propagated limit.
`faiss/gpu/GpuIndexIVF.cu`	Copies the CPU encoder-training vector limit into GPU state and exposes it via a new accessor.

CascadingRadium · 2026-05-13T14:28:30Z

When training on the CPU index we hit this code:

void IndexIVF::train(idx_t n, const float* x) {
    if (verbose) {
        printf("Training level-1 quantizer\n");
    }

    // Train Quantizer
    train_q1(n, x, verbose, metric_type);

    if (verbose) {
        printf("Training IVF residual\n");
    }

    // optional subsampling
    idx_t max_nt = train_encoder_num_vectors();
    if (max_nt <= 0) {
        max_nt = (size_t)1 << 35;
    }

    // Train Residuals
    TransformedVectors tv(
            x, fvecs_maybe_subsample(d, (size_t*)&n, max_nt, x, verbose));

    if (by_residual) {
        std::vector<idx_t> assign(n);
        quantizer->assign(n, tv.x, assign.data());

        std::vector<float> residuals(n * d); // <--- OOM LINE
        quantizer->compute_residual_n(n, tv.x, residuals.data(), assign.data());

        train_encoder(n, residuals.data(), assign.data());
    } else {
        train_encoder(n, tv.x, nullptr);
    }

    is_trained = true;
}

We basically:

Train the quantizer
Train the encoder/residuals

On the GPU side of things, we do not subsample the vectors before computing residuals. As a result, the GPU equivalent of the OOM LINE allocates a host-side residual vector sized to the full training set, which is unbounded.
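
To get a feel for the sizes involved, the residual buffer holds `n * d` 32-bit floats. The helper below is a hypothetical illustration (the name `residual_bytes` and the example numbers are assumptions, not values from this PR) that computes the host memory such a buffer would need.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes needed for a host-side residual buffer of n vectors of
// dimension d stored as 32-bit floats, i.e. the size of a
// std::vector<float>(n * d) like the one on the OOM LINE.
std::uint64_t residual_bytes(std::uint64_t n, std::uint64_t d) {
    // Guard against 64-bit overflow in the multiplication.
    assert(d == 0 || n <= UINT64_MAX / (d * sizeof(float)));
    return n * d * sizeof(float);
}
```

For example, with an illustrative training set of 100,000,000 vectors of dimension 1024, `residual_bytes` gives 409,600,000,000 bytes (about 381 GiB). If the set were capped at 100,000 vectors, the same buffer would need 409,600,000 bytes (about 391 MiB).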

This patch fixes this by mimicking the CPU behaviour: the training vectors are subsampled before the residual vectors are computed.
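
The idea of bounding the training set can be sketched without faiss as follows. This is a simplified stand-in, not the faiss implementation of `fvecs_maybe_subsample` and not the code in this PR: the function name `subsample_rows`, its return type and the default seed are assumptions made for illustration. Unlike this sketch, a real implementation can avoid copying when no subsampling is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: return at most nmax rows of the n x d
// row-major matrix x. If n <= nmax all rows are returned in order;
// otherwise nmax distinct rows are picked at random.
std::vector<float> subsample_rows(
        std::size_t d,
        std::size_t n,
        std::size_t nmax,
        const float* x,
        std::uint64_t seed = 1234) {
    assert(d > 0 && x != nullptr);
    if (n <= nmax) {
        return std::vector<float>(x, x + n * d);
    }

    // Random permutation of row indices; the first nmax are kept.
    std::vector<std::size_t> perm(n);
    std::iota(perm.begin(), perm.end(), std::size_t(0));
    std::mt19937_64 rng(seed);
    std::shuffle(perm.begin(), perm.end(), rng);

    std::vector<float> out(nmax * d);
    for (std::size_t i = 0; i < nmax; ++i) {
        const float* row = x + perm[i] * d;
        std::copy(row, row + d, out.begin() + i * d);
    }
    return out;
}
```

Any residual buffer computed from the returned rows is then bounded by `nmax * d` floats, regardless of how large the original training set is.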

Thejas-bhat

looks good, but can you add the MB associated with this?

Fix CPU OOM during GPU Train

85ef55c

Co-authored-by: Copilot <copilot@github.com>

CascadingRadium requested review from Likith101, Samsonnyyeet, Thejas-bhat, capemox, Copilot, maneuvertomars and steveyen May 13, 2026 14:19

Copilot started reviewing on behalf of CascadingRadium May 13, 2026 14:20

Copilot AI reviewed May 13, 2026

Comment thread faiss/gpu/GpuIndexIVFScalarQuantizer.cu

Thejas-bhat reviewed May 13, 2026

CascadingRadium changed the title ~~Fix CPU OOM during GPU Train~~ MB-71397: Fix CPU OOM during GPU Train May 14, 2026

CascadingRadium wants to merge 1 commit into fixSize from fixOOM2

3 participants