
perf(python): optimize GPU IVF training sampling and in-memory KMeans pipeline#6383

Open
hushengquan wants to merge 3 commits into lance-format:main from hushengquan:feat-model

Conversation

@hushengquan
Contributor

@hushengquan hushengquan commented Apr 2, 2026

Summary

Optimizes the Python GPU IVF centroid training path to eliminate redundant I/O and
align with the efficient Rust CPU training strategy.

Changes

sampler.py

  • Rewrite _efficient_sample: generate n uniformly random indices, sort them,
    and take them in large contiguous chunks (8192 rows per take). Sorting lets
    the object store merge adjacent row reads into fewer, larger range requests,
    drastically reducing S3 I/O latency.
  • Fix batch yield logic: replace the if guard with a while loop that correctly
    slices accumulated rows into batch_size-sized RecordBatches.
  • Simplify maybe_sample: remove the max_takes branching and always delegate to
    _efficient_sample. The max_takes parameter is deprecated (kept for API compat).
  • Inline target_takes in _filtered_efficient_sample: compute it internally
    instead of accepting it as a parameter.
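The sorted, chunked take plus the while-loop batching can be sketched as follows. This is a minimal illustration only: `dataset_take` and the function name are stand-ins for the actual sampler.py internals, not the real API.

```python
import numpy as np

def efficient_sample_sketch(dataset_take, num_rows, n, chunk_size=8192, batch_size=32):
    """Sample n rows by sorting random indices and taking contiguous chunks.

    `dataset_take` stands in for a Lance-style take(indices) call that
    returns the requested rows.
    """
    # Draw n unique indices, then sort them so the object store can merge
    # adjacent row reads into fewer, larger range requests.
    indices = np.sort(np.random.choice(num_rows, size=n, replace=False))

    accumulated = []
    for start in range(0, n, chunk_size):
        chunk = dataset_take(indices[start:start + chunk_size])
        accumulated.extend(chunk)
        # A `while`, not an `if`: one large take may contain several
        # batch_size-sized batches, and all of them must be yielded.
        while len(accumulated) >= batch_size:
            yield accumulated[:batch_size]
            accumulated = accumulated[batch_size:]
    if accumulated:
        yield accumulated
```

With n=1000 and the defaults above, this performs a single 1000-row take and yields 31 full batches of 32 rows plus one remainder batch of 8, all in sorted row order.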

vector.py

  • Single in-memory sampling: Call maybe_sample once, materialize as a numpy array
    (~384 MB for 65536×1536 float32), and reuse for both initial centroid selection and
    KMeans training.
  • Remove TorchDataset + CachedDataset dependency: No more disk-based IPC
    caching. Pass torch.Tensor directly to KMeans.fit(), which wraps it in a
    TensorDataset for pure in-memory iteration.
  • Random centroid init from memory: np.random.choice on the in-memory array
    instead of sampling another k rows from disk.
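The single-materialization pattern can be sketched in plain numpy. Note the hedge: the actual path hands a torch.Tensor to KMeans.fit(), so the Lloyd iterations below only stand in for the real GPU training loop, and the function name is hypothetical.

```python
import numpy as np

def train_centroids_in_memory(samples: np.ndarray, k: int, max_iters: int = 50) -> np.ndarray:
    """One materialized sample array is reused for both centroid init and
    every KMeans iteration, so no epoch touches disk or S3 again."""
    # Random centroid init straight from the in-memory array,
    # instead of sampling another k rows from the dataset.
    init_idx = np.random.choice(len(samples), size=k, replace=False)
    centroids = samples[init_idx].astype(np.float64)
    for _ in range(max_iters):
        # Squared L2 distances via ||x||^2 - 2 x.c^T + ||c||^2,
        # avoiding a huge (n, k, dim) broadcast.
        x2 = (samples ** 2).sum(axis=1, keepdims=True)
        c2 = (centroids ** 2).sum(axis=1)
        assign = (x2 - 2.0 * samples @ centroids.T + c2).argmin(axis=1)
        # Recompute means; empty clusters keep their previous centroid.
        for c in range(k):
            members = samples[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids
```

The point of the sketch is the data flow, not the clustering math: `samples` is read from storage exactly once, and every subsequent step (init and all iterations) operates on that array.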

Performance Impact

Benchmarked on 5M rows × 1536-dim float32, S3-backed dataset, k=256,
sample_rate=256, max_iters=50, GPU=Apple MPS.

Before (double sampling + disk cache + 2048 small random reads)

| Phase | Time | Notes |
| --- | --- | --- |
| 1st sampling (init centroids) | 2.4s | Sample 65536 rows, only use 256 |
| 2nd sampling, epoch 0 | 226s | 2048 small takes (32 rows each), random I/O |
| KMeans epoch 0 compute | 5s | GPU compute after data loaded |
| 2nd sampling, epoch 1 (caching) | 141s | Re-read all data, write IPC disk cache |
| KMeans epochs 1-21 (cached) | 3s | Subsequent epochs read from disk cache |
| Total | 379s | |

After (single sampling, sorted chunked reads, in-memory training)

| Phase | Time | Notes |
| --- | --- | --- |
| Sampling (once) | 62.4s | 8 sorted chunks (8192 rows each) from S3 |
| Load into memory | 0.3s | 384 MB numpy array |
| KMeans epochs 0-21 (in-memory) | 1.8s | Pure in-memory Tensor, zero I/O per epoch |
| Total | 65.6s | |

Summary

  • ~5.8× faster end-to-end IVF training (379s → 65.6s)
  • Sampling I/O reduced from ~370s (2 × 2048 small random reads) to 62s
    (1 × 8 large sorted chunked reads)
  • KMeans training reduced from ~8s + disk overhead to 1.8s (pure in-memory)
  • Eliminated disk IPC cache (CachedDataset) entirely
  • Eliminated per-epoch Arrow → numpy → Tensor conversion overhead
  • Memory: 384 MB bounded (k × sample_rate × dim × 4B), matches Rust CPU path
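The 384 MB bound stated above is easy to verify from the benchmark parameters (k=256, sample_rate=256, dim=1536, float32):

```python
k, sample_rate, dim = 256, 256, 1536
rows = k * sample_rate               # 65536 sampled rows
bytes_total = rows * dim * 4         # float32 = 4 bytes per element
print(rows, bytes_total / 2**20)     # 65536 384.0  (i.e. 384 MiB)
```

This matches both the "65536×1536 float32 (~384 MB)" figure in the vector.py changes and the "Load into memory" row of the benchmark table.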

@hushengquan hushengquan changed the title perf: align Python _efficient_sample with Rust sorted-index take strategy perf(python): optimize GPU IVF training sampling and in-memory KMeans pipeline Apr 2, 2026
@hushengquan
Contributor Author

@Xuanwo Hi, could you please take a look at this PR? We've recently been using GPUs for pre-training and noticed a large volume of small I/O operations, resulting in poor performance.

@Xuanwo
Collaborator

Xuanwo commented Apr 3, 2026

Thank you for this work! I'm working on the GPU side too, will take a look.

@hushengquan
Contributor Author

hushengquan commented Apr 3, 2026

> Thank you for this work! I'm working on the GPU side too, will take a look.

Thank you! I have also observed that create_index is currently quite slow when training PQ models on the GPU, taking significantly longer than training on the CPU. If possible, I would be happy to continue optimizing this area.

