
perf(python): optimize GPU IVF training sampling and in-memory KMeans pipeline#6383

Open
hushengquan wants to merge 3 commits into lance-format:main from hushengquan:feat-model

Conversation

@hushengquan
Contributor

@hushengquan hushengquan commented Apr 2, 2026

Summary

Optimizes the Python GPU IVF centroid training path to eliminate redundant I/O and
align with the efficient Rust CPU training strategy.

Changes

sampler.py

  • Rewrite _efficient_sample: generate n uniformly random indices, sort them,
    and take them in large contiguous chunks (8192 rows per take). Sorting lets
    the object store merge adjacent row reads into fewer, larger range requests,
    drastically reducing S3 I/O latency.
  • Fix batch yield logic: replace the if guard with a while loop that correctly
    slices accumulated rows into batch_size-sized RecordBatches.
  • Simplify maybe_sample: remove the max_takes branching and always delegate to
    _efficient_sample. The max_takes parameter is deprecated (kept for API compat).
  • Inline target_takes in _filtered_efficient_sample: compute it internally
    instead of accepting it as a parameter.
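The sorted, chunked take plus the while-loop batching can be sketched as follows. This is a minimal illustration only: `dataset_take` and the function name are stand-ins for the actual sampler.py internals, not the real API.

```python
import numpy as np

def efficient_sample_sketch(dataset_take, num_rows, n, chunk_size=8192, batch_size=32):
    """Sample n rows by sorting random indices and taking contiguous chunks.

    `dataset_take` stands in for a Lance-style take(indices) call that
    returns the requested rows.
    """
    # Draw n unique indices, then sort them so the object store can merge
    # adjacent row reads into fewer, larger range requests.
    indices = np.sort(np.random.choice(num_rows, size=n, replace=False))

    accumulated = []
    for start in range(0, n, chunk_size):
        chunk = dataset_take(indices[start:start + chunk_size])
        accumulated.extend(chunk)
        # A `while`, not an `if`: one large take may contain several
        # batch_size-sized batches, and all of them must be yielded.
        while len(accumulated) >= batch_size:
            yield accumulated[:batch_size]
            accumulated = accumulated[batch_size:]
    if accumulated:
        yield accumulated
```

With n=1000 and the defaults above, this performs a single 1000-row take and yields 31 full batches of 32 rows plus one remainder batch of 8, all in sorted row order.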

vector.py

  • Single in-memory sampling: Call maybe_sample once, materialize as a numpy array
    (~384 MB for 65536×1536 float32), and reuse for both initial centroid selection and
    KMeans training.
  • Remove TorchDataset + CachedDataset dependency: No more disk-based IPC
    caching. Pass torch.Tensor directly to KMeans.fit(), which wraps it in a
    TensorDataset for pure in-memory iteration.
  • Random centroid init from memory: np.random.choice on the in-memory array
    instead of sampling another k rows from disk.
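The single-materialization pattern can be sketched in plain numpy. Note the hedge: the actual path hands a torch.Tensor to KMeans.fit(), so the Lloyd iterations below only stand in for the real GPU training loop, and the function name is hypothetical.

```python
import numpy as np

def train_centroids_in_memory(samples: np.ndarray, k: int, max_iters: int = 50) -> np.ndarray:
    """One materialized sample array is reused for both centroid init and
    every KMeans iteration, so no epoch touches disk or S3 again."""
    # Random centroid init straight from the in-memory array,
    # instead of sampling another k rows from the dataset.
    init_idx = np.random.choice(len(samples), size=k, replace=False)
    centroids = samples[init_idx].astype(np.float64)
    for _ in range(max_iters):
        # Squared L2 distances via ||x||^2 - 2 x.c^T + ||c||^2,
        # avoiding a huge (n, k, dim) broadcast.
        x2 = (samples ** 2).sum(axis=1, keepdims=True)
        c2 = (centroids ** 2).sum(axis=1)
        assign = (x2 - 2.0 * samples @ centroids.T + c2).argmin(axis=1)
        # Recompute means; empty clusters keep their previous centroid.
        for c in range(k):
            members = samples[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids
```

The point of the sketch is the data flow, not the clustering math: `samples` is read from storage exactly once, and every subsequent step (init and all iterations) operates on that array.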

Performance Impact

Benchmarked on 5M rows × 1536-dim float32, S3-backed dataset, k=256,
sample_rate=256, max_iters=50, GPU=Apple MPS.

Before (double sampling + disk cache + 2048 small random reads)

| Phase | Time | Notes |
| --- | --- | --- |
| 1st sampling (init centroids) | 2.4s | Sample 65536 rows, only use 256 |
| 2nd sampling, epoch 0 | 226s | 2048 small takes (32 rows each), random I/O |
| KMeans epoch 0 compute | 5s | GPU compute after data loaded |
| 2nd sampling, epoch 1 (caching) | 141s | Re-read all data, write IPC disk cache |
| KMeans epochs 1-21 (cached) | 3s | Subsequent epochs read from disk cache |
| Total | 379s | |

After (single sampling, sorted chunked reads, in-memory training)

| Phase | Time | Notes |
| --- | --- | --- |
| Sampling (once) | 62.4s | 8 sorted chunks (8192 rows each) from S3 |
| Load into memory | 0.3s | 384 MB numpy array |
| KMeans epochs 0-21 (in-memory) | 1.8s | Pure in-memory Tensor, zero I/O per epoch |
| Total | 65.6s | |

Summary

  • ~5.8× faster end-to-end IVF training (379s → 65.6s)
  • Sampling I/O reduced from ~370s (2 × 2048 small random reads) to 62s
    (1 × 8 large sorted chunked reads)
  • KMeans training reduced from ~8s + disk overhead to 1.8s (pure in-memory)
  • Eliminated disk IPC cache (CachedDataset) entirely
  • Eliminated per-epoch Arrow → numpy → Tensor conversion overhead
  • Memory: 384 MB bounded (k × sample_rate × dim × 4B), matches Rust CPU path
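The 384 MB bound stated above is easy to verify from the benchmark parameters (k=256, sample_rate=256, dim=1536, float32):

```python
k, sample_rate, dim = 256, 256, 1536
rows = k * sample_rate               # 65536 sampled rows
bytes_total = rows * dim * 4         # float32 = 4 bytes per element
print(rows, bytes_total / 2**20)     # 65536 384.0  (i.e. 384 MiB)
```

This matches both the "65536×1536 float32 (~384 MB)" figure in the vector.py changes and the "Load into memory" row of the benchmark table.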

@hushengquan hushengquan changed the title perf: align Python _efficient_sample with Rust sorted-index take strategy perf(python): optimize GPU IVF training sampling and in-memory KMeans pipeline Apr 2, 2026
@hushengquan
Contributor Author

@Xuanwo Hi, could you please take a look at this PR? We've recently been using GPUs for pre-training and noticed a large volume of small I/O operations, resulting in poor performance.

@Xuanwo
Collaborator

Xuanwo commented Apr 3, 2026

Thank you for this work! I'm working on the GPU side too, will take a look.

@hushengquan
Contributor Author

hushengquan commented Apr 3, 2026

> Thank you for this work! I'm working on the GPU side too, will take a look.

Thank you! I have also observed that create_index is currently quite slow when training PQ models on the GPU, taking significantly longer than training on the CPU. If possible, I would be happy to continue optimizing this area.

