This PR adds a distributed vector index training module that enables IVF-based model training (IVF+PQ and IVF+SQ) across multiple worker nodes. It provides two strategies: a single-node local training path that wraps existing functions, and a distributed sampling path where workers independently sample fragments, compute local SQ bounds, and a master node aggregates the results to train the final models.
Problem
Lance's existing IVF/PQ/SQ training pipeline (build_ivf_model, build_pq_model) operates on a single node that must load all training data into memory. For large-scale datasets, this becomes a bottleneck because a single machine may not have enough memory or compute to sample and train efficiently. There was no mechanism to distribute the sampling and training workload across multiple workers, which is essential for production-scale vector index building.
Approach
The module introduces two training strategies behind a unified TrainedModels output type:
Strategy A (train_local): A thin wrapper that orchestrates existing build_ivf_model and build_pq_model functions on a single node. For SQ, it samples data, optionally normalizes for Cosine metric, builds a ScalarQuantizer, and extracts the global min/max bounds. This provides a clean API for single-node use cases.
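As a rough illustration of the SQ portion of Strategy A, the sketch below normalizes rows to unit length (as done for the Cosine metric) and extracts global min/max bounds while skipping non-finite values. `normalize_rows` and `minmax_bounds` are hypothetical helper names for illustration, not Lance APIs.

```rust
/// Normalize each row of a flat row-major buffer to unit L2 norm.
/// (Illustrative stand-in for the Cosine-metric normalization step.)
fn normalize_rows(data: &mut [f32], dim: usize) {
    for row in data.chunks_mut(dim) {
        let norm = row.iter().map(|v| v * v).sum::<f32>().sqrt();
        if norm > 0.0 {
            row.iter_mut().for_each(|v| *v /= norm);
        }
    }
}

/// Extract global (min, max) bounds over all finite values,
/// mirroring the non-finite filtering described above.
fn minmax_bounds(data: &[f32]) -> (f32, f32) {
    data.iter()
        .filter(|v| v.is_finite())
        .fold((f32::INFINITY, f32::NEG_INFINITY), |(lo, hi), &v| {
            (lo.min(v), hi.max(v))
        })
}
```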
Strategy B (distributed sampling): A three-phase protocol designed for multi-worker environments:
create_sample_tasks -- Plans the work by distributing fragments across workers using round-robin assignment. The total sample size is computed as the maximum of IVF, PQ, and SQ sample requirements, then divided evenly across workers with remainder distribution.
execute_sample_task -- Each worker independently samples vectors from its assigned fragments, filters non-finite values, and optionally computes local SQ bounds. For Cosine metric, data is normalized before SQ bound computation.
train_from_samples -- The master node concatenates all worker samples, merges SQ bounds globally (taking the min of all local mins and max of all local maxes), normalizes for Cosine if needed, then trains IVF centroids via KMeans and optionally trains PQ on residuals.
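A minimal sketch of the planning phase described above: fragments are assigned round-robin across workers, and the total sample budget is split evenly with the remainder spread over the first workers. The names (`SampleTask`, `plan_sample_tasks`) are illustrative only and assume `num_workers > 0`.

```rust
/// Illustrative task descriptor: which fragments a worker scans
/// and how many vectors it should sample from them.
struct SampleTask {
    fragment_ids: Vec<u64>,
    sample_size: usize,
}

/// Round-robin fragment assignment plus even sample-size division
/// with remainder distribution. Assumes num_workers > 0.
fn plan_sample_tasks(
    fragment_ids: &[u64],
    num_workers: usize,
    total_samples: usize,
) -> Vec<SampleTask> {
    let base = total_samples / num_workers;
    let rem = total_samples % num_workers;
    let mut tasks: Vec<SampleTask> = (0..num_workers)
        .map(|i| SampleTask {
            fragment_ids: Vec::new(),
            // The first `rem` workers each take one extra sample.
            sample_size: base + if i < rem { 1 } else { 0 },
        })
        .collect();
    for (i, &fid) in fragment_ids.iter().enumerate() {
        tasks[i % num_workers].fragment_ids.push(fid);
    }
    tasks
}
```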
Key design decisions:
SQ bounds are computed locally on each worker and merged globally, avoiding the need to send all raw data to the master for bound computation.
For PQ training in the distributed path, residuals are computed against the trained IVF centroids before building the PQ codebook, matching the existing single-node behavior.
Cosine distance is consistently remapped to L2 after normalization across both strategies.
The train_ivf_model function was promoted from private to pub(crate) to allow the distributed module to call it directly with pre-sampled data (bypassing the dataset-level sampling in build_ivf_model).
Memory management is handled explicitly: training data is dropped before PQ build, and a warning is logged when concatenated samples exceed 4 GB.
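The global SQ-bound merge from the list above can be sketched as follows: each worker reports a local (min, max) pair and the master takes the min of the mins and the max of the maxes. `merge_sq_bounds` is a hypothetical name for illustration, not a Lance API.

```rust
/// Merge per-worker (min, max) bounds into global bounds.
/// Returns None when no worker reported bounds.
fn merge_sq_bounds(local_bounds: &[(f64, f64)]) -> Option<(f64, f64)> {
    local_bounds
        .iter()
        .copied()
        .reduce(|(lo_a, hi_a), (lo_b, hi_b)| (lo_a.min(lo_b), hi_a.max(hi_b)))
}
```

Because only two scalars per worker cross the network, this keeps the bound aggregation step independent of sample size.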
Hi, distributed index training should already be supported, as build_ivf_model can accept a slice of fragment IDs. This is demonstrated here: #6296
My current view on this topic is that Lance, as a library, should not directly provide a built-in task abstraction. Instead, Lance should offer APIs based on fragment IDs. This would allow downstream users, such as the Python/Rust SDKs or Lance Spark, to build their own task or coordination abstractions.
Following the suggestion, closing this PR and moving on to #6363.