This PR adds a distributed vector index training module that enables IVF-based model training (IVF+PQ and IVF+SQ) across multiple worker nodes. It provides two strategies: a single-node local training path that wraps existing functions, and a distributed sampling path where workers independently sample fragments, compute local SQ bounds, and a master node aggregates the results to train the final models.
Problem
Lance's existing IVF/PQ/SQ training pipeline (build_ivf_model, build_pq_model) operates on a single node that must load all training data into memory. For large-scale datasets, this becomes a bottleneck because a single machine may not have enough memory or compute to sample and train efficiently. There was no mechanism to distribute the sampling and training workload across multiple workers, which is essential for production-scale vector index building.
Approach
The module introduces two training strategies behind a unified TrainedModels output type:
Strategy A (train_local): A thin wrapper that orchestrates existing build_ivf_model and build_pq_model functions on a single node. For SQ, it samples data, optionally normalizes for Cosine metric, builds a ScalarQuantizer, and extracts the global min/max bounds. This provides a clean API for single-node use cases.
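As a rough illustration of the SQ portion of Strategy A, the sketch below normalizes rows to unit length (as done for the Cosine metric) and extracts global min/max bounds while skipping non-finite values. `normalize_rows` and `minmax_bounds` are hypothetical helper names for illustration, not Lance APIs.

```rust
/// Normalize each row of a flat row-major buffer to unit L2 norm.
/// (Illustrative stand-in for the Cosine-metric normalization step.)
fn normalize_rows(data: &mut [f32], dim: usize) {
    for row in data.chunks_mut(dim) {
        let norm = row.iter().map(|v| v * v).sum::<f32>().sqrt();
        if norm > 0.0 {
            row.iter_mut().for_each(|v| *v /= norm);
        }
    }
}

/// Extract global (min, max) bounds over all finite values,
/// mirroring the non-finite filtering described above.
fn minmax_bounds(data: &[f32]) -> (f32, f32) {
    data.iter()
        .filter(|v| v.is_finite())
        .fold((f32::INFINITY, f32::NEG_INFINITY), |(lo, hi), &v| {
            (lo.min(v), hi.max(v))
        })
}
```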
Strategy B (distributed sampling): A three-phase protocol designed for multi-worker environments:
create_sample_tasks -- Plans the work by distributing fragments across workers using round-robin assignment. The total sample size is computed as the maximum of IVF, PQ, and SQ sample requirements, then divided evenly across workers with remainder distribution.
execute_sample_task -- Each worker independently samples vectors from its assigned fragments, filters non-finite values, and optionally computes local SQ bounds. For Cosine metric, data is normalized before SQ bound computation.
train_from_samples -- The master node concatenates all worker samples, merges SQ bounds globally (taking the min of all local mins and max of all local maxes), normalizes for Cosine if needed, then trains IVF centroids via KMeans and optionally trains PQ on residuals.
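A minimal sketch of the planning phase described above: fragments are assigned round-robin across workers, and the total sample budget is split evenly with the remainder spread over the first workers. The names (`SampleTask`, `plan_sample_tasks`) are illustrative only and assume `num_workers > 0`.

```rust
/// Illustrative task descriptor: which fragments a worker scans
/// and how many vectors it should sample from them.
struct SampleTask {
    fragment_ids: Vec<u64>,
    sample_size: usize,
}

/// Round-robin fragment assignment plus even sample-size division
/// with remainder distribution. Assumes num_workers > 0.
fn plan_sample_tasks(
    fragment_ids: &[u64],
    num_workers: usize,
    total_samples: usize,
) -> Vec<SampleTask> {
    let base = total_samples / num_workers;
    let rem = total_samples % num_workers;
    let mut tasks: Vec<SampleTask> = (0..num_workers)
        .map(|i| SampleTask {
            fragment_ids: Vec::new(),
            // The first `rem` workers each take one extra sample.
            sample_size: base + if i < rem { 1 } else { 0 },
        })
        .collect();
    for (i, &fid) in fragment_ids.iter().enumerate() {
        tasks[i % num_workers].fragment_ids.push(fid);
    }
    tasks
}
```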
Key design decisions:
SQ bounds are computed locally on each worker and merged globally, avoiding the need to send all raw data to the master for bound computation.
For PQ training in the distributed path, residuals are computed against the trained IVF centroids before building the PQ codebook, matching the existing single-node behavior.
Cosine distance is consistently remapped to L2 after normalization across both strategies.
The train_ivf_model function was promoted from private to pub(crate) to allow the distributed module to call it directly with pre-sampled data (bypassing the dataset-level sampling in build_ivf_model).
Memory management is handled explicitly: training data is dropped before PQ build, and a warning is logged when concatenated samples exceed 4 GB.
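The global SQ-bound merge from the list above can be sketched as follows: each worker reports a local (min, max) pair and the master takes the min of the mins and the max of the maxes. `merge_sq_bounds` is a hypothetical name for illustration, not a Lance API.

```rust
/// Merge per-worker (min, max) bounds into global bounds.
/// Returns None when no worker reported bounds.
fn merge_sq_bounds(local_bounds: &[(f64, f64)]) -> Option<(f64, f64)> {
    local_bounds
        .iter()
        .copied()
        .reduce(|(lo_a, hi_a), (lo_b, hi_b)| (lo_a.min(lo_b), hi_a.max(hi_b)))
}
```

Because only two scalars per worker cross the network, this keeps the bound aggregation step independent of sample size.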
Hi, distributed index training should already be supported, as build_ivf_model can accept a slice of fragment IDs. This is demonstrated here: #6296
My current view on this topic is that Lance, as a library, should not directly provide a built-in task abstraction. Instead, Lance should offer APIs based on fragment IDs. This would allow downstream users, such as the Python/Rust SDKs or Lance Spark, to build their own task or coordination abstractions.
Following the suggestion, closing this PR and moving on to #6363.