
feat: distributed index train #6334

Closed

summaryzb wants to merge 1 commit into lance-format:main from summaryzb:distributed_index_train

Conversation


@summaryzb summaryzb commented Mar 30, 2026

Summary

This PR adds a distributed vector index training module that enables IVF-based model training (IVF+PQ and IVF+SQ) across multiple worker nodes. It provides two strategies: a single-node local training path that wraps existing functions, and a distributed sampling path where workers independently sample fragments, compute local SQ bounds, and a master node aggregates the results to train the final models.

Problem

Lance's existing IVF/PQ/SQ training pipeline (build_ivf_model, build_pq_model) operates on a single node that must load all training data into memory. For large-scale datasets, this becomes a bottleneck because a single machine may not have enough memory or compute to sample and train efficiently. There was no mechanism to distribute the sampling and training workload across multiple workers, which is essential for production-scale vector index building.

Approach

The module introduces two training strategies behind a unified TrainedModels output type:

Strategy A (train_local): A thin wrapper that orchestrates existing build_ivf_model and build_pq_model functions on a single node. For SQ, it samples data, optionally normalizes for Cosine metric, builds a ScalarQuantizer, and extracts the global min/max bounds. This provides a clean API for single-node use cases.
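The SQ part of Strategy A can be sketched as follows. This is a minimal illustration of the described steps (normalize for Cosine, then extract global min/max bounds over finite values), not Lance's actual API; the function names `normalize` and `sq_bounds` are illustrative.

```rust
// Illustrative sketch of the single-node SQ path described above.

/// L2-normalize each vector in place. After normalization, Cosine
/// distance can be remapped to L2, as the PR describes.
fn normalize(vectors: &mut [Vec<f32>]) {
    for v in vectors.iter_mut() {
        let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
        if norm > 0.0 {
            v.iter_mut().for_each(|x| *x /= norm);
        }
    }
}

/// Global (min, max) over all finite components: the SQ bounds.
/// Non-finite values (NaN, inf) are skipped.
fn sq_bounds(vectors: &[Vec<f32>]) -> (f32, f32) {
    let mut min = f32::INFINITY;
    let mut max = f32::NEG_INFINITY;
    for v in vectors {
        for &x in v {
            if x.is_finite() {
                min = min.min(x);
                max = max.max(x);
            }
        }
    }
    (min, max)
}

fn main() {
    let mut sample = vec![vec![3.0_f32, 4.0], vec![-1.0, 0.0]];
    normalize(&mut sample); // [3,4] -> [0.6, 0.8]; [-1,0] stays [-1, 0]
    let (lo, hi) = sq_bounds(&sample);
    println!("bounds = ({lo}, {hi})");
}
```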

Strategy B (distributed sampling): A three-phase protocol designed for multi-worker environments:

  1. create_sample_tasks -- Plans the work by distributing fragments across workers in round-robin order. The total sample size is the maximum of the IVF, PQ, and SQ sample requirements, split evenly across workers, with the remainder assigned one extra sample each to the first workers.
  2. execute_sample_task -- Each worker independently samples vectors from its assigned fragments, filters non-finite values, and optionally computes local SQ bounds. For Cosine metric, data is normalized before SQ bound computation.
  3. train_from_samples -- The master node concatenates all worker samples, merges SQ bounds globally (taking the min of all local mins and max of all local maxes), normalizes for Cosine if needed, then trains IVF centroids via KMeans and optionally trains PQ on residuals.
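The planning phase (step 1) can be sketched as below. This is a hedged illustration of the described round-robin assignment and remainder handling, not the module's actual code; the `SampleTask` struct and `create_sample_tasks` signature are assumed for the example.

```rust
// Illustrative sketch of create_sample_tasks as described:
// round-robin fragment assignment plus an even sample-size split,
// spreading the remainder over the first workers.

#[derive(Debug)]
struct SampleTask {
    fragment_ids: Vec<u64>,
    sample_size: usize,
}

fn create_sample_tasks(
    fragments: &[u64],
    num_workers: usize,
    total_sample: usize, // max of the IVF, PQ, and SQ requirements
) -> Vec<SampleTask> {
    let base = total_sample / num_workers;
    let rem = total_sample % num_workers;
    let mut tasks: Vec<SampleTask> = (0..num_workers)
        .map(|i| SampleTask {
            fragment_ids: Vec::new(),
            // First `rem` workers take one extra sample.
            sample_size: base + usize::from(i < rem),
        })
        .collect();
    // Round-robin: fragment k goes to worker k % num_workers.
    for (k, &frag) in fragments.iter().enumerate() {
        tasks[k % num_workers].fragment_ids.push(frag);
    }
    tasks
}

fn main() {
    let tasks = create_sample_tasks(&[0, 1, 2, 3, 4], 2, 101);
    println!("{tasks:?}");
}
```

Each worker then runs step 2 on its `fragment_ids` only, so no node ever needs the full fragment list in memory.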

Key design decisions:

  • SQ bounds are computed locally on each worker and merged globally, avoiding the need to send all raw data to the master for bound computation.
  • For PQ training in the distributed path, residuals are computed against the trained IVF centroids before building the PQ codebook, matching the existing single-node behavior.
  • Cosine distance is consistently remapped to L2 after normalization across both strategies.
  • The train_ivf_model function was promoted from private to pub(crate) to allow the distributed module to call it directly with pre-sampled data (bypassing the dataset-level sampling in build_ivf_model).
  • Memory management is handled explicitly: training data is dropped before PQ build, and a warning is logged when concatenated samples exceed 4 GB.
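The first design decision, merging per-worker SQ bounds on the master, reduces to taking the min of all local mins and the max of all local maxes, so each worker ships only two floats rather than raw vectors. A minimal sketch (the `merge_sq_bounds` name is illustrative, not the module's API):

```rust
// Illustrative sketch of the master-side SQ bound merge: global bounds
// are the elementwise min/max of the workers' local (min, max) pairs.
// Returns None if no worker reported bounds.
fn merge_sq_bounds(local: &[(f32, f32)]) -> Option<(f32, f32)> {
    local
        .iter()
        .copied()
        .reduce(|(gmin, gmax), (lmin, lmax)| (gmin.min(lmin), gmax.max(lmax)))
}

fn main() {
    let locals = [(-0.3, 0.9), (-1.2, 0.5), (0.0, 1.4)];
    println!("{:?}", merge_sq_bounds(&locals));
}
```

This works because min and max are associative and commutative, so the merged result is identical to computing bounds over the concatenated raw samples.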

@github-actions github-actions bot added the java label Mar 30, 2026
@github-actions
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@summaryzb summaryzb force-pushed the distributed_index_train branch from 01734fb to e255b80 Compare March 30, 2026 09:40
@summaryzb summaryzb changed the title Distributed index train feat: distributed index train Mar 30, 2026
@github-actions github-actions bot added the enhancement New feature or request label Mar 30, 2026
@Xuanwo
Collaborator

Xuanwo commented Mar 31, 2026

Hi, distributed index training should already be supported, as build_ivf_model can accept a slice of fragment IDs. This is demonstrated here: #6296

My current view on this topic is that Lance, as a library, should not directly provide a built-in task abstraction. Instead, Lance should offer APIs based on fragment IDs. This would allow downstream users, such as the Python/Rust SDKs or Lance Spark, to build their own task or coordination abstractions.

@summaryzb
Author

@Xuanwo Thank you for reviewing. I will look into #6296 and figure out how to integrate it into the Java layer.

@summaryzb
Author

> Hi, distributed index training should already be supported, as build_ivf_model can accept a slice of fragment IDs. This is demonstrated here: #6296
>
> My current view on this topic is that Lance, as a library, should not directly provide a built-in task abstraction. Instead, Lance should offer APIs based on fragment IDs. This would allow downstream users, such as the Python/Rust SDKs or Lance Spark, to build their own task or coordination abstractions.

Following this suggestion, closing this PR and moving on to #6363.

@summaryzb summaryzb closed this Mar 31, 2026

Labels

enhancement New feature or request java
