Copilot AI commented Feb 1, 2026

Description

Comprehensive research documentation for learned index structures in vector search. Evaluates ML-based alternatives to HNSW: neural k-NN predictors, learned hash functions (SONG), GNN-enhanced navigation, and hybrid approaches. Includes implementation roadmap, benchmarks, and integration with existing ThemisDB components (LearnedQuantizer, LoRA-RAID, GPU infrastructure).

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📝 Documentation update
  • ♻️ Code refactoring (no functional changes)
  • ⚡ Performance improvement
  • ✅ Test addition or update
  • 🔧 Configuration change
  • 🎨 UI/UX change

Related Issues

Changes Made

New Research Document (LEARNED_INDEX_STRUCTURES_RESEARCH.md - 105KB, 3,505 lines):

  • Current State Analysis: HNSW performance characteristics, LearnedQuantizer capabilities, GPU support baseline
  • Five Learned Approaches:
    • Neural Approximate Nearest Neighbor (NANN) - end-to-end k-NN prediction
    • Learning to Hash - SONG (2-5x faster than FAISS-GPU), HashNet, DPSH
    • Learned Space Partitioning - ScaNN, neural IVF optimization
    • GNN-Enhanced HNSW - learned routing policies (+5-10% recall)
    • Hybrid architectures with traditional fallback
  • Implementation Roadmap: 5-phase plan (6-12 months), resource estimates ($150k-200k), Go/No-Go criteria (≥10% improvement)
  • Benchmark Framework: Datasets (SIFT1M, GIST1M, Deep1B), metrics (recall@k, latency, memory), baselines
  • Integration Design: C++ API with LibTorch/ONNX Runtime, training pipeline, monitoring, model versioning
  • Production Considerations: 10 risks with mitigation, hybrid fallback strategies, GPU requirements
  • State-of-the-Art: 15+ papers (Kraska SIGMOD'18, Prokhorenkova KDD'20, Jaiswal NeurIPS'22)
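As a concrete illustration of the recall@k metric named in the benchmark framework above, a minimal sketch (function name and toy data are illustrative, not from the research document):

```python
import numpy as np

def recall_at_k(predicted: np.ndarray, ground_truth: np.ndarray, k: int) -> float:
    """Fraction of the true k nearest neighbors recovered, averaged over queries."""
    hits = 0
    for pred, truth in zip(predicted, ground_truth):
        hits += len(set(pred[:k]) & set(truth[:k]))
    return hits / (len(predicted) * k)

# Toy example: 2 queries, an index's top-3 answers vs. the true top-3.
truth = np.array([[0, 1, 2], [5, 6, 7]])
pred  = np.array([[0, 2, 9], [6, 5, 7]])
print(recall_at_k(pred, truth, k=3))  # 5 of 6 true neighbors found -> ~0.833
```

On SIFT1M/GIST1M the `ground_truth` rows would come from the datasets' published exact-neighbor files.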

Updated Documentation:

  • docs/research/README.md: Added new entry, updated changelog to v3.1

Testing

Test Environment

  • OS: N/A (Documentation only)
  • Compiler: N/A
  • Build Type: N/A

Test Results

  • All existing tests pass
  • New tests added for changes
  • Manual testing performed

Test Commands

# No code changes - documentation only

Checklist

  • My code follows the coding standards
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published

Code Quality

  • Code builds without errors
  • Code builds without warnings
  • Static analysis (cppcheck) passes
  • No memory leaks detected
  • Code follows C++17 standards

Documentation

  • README.md updated (if applicable)
  • CHANGELOG.md updated
  • API documentation updated (if applicable)
  • Code comments added/updated

Branch Strategy Compliance

  • PR targets the correct branch (develop for features, main for releases/hotfixes)
  • Branch naming follows convention (e.g., feature/, bugfix/, hotfix/, release/)
  • No direct commits to main or develop

Performance Impact

  • No significant performance impact
  • Performance improvement (describe below)
  • Performance regression (justify below)

Performance Notes:
Documentation-only change. Performance implications discussed theoretically in research document (2-5x potential speedup with SONG, 5-10% recall improvement with GNN-enhanced navigation).

Breaking Changes

No breaking changes.

Security Considerations

  • No security implications
  • Security review required
  • Dependencies updated to secure versions

Additional Notes

Document Scope: Connects existing ThemisDB capabilities (LearnedQuantizer quantization, HNSW graph structure, LoRA-RAID multi-GPU, CUDA/HIP) with learned index literature. Provides actionable decision framework for Phase 1 prototype (3 months).

Key Decision Points:

  • Start with GNN-enhanced HNSW routing (lowest risk, 5-10% gain)
  • Evaluate SONG and deep hashing in parallel
  • Go/No-Go after benchmarks on SIFT1M/GIST1M (≥10% improvement threshold)

Synergies: Aligns with existing GNN research (GNN_BASED_INDEXING_AND_EMBEDDINGS.md), leverages production ML infrastructure.

Screenshots/Logs

N/A - Documentation only


For Maintainers:

Review Checklist

  • Code quality acceptable
  • Tests adequate
  • Documentation complete
  • No security concerns
  • Ready to merge

Merge Strategy

  • Squash and merge (✅ Recommended for feature/bugfix PRs - cleaner history)
  • Merge commit (Only for release/hotfix branches)
  • Rebase and merge
Original prompt

This section details the original issue to be resolved

<issue_title>[LEARNED INDEX]</issue_title>
<issue_description>## Learned Index Structures Research

Research Topic

Background

Current Indexing in ThemisDB

  • Method:
  • Type: [ ] Traditional/Algorithmic [ ] Learned/Neural [ ] Hybrid
  • Query Performance:
  • Index Size:

Problem Statement

  • Traditional Limitations:

  • Potential Benefits of Learned Indexes:

Research Focus

Learned Index Approaches

  • Neural Approximate Nearest Neighbor (NANN)

    • Train neural networks to predict ANN results
    • Papers: Plaut & Roughgarden (NeurIPS 2020)
    • Expected benefit: Adaptive to data distribution
  • Learning to Hash

    • Deep hashing with neural networks
    • Papers: Cao et al. (CVPR 2017), Liu et al. (IJCAI 2018)
    • Expected benefit: Compact binary codes, fast retrieval
  • Learned Space Partitioning

    • Neural networks for clustering/partitioning
    • Papers: Kraska et al. (SIGMOD 2018), Ferragina & Vinciguerra (VLDB 2020)
    • Expected benefit: Better space partitioning than k-means
  • End-to-End Learned Vector Search

    • Differentiable indexes trained end-to-end
    • Papers: Xu et al. (ICML 2021), Jaiswal et al. (NeurIPS 2022)
    • Expected benefit: Optimized for specific workload
  • Hybrid Learned/Traditional Indexes

    • Combine learned components with traditional indexes
    • Papers: Davitkova et al. (VLDB 2021)
    • Expected benefit: Best of both worlds
  • Graph Neural Networks for Vector Search

    • GNN-based navigation in vector graphs
    • Papers: Prokhorenkova et al. (KDD 2020)
    • Expected benefit: Better HNSW-style graph navigation
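The GNN-based navigation idea above can be sketched as a greedy graph walk whose routing decision is delegated to a scoring function. This is a toy sketch, not a ThemisDB implementation: the learned router is represented abstractly by `score_fn`, with exact distance used as a stand-in where a trained GNN edge scorer would go; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 16
vectors = rng.normal(size=(N, D)).astype(np.float32)

# Toy proximity graph: each node linked to its 8 nearest neighbors (brute force).
dists = np.linalg.norm(vectors[:, None] - vectors[None, :], axis=-1)
graph = np.argsort(dists, axis=1)[:, 1:9]

def greedy_search(query, entry, score_fn):
    """Greedy walk: move to the neighbor the scorer ranks best; stop at a local optimum."""
    current = entry
    while True:
        candidates = graph[current]
        best = candidates[np.argmin(score_fn(query, candidates))]
        if score_fn(query, [best])[0] >= score_fn(query, [current])[0]:
            return current
        current = best

# Baseline routing signal: exact distance. A learned router would replace this
# with a cheap model (e.g. a GNN edge scorer) trained to imitate or improve on it.
def exact_score(query, nodes):
    return np.linalg.norm(vectors[nodes] - query, axis=1)

q = rng.normal(size=D).astype(np.float32)
found = greedy_search(q, entry=0, score_fn=exact_score)
```

The claimed +5-10% recall comes from the learned scorer making better routing choices than raw distance at the same number of hops, per Prokhorenkova et al. (KDD 2020).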

Key Research Questions

  1. Performance vs Complexity Trade-off:

    • Does learning overhead justify performance gains?
    • Online learning vs. offline training?
  2. Generalization:

    • How well do learned indexes generalize to unseen queries?
    • Performance on out-of-distribution data?
  3. Adaptivity:

    • Can indexes adapt to changing data distributions?
    • Online updates vs. full retraining?
  4. Interpretability:

    • Are learned indexes interpretable?
    • Debugging and troubleshooting?
  5. Production Readiness:

    • Stability and reliability?
    • Hardware requirements (GPU)?

Technical Details

Learning to Hash - Deep Hashing

Concept:

# Train neural network to map vectors to binary codes
encoder = NeuralNetwork(input_dim=D, output_dim=B)  # D -> B bits
query_code = sign(encoder(query_vector))  # Binary code
# Hamming distance for fast retrieval
distances = hamming_distance(query_code, database_codes)

Advantages:

  • Compact representation (D dimensions → B bits, e.g., 1024D → 64 bits)
  • Fast Hamming distance computation (XOR + POPCOUNT)
  • GPU-friendly

Challenges:

  • Training requires labeled data or similarity supervision
  • Binary codes lose fine-grained distance information
  • Hash collisions
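A runnable sketch of the retrieval side of the concept above, under one loud assumption: the trained deep-hashing encoder is replaced by a random projection (classic LSH), since training is out of scope here. The packed-bit Hamming step (XOR + POPCOUNT) is the part that carries over unchanged to a learned encoder such as HashNet.

```python
import numpy as np

rng = np.random.default_rng(42)
D, B, N = 128, 64, 1000

# Stand-in for the trained encoder: a random projection (classic LSH).
# A deep-hashing model (e.g. HashNet) would replace this matrix.
W = rng.normal(size=(D, B)).astype(np.float32)

def encode(x):
    """Map vectors to B-bit codes, packed into bytes for XOR+POPCOUNT retrieval."""
    bits = (x @ W > 0).astype(np.uint8)   # sign step -> {0, 1}^B
    return np.packbits(bits, axis=-1)     # B bits -> B//8 bytes

database = rng.normal(size=(N, D)).astype(np.float32)
db_codes = encode(database)

# Query: a slight perturbation of database vector 7.
query = database[7] + 0.01 * rng.normal(size=D).astype(np.float32)
q_code = encode(query[None, :])

# Hamming distance = popcount(XOR) over the packed bytes.
hamming = np.unpackbits(q_code ^ db_codes, axis=-1).sum(axis=1)
print(int(np.argmin(hamming)))  # -> 7 (the near-duplicate wins)
```

At 64 bits per vector this is a 64x compression of a 1024-D float32 embedding, which is where the "compact binary codes" benefit comes from.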

Neural Approximate Nearest Neighbor (NANN)

Concept:

# Train network to predict k-NN directly
predictor = NeuralNetwork(input_dim=D, output_dim=k)
predicted_neighbors = predictor(query_vector)  # Top-k indices
# Refine with exact distance computation if needed

Advantages:

  • End-to-end optimization for specific dataset
  • Potentially better than hand-crafted algorithms
  • Adaptive to data distribution

Challenges:

  • Requires training data (queries + ground truth neighbors)
  • Inference cost (forward pass through network)
  • Model size (can be large for complex datasets)
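The predict-then-refine pattern from the concept above, as a self-contained sketch. Assumption, stated plainly: the trained k-NN predictor is stood in for by a cheap 8-dimensional random projection; a real NANN model would be a network trained on (query, true-neighbor) pairs. The two-stage structure, approximate candidate prediction followed by exact re-ranking, is the point being illustrated.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, k = 5000, 64, 10
database = rng.normal(size=(N, D)).astype(np.float32)

# Stand-in for the trained k-NN predictor: an 8-dim random projection.
P = rng.normal(size=(D, 8)).astype(np.float32)
db_proj = database @ P

def predict_candidates(query, n_candidates=100):
    """Approximate stage: rank by distance in the cheap projected space."""
    d = np.linalg.norm(db_proj - query @ P, axis=1)
    return np.argpartition(d, n_candidates)[:n_candidates]

def refine(query, candidates, k):
    """Exact stage: re-rank the candidate set with true distances."""
    d = np.linalg.norm(database[candidates] - query, axis=1)
    return candidates[np.argsort(d)[:k]]

query = rng.normal(size=D).astype(np.float32)
approx = refine(query, predict_candidates(query), k)
exact = np.argsort(np.linalg.norm(database - query, axis=1))[:k]
recall = len(set(approx) & set(exact)) / k  # recall@10 of the two-stage search
```

The "inference cost" challenge above corresponds to the `predict_candidates` forward pass, which in a real deployment is a network evaluation rather than a matrix product.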

Learned Space Partitioning

Concept:

# Traditional IVF: k-means clustering
# Learned IVF: Neural network predicts cluster assignments
cluster_predictor = NeuralNetwork(input_dim=D, output_dim=num_clusters)
cluster_probs = softmax(cluster_predictor(query_vector))
# Search top-k clusters with highest probability

Advantages:

  • Better clustering than k-means for complex distributions
  • Can learn non-convex cluster boundaries
  • Soft assignments (multiple clusters per query)

Challenges:

  • Training cost for large datasets
  • Balancing cluster sizes
  • Inference overhead
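A sketch of the soft-assignment search described above. Hedged assumption: the learned cluster predictor is stood in for by negative centroid distances (which reduces to plain IVF); a neural predictor would emit these logits instead, which is exactly where non-convex boundaries could improve on k-means. Names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, C = 2000, 32, 16
database = rng.normal(size=(N, D)).astype(np.float32)

# Build inverted lists from a hard assignment (IVF-style).
centroids = database[rng.choice(N, C, replace=False)]
assign = np.argmin(np.linalg.norm(database[:, None] - centroids[None], axis=-1), axis=1)
inverted_lists = [np.where(assign == c)[0] for c in range(C)]

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def search(query, n_probe=4, k=5):
    """Soft assignment: probe the n_probe clusters with highest probability."""
    logits = -np.linalg.norm(centroids - query, axis=1)  # stand-in for neural logits
    probs = softmax(logits)
    probe = np.argsort(probs)[::-1][:n_probe]
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    d = np.linalg.norm(database[candidates] - query, axis=1)
    return candidates[np.argsort(d)[:k]]

q = rng.normal(size=D).astype(np.float32)
result = search(q)  # top-5 ids drawn from the 4 most probable clusters
```

The "balancing cluster sizes" challenge shows up here directly: a skewed `inverted_lists` makes probe cost unpredictable, which a learned predictor must be regularized against.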

State-of-the-Art Research

Key Papers

1. The Case for Learned Index Structures

  • Authors: Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis
  • Venue: SIGMOD 2018
  • Key Innovation:...


Copilot AI and others added 2 commits February 1, 2026 10:17
Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>
Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>
Copilot AI changed the title [WIP] Research learned index structures for ThemisDB Add learned index structures research documentation Feb 1, 2026
Copilot AI requested a review from makr-code February 1, 2026 10:21

Development

Successfully merging this pull request may close these issues.

[LEARNED INDEX]

2 participants