Copilot AI commented Feb 1, 2026

Description

Comprehensive research documentation for learned index structures in vector search. Evaluates ML-based alternatives to HNSW: neural k-NN predictors, learned hash functions (SONG), GNN-enhanced navigation, and hybrid approaches. Includes implementation roadmap, benchmarks, and integration with existing ThemisDB components (LearnedQuantizer, LoRA-RAID, GPU infrastructure).

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📝 Documentation update
  • ♻️ Code refactoring (no functional changes)
  • ⚡ Performance improvement
  • ✅ Test addition or update
  • 🔧 Configuration change
  • 🎨 UI/UX change

Related Issues

Changes Made

New Research Document (LEARNED_INDEX_STRUCTURES_RESEARCH.md - 105KB, 3,505 lines):

  • Current State Analysis: HNSW performance characteristics, LearnedQuantizer capabilities, GPU support baseline
  • Five Learned Approaches:
    • Neural Approximate Nearest Neighbor (NANN) - end-to-end k-NN prediction
    • Learning to Hash - SONG (2-5x faster than FAISS-GPU), HashNet, DPSH
    • Learned Space Partitioning - ScaNN, neural IVF optimization
    • GNN-Enhanced HNSW - learned routing policies (+5-10% recall)
    • Hybrid architectures with traditional fallback
  • Implementation Roadmap: 5-phase plan (6-12 months), resource estimates ($150k-200k), Go/No-Go criteria (≥10% improvement)
  • Benchmark Framework: Datasets (SIFT1M, GIST1M, Deep1B), metrics (recall@k, latency, memory), baselines
  • Integration Design: C++ API with LibTorch/ONNX Runtime, training pipeline, monitoring, model versioning
  • Production Considerations: 10 risks with mitigation, hybrid fallback strategies, GPU requirements
  • State-of-the-Art: 15+ papers (Kraska SIGMOD'18, Prokhorenkova KDD'20, Jaiswal NeurIPS'22)
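As a concrete illustration of the recall@k metric named in the benchmark framework above, a minimal sketch (function name and toy data are illustrative, not from the research document):

```python
import numpy as np

def recall_at_k(predicted: np.ndarray, ground_truth: np.ndarray, k: int) -> float:
    """Fraction of the true k nearest neighbors recovered, averaged over queries."""
    hits = 0
    for pred, truth in zip(predicted, ground_truth):
        hits += len(set(pred[:k]) & set(truth[:k]))
    return hits / (len(predicted) * k)

# Toy example: 2 queries, an index's top-3 answers vs. the true top-3.
truth = np.array([[0, 1, 2], [5, 6, 7]])
pred  = np.array([[0, 2, 9], [6, 5, 7]])
print(recall_at_k(pred, truth, k=3))  # 5 of 6 true neighbors found -> ~0.833
```

On SIFT1M/GIST1M the `ground_truth` rows would come from the datasets' published exact-neighbor files.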

Updated Documentation:

  • docs/research/README.md: Added new entry, updated changelog to v3.1

Testing

Test Environment

  • OS: N/A (Documentation only)
  • Compiler: N/A
  • Build Type: N/A

Test Results

  • All existing tests pass
  • New tests added for changes
  • Manual testing performed

Test Commands

# No code changes - documentation only

Checklist

  • My code follows the coding standards
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published

Code Quality

  • Code builds without errors
  • Code builds without warnings
  • Static analysis (cppcheck) passes
  • No memory leaks detected
  • Code follows C++17 standards

Documentation

  • README.md updated (if applicable)
  • CHANGELOG.md updated
  • API documentation updated (if applicable)
  • Code comments added/updated

Branch Strategy Compliance

  • PR targets the correct branch (develop for features, main for releases/hotfixes)
  • Branch naming follows convention (e.g., feature/, bugfix/, hotfix/, release/)
  • No direct commits to main or develop

Performance Impact

  • No significant performance impact
  • Performance improvement (describe below)
  • Performance regression (justify below)

Performance Notes:
Documentation-only change. Performance implications discussed theoretically in research document (2-5x potential speedup with SONG, 5-10% recall improvement with GNN-enhanced navigation).

Breaking Changes

No breaking changes.

Security Considerations

  • No security implications
  • Security review required
  • Dependencies updated to secure versions

Additional Notes

Document Scope: Connects existing ThemisDB capabilities (LearnedQuantizer quantization, HNSW graph structure, LoRA-RAID multi-GPU, CUDA/HIP) with learned index literature. Provides actionable decision framework for Phase 1 prototype (3 months).

Key Decision Points:

  • Start with GNN-enhanced HNSW routing (lowest risk, 5-10% gain)
  • Evaluate SONG and deep hashing in parallel
  • Go/No-Go after benchmarks on SIFT1M/GIST1M (≥10% improvement threshold)

Synergies: Aligns with existing GNN research (GNN_BASED_INDEXING_AND_EMBEDDINGS.md), leverages production ML infrastructure.

Screenshots/Logs

N/A - Documentation only


For Maintainers:

Review Checklist

  • Code quality acceptable
  • Tests adequate
  • Documentation complete
  • No security concerns
  • Ready to merge

Merge Strategy

  • Squash and merge (✅ Recommended for feature/bugfix PRs - cleaner history)
  • Merge commit (Only for release/hotfix branches)
  • Rebase and merge
Original prompt

This section details the original issue to be resolved

<issue_title>[LEARNED INDEX]</issue_title>
<issue_description>## Learned Index Structures Research

Research Topic

Background

Current Indexing in ThemisDB

  • Method:
  • Type: [ ] Traditional/Algorithmic [ ] Learned/Neural [ ] Hybrid
  • Query Performance:
  • Index Size:

Problem Statement

  • Traditional Limitations:

  • Potential Benefits of Learned Indexes:

Research Focus

Learned Index Approaches

  • Neural Approximate Nearest Neighbor (NANN)

    • Train neural networks to predict ANN results
    • Papers: Plaut & Roughgarden (NeurIPS 2020)
    • Expected benefit: Adaptive to data distribution
  • Learning to Hash

    • Deep hashing with neural networks
    • Papers: Cao et al. (CVPR 2017), Liu et al. (IJCAI 2018)
    • Expected benefit: Compact binary codes, fast retrieval
  • Learned Space Partitioning

    • Neural networks for clustering/partitioning
    • Papers: Kraska et al. (SIGMOD 2018), Ferragina & Vinciguerra (VLDB 2020)
    • Expected benefit: Better space partitioning than k-means
  • End-to-End Learned Vector Search

    • Differentiable indexes trained end-to-end
    • Papers: Xu et al. (ICML 2021), Jaiswal et al. (NeurIPS 2022)
    • Expected benefit: Optimized for specific workload
  • Hybrid Learned/Traditional Indexes

    • Combine learned components with traditional indexes
    • Papers: Davitkova et al. (VLDB 2021)
    • Expected benefit: Best of both worlds
  • Graph Neural Networks for Vector Search

    • GNN-based navigation in vector graphs
    • Papers: Prokhorenkova et al. (KDD 2020)
    • Expected benefit: Better HNSW-style graph navigation
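The GNN-based navigation idea above can be sketched as a greedy graph walk whose routing decision is delegated to a scoring function. This is a toy sketch, not a ThemisDB implementation: the learned router is represented abstractly by `score_fn`, with exact distance used as a stand-in where a trained GNN edge scorer would go; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 16
vectors = rng.normal(size=(N, D)).astype(np.float32)

# Toy proximity graph: each node linked to its 8 nearest neighbors (brute force).
dists = np.linalg.norm(vectors[:, None] - vectors[None, :], axis=-1)
graph = np.argsort(dists, axis=1)[:, 1:9]

def greedy_search(query, entry, score_fn):
    """Greedy walk: move to the neighbor the scorer ranks best; stop at a local optimum."""
    current = entry
    while True:
        candidates = graph[current]
        best = candidates[np.argmin(score_fn(query, candidates))]
        if score_fn(query, [best])[0] >= score_fn(query, [current])[0]:
            return current
        current = best

# Baseline routing signal: exact distance. A learned router would replace this
# with a cheap model (e.g. a GNN edge scorer) trained to imitate or improve on it.
def exact_score(query, nodes):
    return np.linalg.norm(vectors[nodes] - query, axis=1)

q = rng.normal(size=D).astype(np.float32)
found = greedy_search(q, entry=0, score_fn=exact_score)
```

The claimed +5-10% recall comes from the learned scorer making better routing choices than raw distance at the same number of hops, per Prokhorenkova et al. (KDD 2020).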

Key Research Questions

  1. Performance vs Complexity Trade-off:

    • Does learning overhead justify performance gains?
    • Online learning vs. offline training?
  2. Generalization:

    • How well do learned indexes generalize to unseen queries?
    • Performance on out-of-distribution data?
  3. Adaptivity:

    • Can indexes adapt to changing data distributions?
    • Online updates vs. full retraining?
  4. Interpretability:

    • Are learned indexes interpretable?
    • Debugging and troubleshooting?
  5. Production Readiness:

    • Stability and reliability?
    • Hardware requirements (GPU)?

Technical Details

Learning to Hash - Deep Hashing

Concept:

# Train neural network to map vectors to binary codes
encoder = NeuralNetwork(input_dim=D, output_dim=B)  # D -> B bits
query_code = sign(encoder(query_vector))  # Binary code
# Hamming distance for fast retrieval
distances = hamming_distance(query_code, database_codes)

Advantages:

  • Compact representation (D dimensions → B bits, e.g., 1024D → 64 bits)
  • Fast Hamming distance computation (XOR + POPCOUNT)
  • GPU-friendly

Challenges:

  • Training requires labeled data or similarity supervision
  • Binary codes lose fine-grained distance information
  • Hash collisions
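A runnable sketch of the retrieval side of the concept above, under one loud assumption: the trained deep-hashing encoder is replaced by a random projection (classic LSH), since training is out of scope here. The packed-bit Hamming step (XOR + POPCOUNT) is the part that carries over unchanged to a learned encoder such as HashNet.

```python
import numpy as np

rng = np.random.default_rng(42)
D, B, N = 128, 64, 1000

# Stand-in for the trained encoder: a random projection (classic LSH).
# A deep-hashing model (e.g. HashNet) would replace this matrix.
W = rng.normal(size=(D, B)).astype(np.float32)

def encode(x):
    """Map vectors to B-bit codes, packed into bytes for XOR+POPCOUNT retrieval."""
    bits = (x @ W > 0).astype(np.uint8)   # sign step -> {0, 1}^B
    return np.packbits(bits, axis=-1)     # B bits -> B//8 bytes

database = rng.normal(size=(N, D)).astype(np.float32)
db_codes = encode(database)

# Query: a slight perturbation of database vector 7.
query = database[7] + 0.01 * rng.normal(size=D).astype(np.float32)
q_code = encode(query[None, :])

# Hamming distance = popcount(XOR) over the packed bytes.
hamming = np.unpackbits(q_code ^ db_codes, axis=-1).sum(axis=1)
print(int(np.argmin(hamming)))  # -> 7 (the near-duplicate wins)
```

At 64 bits per vector this is a 64x compression of a 1024-D float32 embedding, which is where the "compact binary codes" benefit comes from.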

Neural Approximate Nearest Neighbor (NANN)

Concept:

# Train network to predict k-NN directly
predictor = NeuralNetwork(input_dim=D, output_dim=k)
predicted_neighbors = predictor(query_vector)  # Top-k indices
# Refine with exact distance computation if needed

Advantages:

  • End-to-end optimization for specific dataset
  • Potentially better than hand-crafted algorithms
  • Adaptive to data distribution

Challenges:

  • Requires training data (queries + ground truth neighbors)
  • Inference cost (forward pass through network)
  • Model size (can be large for complex datasets)
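The predict-then-refine pattern from the concept above, as a self-contained sketch. Assumption, stated plainly: the trained k-NN predictor is stood in for by a cheap 8-dimensional random projection; a real NANN model would be a network trained on (query, true-neighbor) pairs. The two-stage structure, approximate candidate prediction followed by exact re-ranking, is the point being illustrated.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, k = 5000, 64, 10
database = rng.normal(size=(N, D)).astype(np.float32)

# Stand-in for the trained k-NN predictor: an 8-dim random projection.
P = rng.normal(size=(D, 8)).astype(np.float32)
db_proj = database @ P

def predict_candidates(query, n_candidates=100):
    """Approximate stage: rank by distance in the cheap projected space."""
    d = np.linalg.norm(db_proj - query @ P, axis=1)
    return np.argpartition(d, n_candidates)[:n_candidates]

def refine(query, candidates, k):
    """Exact stage: re-rank the candidate set with true distances."""
    d = np.linalg.norm(database[candidates] - query, axis=1)
    return candidates[np.argsort(d)[:k]]

query = rng.normal(size=D).astype(np.float32)
approx = refine(query, predict_candidates(query), k)
exact = np.argsort(np.linalg.norm(database - query, axis=1))[:k]
recall = len(set(approx) & set(exact)) / k  # recall@10 of the two-stage search
```

The "inference cost" challenge above corresponds to the `predict_candidates` forward pass, which in a real deployment is a network evaluation rather than a matrix product.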

Learned Space Partitioning

Concept:

# Traditional IVF: k-means clustering
# Learned IVF: Neural network predicts cluster assignments
cluster_predictor = NeuralNetwork(input_dim=D, output_dim=num_clusters)
cluster_probs = softmax(cluster_predictor(query_vector))
# Search top-k clusters with highest probability

Advantages:

  • Better clustering than k-means for complex distributions
  • Can learn non-convex cluster boundaries
  • Soft assignments (multiple clusters per query)

Challenges:

  • Training cost for large datasets
  • Balancing cluster sizes
  • Inference overhead
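A sketch of the soft-assignment search described above. Hedged assumption: the learned cluster predictor is stood in for by negative centroid distances (which reduces to plain IVF); a neural predictor would emit these logits instead, which is exactly where non-convex boundaries could improve on k-means. Names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, C = 2000, 32, 16
database = rng.normal(size=(N, D)).astype(np.float32)

# Build inverted lists from a hard assignment (IVF-style).
centroids = database[rng.choice(N, C, replace=False)]
assign = np.argmin(np.linalg.norm(database[:, None] - centroids[None], axis=-1), axis=1)
inverted_lists = [np.where(assign == c)[0] for c in range(C)]

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def search(query, n_probe=4, k=5):
    """Soft assignment: probe the n_probe clusters with highest probability."""
    logits = -np.linalg.norm(centroids - query, axis=1)  # stand-in for neural logits
    probs = softmax(logits)
    probe = np.argsort(probs)[::-1][:n_probe]
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    d = np.linalg.norm(database[candidates] - query, axis=1)
    return candidates[np.argsort(d)[:k]]

q = rng.normal(size=D).astype(np.float32)
result = search(q)  # top-5 ids drawn from the 4 most probable clusters
```

The "balancing cluster sizes" challenge shows up here directly: a skewed `inverted_lists` makes probe cost unpredictable, which a learned predictor must be regularized against.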

State-of-the-Art Research

Key Papers

1. The Case for Learned Index Structures

  • Authors: Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, Neoklis Polyzotis
  • Venue: SIGMOD 2018
  • Key Innovation:...


Copilot AI and others added 2 commits February 1, 2026 10:17
Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>
Co-authored-by: makr-code <150588092+makr-code@users.noreply.github.com>
Copilot AI changed the title [WIP] Research learned index structures for ThemisDB Add learned index structures research documentation Feb 1, 2026
Copilot AI requested a review from makr-code February 1, 2026 10:21

Development

Successfully merging this pull request may close these issues.

[LEARNED INDEX]

2 participants