Skip to content

Improve markNodeDeleted: O(DELETION_LD) in-place repair via Algorithm 5 + Algorithm 6 (IP-DiskANN) #646

@Kartikk1127

Description

@Kartikk1127

Problem

The current markNodeDeleted + cleanup() workflow has O(N) cost per deletion:

  • markNodeDeleted only flips a bit in the deleted set
  • cleanup() scans every node in the graph via nodeStream to find in-neighbors
    of the deleted node, then rebuilds their neighbor lists

This means deletion cost degrades linearly as the graph grows, and crucially it
grows over time as more deletions accumulate.

Solution

The IP-DiskANN paper (arXiv:2502.13826) describes two algorithms that solve this:

Algorithm 5 — In-place deletion repair:
Instead of scanning all N nodes to find in-neighbors, run a GreedySearch toward
the deleted node's vector. Nodes the search visits are the approximate in-neighbors.
This reduces in-neighbor discovery from O(N) to O(DELETION_LD) where DELETION_LD
is the beam width of the search.

The sequence per deletion:

  1. Flip the deleted bit
  2. Update entry point if deleted node is the current entry point
  3. GreedySearch toward deleted node's vector with beam width DELETION_LD
  4. For each visited node z where z → deleted_node exists: repair z's neighbor
    list using the top-DELETION_LD search results as replacement candidates
  5. Physically remove the node

Algorithm 6 — Dangling edge sweep:
Algorithm 5 repairs in-neighbors found via the search path, but greedy search
is approximate and may miss some. Algorithm 6 is a periodic O(N × M) sweep
(no distance calculations) that removes any remaining out-edges pointing to
absent nodes.

Benchmark Results (SIFT-1M, M=16, efConstruction=200, efSearch=200)

100K deletions (10% of index), 1000 query vectors, topK=10:

Batch Deleted Avg/delete Recall@10 Degradation
1/10 10,000 1.55ms 0.9517 0.17%
2/10 20,000 1.54ms 0.9508 0.26%
5/10 50,000 1.52ms 0.9459 0.75%
10/10 100,000 1.49ms 0.9279 2.55%

Baseline recall: 0.9534 → Post-deletion recall: 0.9279 (2.55% degradation)

Key observations:

  • Per-deletion cost is flat across all 10 batches (1.49–1.55ms) — does not grow as deletions accumulate
  • Old O(N) nodeStream approach: ~23ms/delete initially, growing to ~39ms — and compounding
  • 100K deletions completed in 2.5 minutes vs projected ~60 minutes with old approach
  • Recall degradation stays well within acceptable bounds at 10% deletion rate

API changes

markNodeDeleted becomes self-contained — no cleanup() call needed after deletion.
cleanup() is still required before writing to disk.

consolidateDanglingEdges() is a new public method for Algorithm 6 execution.

Implementation

I have a working implementation with all existing tests passing (mvn test -Pjdk21
and mvn -Pjdk11 test). Here's the PR for it: #648

References

@marianotepper

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions