Skip to content

Self healing#273

Open
j-rafique wants to merge 1 commit intomasterfrom
self-healing
Open

Self healing#273
j-rafique wants to merge 1 commit intomasterfrom
self-healing

Conversation

@j-rafique
Copy link
Copy Markdown
Contributor

No description provided.

@roomote-v0
Copy link
Copy Markdown

roomote-v0 Bot commented Feb 23, 2026

Rooviewer Checkmark   View task

Reviewed commit fb202ed ("Self healing"). The self-healing feature is well-structured with a clean three-phase protocol (Request/Verify/Commit), solid test coverage across unit, E2E, and system tests, and good use of lease-claim event processing with exponential backoff. Previous review items on rebalance_worker.go are no longer applicable -- that file was removed in this refactor.

New items from this review:

  • Typo: verificationTimout constant (line 64) and its usage (line 646) in supernode/self_healing/service.go -- should be verificationTimeout
  • Duplication: parseHostAndPort is duplicated between supernode/self_healing/service.go:1098 and supernode/storage_challenge/service.go:498 -- the storage-challenge version handles more edge cases (IPv6, bracketed literals). Extract to a shared package.
  • Duplication: pickActionAnchorKey is duplicated verbatim in supernode/self_healing/service.go:950 and supernode/transport/grpc/self_healing/handler.go:668. Extract to a shared location.
  • Dead code: _ = ctx on lines 378 and 395 of supernode/self_healing/service.go -- ctx is a function parameter and won't trigger unused-variable warnings. Safe to remove.

Previous items (no longer applicable -- rebalance_worker.go removed in this refactor):

  • Rebalance worker can delete original copies (is_original=true) without guarding against it
  • Indentation mismatch in the new IsOriginal guard and surrounding delete-confirm block
Previous reviews

Mention @roomote in a comment to request specific changes to this pull request or fix all unresolved issues.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces “self-healing” behavior in the P2P DHT by adding a rebalance worker and new probe/status plumbing, while also refining chain-state eligibility gating (routing vs store/write) and bootstrap behavior to better recover from transient chain/API issues.

Changes:

  • Add a rebalance background worker that scans local keys, heals under-replicated keys, and deletes redundant replicas when safe.
  • Introduce side-effect-free local key status APIs (RetrieveBatchLocalStatus, ListLocalKeysPage) plus a new network message (BatchProbeKeys) to probe key presence across peers.
  • Split chain-state eligibility into routing-eligible vs store-eligible allowlists (Active+Postponed vs Active-only), and adjust bootstrap + batch-store selection accordingly.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
supernode/adaptors/lumera.go Forces top-supernode selection to be Active-only for write/finalization eligibility.
p2p/kademlia/store/sqlite/replication.go Adds sqlite implementations for local key status + paged key listing.
p2p/kademlia/store/mem/mem.go Adds mem-store implementations for local key status + paged key listing (currently inconsistent with hex-key expectations).
p2p/kademlia/store.go Extends store interface with side-effect-free local key status APIs and defines LocalKeyStatus.
p2p/kademlia/rebalance_worker.go New rebalance worker that probes holders, heals replicas, and deletes redundant local copies after confirmation.
p2p/kademlia/node_activity.go Refines node activity handling to distinguish routing eligibility vs store/replication eligibility.
p2p/kademlia/network.go Adds BatchProbeKeys handler + timeouts/dispatch wiring.
p2p/kademlia/message.go Adds new gob-registered message types for BatchProbeKeys.
p2p/kademlia/dht_batch_store_test.go Updates expected error text for new “eligible store peers” behavior.
p2p/kademlia/dht.go Adds store allowlist, store-eligibility gating in write paths, starts rebalance worker, and adjusts batch-store candidate selection.
p2p/kademlia/bootstrap.go Splits routing vs store allowlists from chain state (Active+Postponed vs Active-only) and makes bootstrap sync more resilient with periodic retries.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread p2p/kademlia/store/mem/mem.go Outdated
Comment thread p2p/kademlia/store/mem/mem.go Outdated
Comment thread p2p/kademlia/rebalance_worker.go Outdated
Comment thread p2p/kademlia/network.go Outdated
Comment thread p2p/kademlia/rebalance_worker.go Outdated
Comment thread p2p/kademlia/rebalance_worker.go Outdated
Comment thread p2p/kademlia/rebalance_worker.go Outdated
Copy link
Copy Markdown

@roomote-v0 roomote-v0 Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incremental review of changes since 3bf7e44. The self-healing feature is well-structured with solid test coverage. A few items worth addressing -- mostly around code duplication and a typo.

defaultEventWorkers = 8

requestTimeout = 20 * time.Second
verificationTimout = 20 * time.Second
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit / typo: verificationTimout is missing an e -- should be verificationTimeout.

Suggested change
verificationTimout = 20 * time.Second
verificationTimeout = 20 * time.Second

return client.CommitSelfHealing(cctx, req)
}

func parseHostAndPort(address string, defaultPort int) (host string, port int, ok bool) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplication: parseHostAndPort is duplicated here and in supernode/storage_challenge/service.go:498. The storage-challenge version is more robust (handles IPv6 bracketed literals, zone IDs, stray-bracket rejection). This copy silently drops those edge cases.

Consider extracting the storage-challenge version into a shared internal package (e.g. pkg/netutil) and importing it from both call sites.

Comment thread supernode/self_healing/service.go
Comment thread supernode/self_healing/service.go
Comment thread supernode/self_healing/service.go
Copy link
Copy Markdown

@roomote-v0 roomote-v0 Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving. The self-healing feature is well-architected with a clean three-phase protocol, solid test coverage, and good defensive coding patterns. The items flagged (typo, code duplication, dead code) are all minor and non-blocking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants