Bootstrap and replication fixes #259
base: master
Conversation
Reviewed the latest changes (documentation updates and logging level adjustment). No issues found.
Pull request overview
This PR addresses critical P2P replication and routing issues in a Kademlia DHT implementation by introducing bounded replication windows and routing-eligibility gating, and by reducing log noise.
Changes:
- Adds chain-state-based routing eligibility checks to prevent non-active supernodes from participating in the DHT routing table
- Implements bounded replication key scanning (200k keys per cycle) with stable cursor advancement to enable incremental progress on large backlogs (see the sketch after this list)
- Refactors replication key assignment to use one-pass logic instead of per-peer re-scanning, reducing complexity and resource usage
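A minimal sketch of the bounded-window idea, under assumed names: the `KeyWithTimestamp` struct, `keyScanner` interface, and `replicationCursor` type below are illustrative stand-ins, not the repository's actual API.

```go
package replication

import (
	"context"
	"time"
)

// Illustrative stand-in for the store's bounded replication query result.
type KeyWithTimestamp struct {
	Key       string
	CreatedAt time.Time
}

type keyScanner interface {
	// ScanKeys returns at most maxKeys keys created after `from`,
	// ordered by (createdAt, key).
	ScanKeys(ctx context.Context, from time.Time, maxKeys int) ([]KeyWithTimestamp, error)
}

const replicationScanLimit = 200_000 // bounded window per replication cycle

type replicationCursor struct {
	lastSeen time.Time // upper bound of the previously processed window
}

// nextWindow fetches one bounded window and advances the cursor only as far
// as the newest key actually returned, so a large backlog drains incrementally
// across cycles without skipping keys.
func (c *replicationCursor) nextWindow(ctx context.Context, s keyScanner) ([]KeyWithTimestamp, error) {
	keys, err := s.ScanKeys(ctx, c.lastSeen, replicationScanLimit)
	if err != nil {
		return nil, err
	}
	if len(keys) > 0 {
		c.lastSeen = keys[len(keys)-1].CreatedAt
	}
	return keys, nil
}
```

Because the cursor only advances to the `createdAt` of the last key returned, the store-side query has to include every remaining key that shares that boundary timestamp; that is the limit-extension behaviour discussed in the review comments below.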
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| p2p/kademlia/bootstrap.go | Returns active supernode IDs as allowlist from chain state and triggers routing table pruning |
| p2p/kademlia/dht.go | Implements routing eligibility gating infrastructure and filters ineligible nodes from routing operations |
| p2p/kademlia/node_activity.go | Adds eligibility checks to node activity monitoring and ping success handlers |
| p2p/kademlia/replication.go | Implements bounded key scanning, one-pass key assignment, and stable window-based cursor advancement |
| p2p/kademlia/store.go | Updates interface to accept maxKeys parameter for bounded queries |
| p2p/kademlia/store/mem/mem.go | Updates signature to match new interface |
| p2p/kademlia/store/sqlite/replication.go | Implements bounded query with limit extension for same-timestamp keys and context cancellation checks |
| p2p/kademlia/store/sqlite/sqlite.go | Reduces log noise by treating context timeout/cancellation as Info instead of Error |
| p2p/kademlia/redundant_data.go | Updates call to use unlimited keys for cleanup operations |
| docs/p2p-replication-and-routing-changes-2026-01-13.md | Comprehensive documentation of changes |
```go
}
}

if err := ctx.Err(); err != nil {
```
Copilot AI · Jan 15, 2026
The context error check at line 295 occurs after all database operations complete. If the context is cancelled during the database query, the query still runs to completion before this check detects it. Consider checking ctx.Err() before executing queries, or passing ctx into the query execution (if the database driver supports it) to enable early cancellation.
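A hedged sketch of that suggestion, assuming `database/sql`; the `getKeysForReplication` wrapper and the main query text are illustrative, not the store's actual code (only the same-timestamp `extraQuery` is visible in the diff).

```go
package sqlite

import (
	"context"
	"database/sql"
	"time"
)

// getKeysForReplication is an illustrative wrapper: check ctx before running
// the query and hand ctx to the driver so cancellation can abort it mid-flight.
func getKeysForReplication(ctx context.Context, db *sql.DB, from time.Time, maxKeys int) (*sql.Rows, error) {
	// Bail out before touching the database if the caller already gave up.
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	const q = `SELECT key, createdAt FROM data WHERE createdAt > ? ORDER BY createdAt ASC, key ASC LIMIT ?`
	// QueryContext propagates ctx to the driver, so a cancelled context can
	// interrupt the in-flight query instead of letting it run to completion.
	return db.QueryContext(ctx, q, from, maxKeys)
}
```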
```go
if len(results) == maxKeys {
	boundAt := results[len(results)-1].CreatedAt
	boundKey := results[len(results)-1].Key

	var extra []domain.KeyWithTimestamp
	extraQuery := `SELECT key, createdAt FROM data WHERE createdAt = ? AND key > ? ORDER BY key ASC`
```
Copilot AI · Jan 15, 2026
The limit extension logic for same-timestamp keys (lines 269-285) is a critical correctness mechanism but lacks test coverage. This logic ensures that when a limit is hit, all keys with the same createdAt as the boundary are included to prevent cursor advancement from skipping keys. Add test cases covering: (1) limit hit with no additional same-timestamp keys, (2) limit hit with multiple additional same-timestamp keys, and (3) proper ordering when keys have identical timestamps.
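A minimal, self-contained sketch of the first two cases; `extendSameTimestamp` is a stand-in that only models the property under test (every remaining key sharing the boundary createdAt must end up in the window), not the sqlite store's real query path.

```go
package sqlite_test

import (
	"testing"
	"time"
)

type keyTS struct {
	Key       string
	CreatedAt time.Time
}

// Stand-in for the store-side extension logic: given a window that hit the
// limit, pull in remaining keys that share the boundary createdAt so cursor
// advancement cannot skip them.
func extendSameTimestamp(window, rest []keyTS) []keyTS {
	if len(window) == 0 {
		return window
	}
	bound := window[len(window)-1].CreatedAt
	for _, k := range rest {
		if k.CreatedAt.Equal(bound) {
			window = append(window, k)
		}
	}
	return window
}

func TestLimitHitWithNoExtraSameTimestampKeys(t *testing.T) {
	ts := time.Date(2026, 1, 13, 0, 0, 0, 0, time.UTC)
	window := []keyTS{{Key: "a", CreatedAt: ts.Add(-time.Hour)}, {Key: "b", CreatedAt: ts}}
	rest := []keyTS{{Key: "c", CreatedAt: ts.Add(time.Hour)}}

	if got := extendSameTimestamp(window, rest); len(got) != 2 {
		t.Fatalf("window should be unchanged, got %d keys", len(got))
	}
}

func TestLimitHitWithMultipleExtraSameTimestampKeys(t *testing.T) {
	ts := time.Date(2026, 1, 13, 0, 0, 0, 0, time.UTC)
	window := []keyTS{{Key: "a", CreatedAt: ts}}
	rest := []keyTS{{Key: "b", CreatedAt: ts}, {Key: "c", CreatedAt: ts}}

	if got := extendSameTimestamp(window, rest); len(got) != 3 {
		t.Fatalf("expected all same-timestamp keys to be included, got %d", len(got))
	}
}
```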
```go
assignedKeys := make(map[string][]string, len(peerStart))
for i := 0; i < len(replicationKeys); i++ {
	decKey, _ := hex.DecodeString(replicationKeys[i].Key)
```
Copilot AI · Jan 15, 2026
The hex.DecodeString error is silently ignored on line 269. If the key cannot be decoded, it will silently produce an empty/incorrect byte slice and potentially cause incorrect routing decisions. Consider logging a warning or skipping keys that fail to decode.
Suggested change:

```diff
-decKey, _ := hex.DecodeString(replicationKeys[i].Key)
+decKey, err := hex.DecodeString(replicationKeys[i].Key)
+if err != nil {
+	continue
+}
```
```go
if id == "" {
	continue
}
if h, err := utils.Blake3Hash([]byte(id)); err == nil && len(h) == 32 {
```
Copilot AI · Jan 15, 2026
The Blake3Hash error is checked but not logged when it fails. If hashing fails for a supernode ID, that node will be silently excluded from the routing allowlist, which could cause unexpected behavior. Consider logging a warning when hash computation fails for debugging purposes.
Suggested change:

```diff
-if h, err := utils.Blake3Hash([]byte(id)); err == nil && len(h) == 32 {
+h, err := utils.Blake3Hash([]byte(id))
+if err != nil {
+	logtrace.Warn(ctx, "Failed to compute Blake3 hash for supernode ID", logtrace.Fields{
+		logtrace.FieldModule: "p2p",
+		logtrace.FieldError:  err.Error(),
+		"supernode":          sn.SupernodeAccount,
+	})
+} else if len(h) == 32 {
```
```go
if os.Getenv("INTEGRATION_TEST") == "true" {
	return
}
```
Copilot AI · Jan 15, 2026
The INTEGRATION_TEST environment variable check is repeated in multiple functions (setRoutingAllowlist, eligibleForRouting, filterEligibleNodes, pruneIneligibleRoutingPeers, addNode). Consider extracting this to a helper function or storing it as a boolean field in the DHT struct during initialization to avoid repeated environment variable lookups and improve maintainability.
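One possible shape for that refactor, sketched with illustrative names (`dhtGate`, `newDHTGate`) rather than the package's real types:

```go
package kademlia

import "os"

// Read the integration-test flag once at construction instead of calling
// os.Getenv in every routing-eligibility helper.
type dhtGate struct {
	integrationTest bool
}

func newDHTGate() *dhtGate {
	return &dhtGate{integrationTest: os.Getenv("INTEGRATION_TEST") == "true"}
}

// gatingEnabled reports whether routing-eligibility gating applies; it is
// skipped entirely under integration tests, matching the current behaviour.
func (g *dhtGate) gatingEnabled() bool {
	return !g.integrationTest
}
```

Caching the flag does mean a change to INTEGRATION_TEST at runtime would no longer be picked up, which is presumably acceptable for a test-only switch.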
```go
logtrace.Warn(ctx, "routing allowlist update skipped (empty)", logtrace.Fields{
	logtrace.FieldModule: "p2p",
})
```
Copilot AI · Jan 15, 2026
The empty allowlist check on line 146 prevents lockout but could mask legitimate scenarios where no active supernodes exist in the chain state. Consider differentiating between 'no allowlist available yet' (bootstrap phase) and 'zero active supernodes returned from chain' (potential chain state issue) with different log levels or messages.
Suggested change:

```diff
-logtrace.Warn(ctx, "routing allowlist update skipped (empty)", logtrace.Fields{
-	logtrace.FieldModule: "p2p",
-})
+// Differentiate between bootstrap (no allowlist applied yet) and a potential
+// chain-state issue (we previously had an allowlist, but now see zero peers).
+if !s.routingAllowReady.Load() {
+	logtrace.Info(ctx, "routing allowlist from chain is empty; assuming bootstrap phase and leaving gating disabled", logtrace.Fields{
+		logtrace.FieldModule: "p2p",
+	})
+} else {
+	logtrace.Warn(ctx, "routing allowlist update skipped: chain returned zero active supernodes; retaining previous allowlist", logtrace.Fields{
+		logtrace.FieldModule: "p2p",
+	})
+}
```
```go
peerStart := make(map[string]time.Time)
for _, info := range repInfo {
	if !info.Active {
		continue
	}
	start := historicStart
	if info.LastReplicatedAt != nil {
		start = *info.LastReplicatedAt
	}
	peerStart[string(info.ID)] = start
}
```
Copilot AI · Jan 15, 2026
The new one-pass key assignment logic (lines 255-280) represents a significant algorithmic change from the previous per-peer scanning approach. This core replication logic lacks adequate test coverage to verify correctness of key assignment across multiple peers with different lastReplicatedAt values, especially edge cases like keys at window boundaries or peers with no prior replication history.
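A hedged sketch of one such test. `assignKeysOnePass` below is a stand-in that only models the window check against `peerStart` (it omits the XOR-distance selection the real code performs after hex-decoding each key), so the test focuses on what the comment highlights: a peer with no prior replication history, a caught-up peer, and a key sitting exactly on the window boundary.

```go
package kademlia_test

import (
	"testing"
	"time"
)

type keyTS struct {
	Key       string
	CreatedAt time.Time
}

// Stand-in for the one-pass assignment: walk the key window once and append
// each key to every peer whose replication window starts at or before the
// key's createdAt.
func assignKeysOnePass(keys []keyTS, peerStart map[string]time.Time) map[string][]string {
	assigned := make(map[string][]string, len(peerStart))
	for _, k := range keys {
		for peerID, start := range peerStart {
			if !k.CreatedAt.Before(start) {
				assigned[peerID] = append(assigned[peerID], k.Key)
			}
		}
	}
	return assigned
}

func TestOnePassAssignmentRespectsPeerWindows(t *testing.T) {
	historicStart := time.Date(2020, 1, 1, 0, 0, 0, 0, time.UTC)
	recent := time.Date(2026, 1, 13, 0, 0, 0, 0, time.UTC)

	peerStart := map[string]time.Time{
		"peer-new":    historicStart, // no prior replication history
		"peer-caught": recent,        // LastReplicatedAt == recent
	}
	keys := []keyTS{
		{Key: "old", CreatedAt: recent.Add(-time.Hour)},
		{Key: "boundary", CreatedAt: recent}, // exactly on the window boundary
	}

	got := assignKeysOnePass(keys, peerStart)
	if len(got["peer-new"]) != 2 {
		t.Fatalf("peer with no history should receive both keys, got %v", got["peer-new"])
	}
	if len(got["peer-caught"]) != 1 || got["peer-caught"][0] != "boundary" {
		t.Fatalf("caught-up peer should only receive the boundary key, got %v", got["peer-caught"])
	}
}
```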
```go
if !s.eligibleForRouting(node) {
	if info.Active {
		s.removeNode(ctx, node)
		if uerr := s.store.UpdateIsActive(ctx, string(node.ID), false, false); uerr != nil {
```
Copilot AI · Jan 15, 2026
The eligibility check followed by removeNode and UpdateIsActive (lines 56-66 and 139-152) could create a race condition if the routing allowlist is updated between the eligibility check and the database update. Consider adding proper synchronization or documenting that this is acceptable given the bootstrap refresh cadence.
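A minimal sketch of one way to make the check-then-update sequence race-aware, assuming an illustrative `allowlistState` wrapper rather than the DHT's actual fields: take a snapshot plus a version before the eligibility check, and only persist the deactivation if the version is unchanged.

```go
package kademlia

import "sync"

// Illustrative allowlist holder; the real DHT stores this differently.
type allowlistState struct {
	mu      sync.RWMutex
	version uint64
	allowed map[string]struct{}
}

// snapshot returns the current allowlist together with its version.
func (a *allowlistState) snapshot() (map[string]struct{}, uint64) {
	a.mu.RLock()
	defer a.mu.RUnlock()
	return a.allowed, a.version
}

// stillCurrent reports whether the allowlist has been replaced since the
// given version was observed.
func (a *allowlistState) stillCurrent(version uint64) bool {
	a.mu.RLock()
	defer a.mu.RUnlock()
	return a.version == version
}

// update installs a new allowlist and bumps the version.
func (a *allowlistState) update(allowed map[string]struct{}) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.allowed = allowed
	a.version++
}
```

The caller would capture the version before eligibleForRouting and skip the UpdateIsActive write if stillCurrent returns false; given the bootstrap refresh cadence, simply documenting the race as acceptable is a reasonable alternative, as the comment notes.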