
Conversation

@j-rafique
Contributor

No description provided.

@roomote

roomote bot commented Jan 13, 2026


Reviewed the latest changes (documentation updates and logging level adjustment). No issues found.

Changes in this update:

  • Documentation updated to reference the PR URL and correct the log level description
  • Logging level for context deadline/canceled errors changed from Debug to Info for better operational visibility
  • Removed verification section from documentation

Mention @roomote in a comment to request specific changes to this pull request or fix all unresolved issues.

roomote[bot] previously approved these changes Jan 13, 2026

Copilot AI left a comment


Pull request overview

This PR addresses critical P2P replication and routing issues in a Kademlia DHT implementation by introducing bounded replication windows, routing eligibility gating, and reducing log noise.

Changes:

  • Adds chain-state-based routing eligibility checks to prevent non-active supernodes from participating in the DHT routing table
  • Implements bounded replication key scanning (200k keys per cycle) with stable cursor advancement to enable incremental progress on large backlogs (sketched after this list)
  • Refactors replication key assignment to use one-pass logic instead of per-peer re-scanning, reducing complexity and resource usage
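
A minimal sketch of the bounded-window scan described in the bullets above. Everything here is illustrative rather than the PR's actual code: the constant maxKeysPerCycle, the method scanReplicationWindow, and the store call GetKeysForReplication(ctx, from, to, maxKeys) are assumed names.

```go
const maxKeysPerCycle = 200_000 // per the PR description: at most 200k keys per cycle

// scanReplicationWindow fetches at most maxKeysPerCycle keys newer than the
// cursor and returns the advanced cursor, so each cycle makes stable,
// incremental progress on a large backlog instead of re-scanning everything.
func (s *DHT) scanReplicationWindow(ctx context.Context, cursor time.Time) ([]domain.KeyWithTimestamp, time.Time, error) {
	keys, err := s.store.GetKeysForReplication(ctx, cursor, time.Now(), maxKeysPerCycle)
	if err != nil {
		return nil, cursor, err
	}
	if len(keys) == 0 {
		return nil, cursor, nil // nothing new; the cursor stays put
	}
	// Results are ordered by (createdAt, key); advancing the cursor to the
	// newest timestamp seen resumes the next cycle where this one stopped.
	return keys, keys[len(keys)-1].CreatedAt, nil
}
```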

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Summary per file:

  • p2p/kademlia/bootstrap.go: Returns active supernode IDs as an allowlist from chain state and triggers routing-table pruning
  • p2p/kademlia/dht.go: Implements routing eligibility gating infrastructure and filters ineligible nodes from routing operations
  • p2p/kademlia/node_activity.go: Adds eligibility checks to node activity monitoring and ping-success handlers
  • p2p/kademlia/replication.go: Implements bounded key scanning, one-pass key assignment, and stable window-based cursor advancement
  • p2p/kademlia/store.go: Updates the store interface to accept a maxKeys parameter for bounded queries
  • p2p/kademlia/store/mem/mem.go: Updates the signature to match the new interface
  • p2p/kademlia/store/sqlite/replication.go: Implements the bounded query, with limit extension for same-timestamp keys and context cancellation checks
  • p2p/kademlia/store/sqlite/sqlite.go: Reduces log noise by treating context timeout/cancellation as Info instead of Error (see the sketch after this list)
  • p2p/kademlia/redundant_data.go: Updates the call to use unlimited keys for cleanup operations
  • docs/p2p-replication-and-routing-changes-2026-01-13.md: Comprehensive documentation of the changes
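
A minimal sketch of the sqlite.go log-level change noted above: context cancellation and deadline errors are classified as expected operational events. The helper name logQueryError is hypothetical, and logtrace.Error is assumed to mirror the Warn/Info signatures that appear in the review comments below.

```go
func logQueryError(ctx context.Context, err error) {
	// Context timeout/cancellation is an expected operational event, not a
	// storage failure, so log it at Info to cut noise.
	if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
		logtrace.Info(ctx, "query interrupted by context", logtrace.Fields{
			logtrace.FieldModule: "p2p",
			logtrace.FieldError:  err.Error(),
		})
		return
	}
	logtrace.Error(ctx, "query failed", logtrace.Fields{
		logtrace.FieldModule: "p2p",
		logtrace.FieldError:  err.Error(),
	})
}
```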


```go
}
}

if err := ctx.Err(); err != nil {
```

Copilot AI Jan 15, 2026


The context error check at line 295 occurs after all database operations complete. If the context is cancelled during the database query, the query will continue to completion before this check detects it. Consider checking ctx.Err() before executing queries or use ctx as part of the query execution (if the database driver supports it) to enable early cancellation.
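
A sketch of this suggestion, assuming a database/sql-style driver; the query text is illustrative, modeled on the extraQuery shown in the next comment.

```go
// Passing ctx into the query itself lets the driver abort the statement
// early on cancellation, instead of detecting it only after the scan loop
// has run to completion.
rows, err := db.QueryContext(ctx,
	`SELECT key, createdAt FROM data WHERE createdAt > ? ORDER BY createdAt ASC, key ASC LIMIT ?`,
	from, maxKeys)
if err != nil {
	return nil, err // surfaces context.Canceled / context.DeadlineExceeded promptly
}
defer rows.Close()
```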

Comment on lines +269 to +274
```go
if len(results) == maxKeys {
	boundAt := results[len(results)-1].CreatedAt
	boundKey := results[len(results)-1].Key

	var extra []domain.KeyWithTimestamp
	extraQuery := `SELECT key, createdAt FROM data WHERE createdAt = ? AND key > ? ORDER BY key ASC`
```

Copilot AI Jan 15, 2026


The limit extension logic for same-timestamp keys (lines 269-285) is a critical correctness mechanism but lacks test coverage. This logic ensures that when a limit is hit, all keys with the same createdAt as the boundary are included to prevent cursor advancement from skipping keys. Add test cases covering: (1) limit hit with no additional same-timestamp keys, (2) limit hit with multiple additional same-timestamp keys, and (3) proper ordering when keys have identical timestamps.
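
A sketch of the table-driven coverage this asks for. The helpers distinctTimestamps (n rows, all distinct createdAt), boundaryTies (n rows, of which the given number share the limit-boundary createdAt), and queryBounded are hypothetical stand-ins for the store's API.

```go
func TestLimitExtensionAtSameTimestamp(t *testing.T) {
	cases := []struct {
		name     string
		rows     []domain.KeyWithTimestamp // pre-seeded store contents
		maxKeys  int
		wantKeys int
	}{
		{"limit hit, no same-timestamp extras", distinctTimestamps(5), 3, 3},
		{"limit hit, two extras share the boundary timestamp", boundaryTies(5, 2), 3, 5},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			got := queryBounded(t, tc.rows, tc.maxKeys)
			if len(got) != tc.wantKeys {
				t.Fatalf("got %d keys, want %d", len(got), tc.wantKeys)
			}
			// Ties on createdAt must come back in ascending key order, so a
			// cursor set to the last createdAt seen can never skip a tied key.
			ordered := sort.SliceIsSorted(got, func(i, j int) bool {
				if got[i].CreatedAt.Equal(got[j].CreatedAt) {
					return got[i].Key < got[j].Key
				}
				return got[i].CreatedAt.Before(got[j].CreatedAt)
			})
			if !ordered {
				t.Fatal("results not ordered by (createdAt, key)")
			}
		})
	}
}
```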


```go
assignedKeys := make(map[string][]string, len(peerStart))
for i := 0; i < len(replicationKeys); i++ {
	decKey, _ := hex.DecodeString(replicationKeys[i].Key)
```

Copilot AI Jan 15, 2026


The hex.DecodeString error is silently ignored on line 269. If the key cannot be decoded, decKey will be an empty or truncated byte slice, potentially causing incorrect routing decisions. Consider logging a warning or skipping keys that fail to decode.

Suggested change
```diff
-decKey, _ := hex.DecodeString(replicationKeys[i].Key)
+decKey, err := hex.DecodeString(replicationKeys[i].Key)
+if err != nil {
+	continue
+}
```

```go
if id == "" {
	continue
}
if h, err := utils.Blake3Hash([]byte(id)); err == nil && len(h) == 32 {
```

Copilot AI Jan 15, 2026


The Blake3Hash error is checked but not logged when it fails. If hashing fails for a supernode ID, that node will be silently excluded from the routing allowlist, which could cause unexpected behavior. Consider logging a warning when hash computation fails for debugging purposes.

Suggested change
```diff
-if h, err := utils.Blake3Hash([]byte(id)); err == nil && len(h) == 32 {
+h, err := utils.Blake3Hash([]byte(id))
+if err != nil {
+	logtrace.Warn(ctx, "Failed to compute Blake3 hash for supernode ID", logtrace.Fields{
+		logtrace.FieldModule: "p2p",
+		logtrace.FieldError:  err.Error(),
+		"supernode":          sn.SupernodeAccount,
+	})
+} else if len(h) == 32 {
```
Comment on lines +142 to +144
```go
if os.Getenv("INTEGRATION_TEST") == "true" {
	return
}
```

Copilot AI Jan 15, 2026


The INTEGRATION_TEST environment variable check is repeated in multiple functions (setRoutingAllowlist, eligibleForRouting, filterEligibleNodes, pruneIneligibleRoutingPeers, addNode). Consider extracting this to a helper function or storing it as a boolean field in the DHT struct during initialization to avoid repeated environment variable lookups and improve maintainability.
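
One way to implement this, reading the flag once at construction; the field name isIntegrationTest and constructor name newDHT are illustrative assumptions.

```go
type DHT struct {
	// ...existing fields elided...
	isIntegrationTest bool // cached result of the INTEGRATION_TEST env check
}

func newDHT() *DHT {
	return &DHT{
		isIntegrationTest: os.Getenv("INTEGRATION_TEST") == "true",
	}
}
```

Call sites such as eligibleForRouting and pruneIneligibleRoutingPeers would then branch on s.isIntegrationTest instead of re-reading the environment on every routing decision.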

Comment on lines +147 to +149
```go
logtrace.Warn(ctx, "routing allowlist update skipped (empty)", logtrace.Fields{
	logtrace.FieldModule: "p2p",
})
```

Copilot AI Jan 15, 2026


The empty allowlist check on line 146 prevents lockout but could mask legitimate scenarios where no active supernodes exist in the chain state. Consider differentiating between 'no allowlist available yet' (bootstrap phase) and 'zero active supernodes returned from chain' (potential chain state issue) with different log levels or messages.

Suggested change
```diff
-logtrace.Warn(ctx, "routing allowlist update skipped (empty)", logtrace.Fields{
-	logtrace.FieldModule: "p2p",
-})
+// Differentiate between bootstrap (no allowlist applied yet) and a potential
+// chain-state issue (we previously had an allowlist, but now see zero peers).
+if !s.routingAllowReady.Load() {
+	logtrace.Info(ctx, "routing allowlist from chain is empty; assuming bootstrap phase and leaving gating disabled", logtrace.Fields{
+		logtrace.FieldModule: "p2p",
+	})
+} else {
+	logtrace.Warn(ctx, "routing allowlist update skipped: chain returned zero active supernodes; retaining previous allowlist", logtrace.Fields{
+		logtrace.FieldModule: "p2p",
+	})
+}
```

Comment on lines +255 to +265
```go
peerStart := make(map[string]time.Time)
for _, info := range repInfo {
	if !info.Active {
		continue
	}
	start := historicStart
	if info.LastReplicatedAt != nil {
		start = *info.LastReplicatedAt
	}
	peerStart[string(info.ID)] = start
}
```

Copilot AI Jan 15, 2026


The new one-pass key assignment logic (lines 255-280) represents a significant algorithmic change from the previous per-peer scanning approach. This core replication logic lacks adequate test coverage to verify correctness of key assignment across multiple peers with different lastReplicatedAt values, especially edge cases like keys at window boundaries or peers with no prior replication history.
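
A sketch of the kind of edge-case test this calls for. ReplicationInfo's shape is inferred from the snippet above, and assignKeysOnePass is a hypothetical stand-in for the refactored assignment logic.

```go
func TestOnePassAssignmentWindows(t *testing.T) {
	historicStart := time.Unix(0, 0)
	t0 := time.Now().Add(-time.Hour)
	repInfo := []ReplicationInfo{
		{ID: []byte("fresh"), Active: true},                            // no history: window opens at historicStart
		{ID: []byte("caught-up"), Active: true, LastReplicatedAt: &t0}, // window opens at t0
		{ID: []byte("inactive"), Active: false, LastReplicatedAt: &t0},
	}
	keys := []domain.KeyWithTimestamp{
		{Key: "aa", CreatedAt: t0},                  // exactly on caught-up's window boundary
		{Key: "bb", CreatedAt: t0.Add(time.Minute)}, // strictly inside caught-up's window
	}
	got := assignKeysOnePass(historicStart, repInfo, keys)
	if n := len(got["fresh"]); n != 2 {
		t.Fatalf("peer with no history should be assigned all keys, got %d", n)
	}
	if _, ok := got["inactive"]; ok {
		t.Fatal("inactive peers must receive no assignments")
	}
	// The boundary case: whether a key stamped exactly at lastReplicatedAt is
	// re-assigned is precisely the semantics such a test should pin down.
}
```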

Comment on lines +56 to +59
```go
if !s.eligibleForRouting(node) {
	if info.Active {
		s.removeNode(ctx, node)
		if uerr := s.store.UpdateIsActive(ctx, string(node.ID), false, false); uerr != nil {
```

Copilot AI Jan 15, 2026


The eligibility check followed by removeNode and UpdateIsActive (lines 56-66 and 139-152) could create a race condition if the routing allowlist is updated between the eligibility check and the database update. Consider adding proper synchronization or documenting that this is acceptable given the bootstrap refresh cadence.
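
One hedged way to narrow that window: snapshot the allowlist once under a read lock and drive both the check and the follow-up update from the snapshot, so the two steps observe the same allowlist generation even if a bootstrap refresh lands in between. The field names allowMu and routingAllowlist are assumptions.

```go
// Take one snapshot of the allowlist; it is treated as immutable once
// published, so the reference can be used lock-free after this point.
s.allowMu.RLock()
allow := s.routingAllowlist // map[string]struct{}
s.allowMu.RUnlock()

if _, ok := allow[string(node.ID)]; !ok && info.Active {
	s.removeNode(ctx, node)
	if uerr := s.store.UpdateIsActive(ctx, string(node.ID), false, false); uerr != nil {
		logtrace.Warn(ctx, "failed to mark ineligible node inactive", logtrace.Fields{
			logtrace.FieldModule: "p2p",
			logtrace.FieldError:  uerr.Error(),
		})
	}
}
```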
