fix: ttl-aware reservation cleanup prevents node isolation #3015
Conversation
|
@claude review this PR |
|
Claude finished @netsirius's task in 3m 45s
Code Review ✓
Summary
This PR effectively solves the stale reservation problem by making expired reservations invisible rather than requiring explicit cleanup. The TTL-aware approach is cleaner and more robust than the previous
Critical Issues
1. Race Condition in
|
|
The race condition in should_accept() is pre-existing (not introduced by this PR). What do you think, @claude? This should be handled in a separate issue. FYI @iduartgomez |
|
Claude finished @netsirius's task in 1m 3s
Analyzing race condition context
Analysis
You're correct: the race condition is pre-existing and not introduced by this PR. Here's what I found:
Original Code (main branch)
let reserved_before = self.pending_reservations.read().len(); // Line 274
// ... logic ...
{
let mut pending = self.pending_reservations.write();
pending.insert(addr, (location, Instant::now())); // Line 306
}
let total_conn = reserved_before.checked_add(1)... // Line 310
This PR's Changes
This PR only changed lines 262-267 to add TTL filtering when counting reservations:
let now = Instant::now();
let reserved_before = self
.pending_reservations
.read()
.iter()
.filter(|(_, (_, created))| now.duration_since(*created) <= PENDING_RESERVATION_TTL)
.count();
The TOCTOU pattern (
Recommendation
This should NOT block the PR. Here's why:
However, I recommend two things:
1. Actually, the TTL filtering does improve the race slightly
My original review incorrectly claimed this PR worsens the race. In fact, the TTL filtering reduces the attack surface:
So if anything, this PR makes a small improvement to the race condition by not counting stale reservations.
2. Create a follow-up issue for the race
I suggest filing a separate issue with this context:
Title:
Description:
## Problem
`should_accept()` has a TOCTOU race between reading `reserved_before` and inserting the new reservation. Under concurrent `ConnectOp` load (via `process_message` task spawning), multiple threads can:
1. Read the same `reserved_before` count (e.g., 18)
2. Each calculate `total_conn = 18 + 1 + open = 19 < max_connections`
3. All accept, causing `pending_reservations.len()` to exceed `max_connections`
## Impact
- Brief violations of `max_connections` capacity limit on gateways
- Potential resource exhaustion under sustained concurrent connection attempts
- Severity: Medium (affects robustness, not correctness)
## Solution
Move reservation insertion before capacity check, or hold write lock during entire decision:
```rust
// Option 1: Insert first, check after
{
let mut pending = self.pending_reservations.write();
if pending.contains_key(&addr) { return true; }
pending.insert(addr, (location, Instant::now()));
}
let reserved_after = self.pending_reservations.read().len();
let total_conn = reserved_after.checked_add(open)?;
// ... capacity checks, cleanup on rejection ...
```
Context
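The second option named in the Solution above (holding the write lock during the entire decision) could look roughly like the following. This is a minimal sketch under stated assumptions, not the project's code: `ReservationTable` is a hypothetical stand-in for the real connection manager, it uses `std::sync::RwLock` rather than whatever lock type the codebase uses, it stores only the creation time per address, and it treats "live reservations + open connections + 1 must not exceed `max_connections`" as the capacity rule.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::sync::RwLock;
use std::time::{Duration, Instant};

// 60s is the TTL stated in the PR description; the constant is redefined here
// only to keep the sketch self-contained.
const PENDING_RESERVATION_TTL: Duration = Duration::from_secs(60);

/// Hypothetical, simplified stand-in for the real connection manager.
struct ReservationTable {
    max_connections: usize,
    /// Address -> time at which the reservation was created.
    pending: RwLock<HashMap<SocketAddr, Instant>>,
}

impl ReservationTable {
    fn new(max_connections: usize) -> Self {
        Self {
            max_connections,
            pending: RwLock::new(HashMap::new()),
        }
    }

    /// Counts live reservations and inserts the new one while holding a single
    /// write lock, so two callers can never act on the same stale count.
    fn should_accept(&self, addr: SocketAddr, open: usize) -> bool {
        let now = Instant::now();
        let mut pending = self.pending.write().unwrap();

        // A still-fresh reservation for this address is simply re-accepted.
        if let Some(created) = pending.get(&addr) {
            if now.duration_since(*created) <= PENDING_RESERVATION_TTL {
                return true;
            }
        }

        // Only reservations younger than the TTL count towards capacity.
        let reserved = pending
            .values()
            .filter(|created| now.duration_since(**created) <= PENDING_RESERVATION_TTL)
            .count();

        // Assumed capacity rule: live reservations + open connections + this one.
        let total = match reserved.checked_add(open).and_then(|n| n.checked_add(1)) {
            Some(n) => n,
            None => return false,
        };
        if total > self.max_connections {
            return false;
        }

        // The lock is still held, so the count above cannot have changed.
        pending.insert(addr, now);
        true
    }
}
```

The trade-off in this sketch is that every call takes the write lock, which serializes concurrent `should_accept()` calls in exchange for a capacity check that cannot be overshot.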
|
|
@netsirius let's open an issue for the race condition before merging this, and then we can merge this if it is good |
|
If we can have the suggested sim network tests that would be good. |
|
@iduartgomez I just added sim network integration tests for the retry-after-TTL scenario. Two integration tests were added to simulation_smoke.rs, with supporting methods on SimNetwork:
|
Problem
When a ConnectOp reserves a gateway slot via should_accept() but the actual connection fails (e.g., NAT traversal failure, blocked acceptor), the pending reservation persists indefinitely. has_connection_or_pending() treats stale reservations as active connections, causing is_not_connected() to filter out the gateway on retry. This creates permanent node isolation: the node believes it's connected to the gateway but has no actual connection.

A previous workaround bypassed is_not_connected() filtering when open_conns == 0, but this was fragile: it only helped fully isolated nodes and ignored partial isolation where some connections exist but a stale reservation blocks reconnection to a specific gateway.

Solution
Replace the open_conns == 0 workaround with TTL-aware reservation visibility. has_connection_or_pending() now ignores reservations older than PENDING_RESERVATION_TTL (60s), making stale entries invisible without requiring explicit cleanup. This handles all isolation scenarios uniformly, both full and partial.

Changes
connection_manager.rs: TTL check in has_connection_or_pending()
- inject_reservation() method for test state injection
- age_reservation test helper for clean TTL manipulation
connect.rs: Removed open_conns == 0 override
- is_not_connected() filtering now applies uniformly since TTL handles staleness
testing_impl.rs: SimNetwork infrastructure for connection state inspection
- ConnectionManager exposed per node via shared_cm rendezvous pattern
- has_connection_or_pending(), inject_stale_reservation(), connection_count() accessors
simulation_smoke.rs: Integration test
- test_stale_reservation_ttl_expiry: verifies fresh reservations are visible, expired ones are invisible, and established connections are unaffected

Fixes
Closes #2888
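
The TTL-aware visibility described under Solution can be illustrated with a small sketch. Everything here is an assumption made for illustration and is not taken from the PR's diff: `ConnState` and `has_connection_or_pending_at` are hypothetical names, the reservation map stores only a creation time, and the current time is passed in as a parameter so the expiry behaviour is easy to exercise without sleeping.

```rust
use std::collections::{HashMap, HashSet};
use std::net::SocketAddr;
use std::time::{Duration, Instant};

// 60s is the TTL stated in the description above.
const PENDING_RESERVATION_TTL: Duration = Duration::from_secs(60);

/// Hypothetical, simplified view of a node's connection state.
struct ConnState {
    /// Established connections.
    open: HashSet<SocketAddr>,
    /// Pending reservations: address -> creation time.
    pending: HashMap<SocketAddr, Instant>,
}

impl ConnState {
    /// Returns true for an established connection or a reservation that is
    /// still within the TTL. Expired reservations are treated as absent, so
    /// no explicit cleanup pass is needed for them to stop blocking retries.
    fn has_connection_or_pending_at(&self, addr: &SocketAddr, now: Instant) -> bool {
        if self.open.contains(addr) {
            return true;
        }
        match self.pending.get(addr) {
            Some(created) => {
                now.saturating_duration_since(*created) <= PENDING_RESERVATION_TTL
            }
            None => false,
        }
    }
}
```

With this shape, a gateway whose reservation was created 61 seconds ago is no longer reported as connected or pending, so a retry would not filter it out, while an established connection is reported regardless of age.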