doublezerod: add periodic kernel route reconciliation #3672

Open

nikw9944 wants to merge 4 commits into main from nikw9944/doublezero-3669

Conversation

Contributor

@nikw9944 nikw9944 commented May 5, 2026

Resolves: #3669

Summary of Changes

  • Add a periodic route reconciliation goroutine to the liveness manager that scans the kernel routing table every 30s (configurable via --route-liveness-reconcile-interval), detects BGP routes that should already be installed but are missing, and reinstalls them
  • This mitigates the case where another process or an administrator mistakenly deletes a doublezero route from the kernel routing table
  • Add doublezero_liveness_route_reinstalls_total and doublezero_liveness_route_install_failures_total Prometheus metrics to track reinstalls and failures
  • Prevent a TOCTOU race in active mode: re-check installed state under lock before each reinstall so reconcileRoutes cannot resurrect a route that onSessionDown intentionally withdrew between snapshot and reinstall (see the sketch after this list)
  • Include the source IP in the kernel route lookup key so routes with the same (table, dst, nexthop) but different source IPs are matched independently in multi-interface setups
  • Promote "session down (passive; keeping route)" log messages from Debug to Info; otherwise the Info-level log can show several consecutive "liveness: session up" messages with no visible down event between them
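
A minimal sketch of the source-IP-aware lookup key and the locked re-check described above; all type, field, and method names below are assumptions for illustration, not the PR's actual code:

```go
package liveness

import "sync"

// kernelKey is a hypothetical shape for the route lookup key; including
// src means routes that differ only in source IP are tracked separately.
type kernelKey struct {
	table   int
	dst     string // destination prefix
	nexthop string
	src     string // source IP, added by this PR
}

type manager struct {
	mu        sync.Mutex
	installed map[kernelKey]bool // routes the manager believes are in the kernel
}

// reinstallIfStillInstalled closes the TOCTOU window: installed state is
// re-checked under the lock, so a route that onSessionDown withdrew between
// the snapshot and this call is not resurrected.
func (m *manager) reinstallIfStillInstalled(k kernelKey, add func() error) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.installed[k] {
		return nil // intentionally withdrawn; leave it out
	}
	return add() // e.g. the netlink RouteAdd call
}
```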

Diff Breakdown

| Category    | Files | Lines (+/-) | Net  |
|-------------|-------|-------------|------|
| Core logic  | 1     | +116 / -2   | +114 |
| Scaffolding | 2     | +30 / -13   | +17  |
| Tests       | 1     | +124 / -0   | +124 |
| Docs        | 1     | +2 / -0     | +2   |
| Total       | 5     | +272 / -15  | +257 |

The bulk of the change is the reconciliation loop and its tests.


Testing Verification

  • Unit tests cover: reinstalling a missing route, skipping a route present in kernel, skipping an uninstalled route (active mode, session never went Up), and incrementing the install failure metric on RouteAdd error
  • Tests use a mock Netlinker to simulate kernel route state; the reconciliation ticker is set to time.Hour in tests so the background loop never fires while reconcileRoutes() is called directly
  • Full liveness test suite passes (40s, all existing tests unaffected)
  • go vet and go build clean
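
Roughly, the test pattern looks like this; mockNetlinker, newTestManager, markInstalled, and testRoute are placeholder names, not the suite's real helpers:

```go
func TestReconcileReinstallsMissingRoute(t *testing.T) {
	nl := &mockNetlinker{} // simulated kernel with no routes present
	// The reconcile ticker is set to time.Hour so the background goroutine
	// never fires mid-test; reconciliation is driven synchronously below.
	m := newTestManager(t, nl, time.Hour)
	m.markInstalled(testRoute) // manager believes this route is in the kernel

	m.reconcileRoutes() // should detect the missing route and reinstall it

	if got := len(nl.added); got != 1 {
		t.Fatalf("expected 1 reinstall via RouteAdd, got %d", got)
	}
}
```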

@nikw9944 nikw9944 marked this pull request as ready for review May 5, 2026 21:07
@nikw9944 nikw9944 force-pushed the nikw9944/doublezero-3669 branch from f31a780 to bd203a8 on May 5, 2026 21:37
nikw9944 added 4 commits May 5, 2026 21:49
Add a reconciliation loop to the liveness manager that periodically
scans the kernel routing table for missing BGP routes and reinstalls
them, mitigating connectivity loss caused by external processes
removing routes.

Also promote liveness session down logs from DEBUG to INFO for
passive/peer-passive modes so operators can see the full up/down
lifecycle.
Increment RouteInstallFailures counter when a reconciliation reinstall
fails, matching the observability pattern in onSessionUp. Also
pre-allocate the toCheck slice.
- Re-check installed state under lock before RouteAdd to prevent
  resurrecting routes intentionally withdrawn by onSessionDown
- Add SrcIP to kernel route lookup key for tighter matching in
  multi-interface setups
- Reject negative RouteReconcileInterval in Validate()
- Use named const for reconcile interval flag default
- Log when route reconciliation is enabled at startup
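
For reference, the Validate() guard and named default mentioned in the commit message might look roughly like this; the Config field name and error text are assumptions:

```go
// Sketch only; the actual flag wiring in doublezerod may differ.
const defaultRouteReconcileInterval = 30 * time.Second

func (c *Config) Validate() error {
	if c.RouteReconcileInterval < 0 {
		return fmt.Errorf("--route-liveness-reconcile-interval must be >= 0, got %v",
			c.RouteReconcileInterval)
	}
	return nil
}
```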
@nikw9944 nikw9944 force-pushed the nikw9944/doublezero-3669 branch from bd203a8 to 99d373a on May 5, 2026 21:50
Contributor Author

nikw9944 commented May 6, 2026

Route Reconciliation Performance Analysis

reconcileRoutes() runs every 30s and proceeds in 4 phases:

  1. Lock → snapshot → unlock: iterates installed ∩ desired into a local slice
  2. Netlink dump (no lock held): RouteListFiltered(FAMILY_V4, RTPROT_BGP) dumps all kernel BGP routes
  3. Build hash set + diff: builds map[kernelKey]struct{} from kernel routes, checks each installed route against it
  4. Reinstall missing: brief re-lock per missing route, then RouteAdd
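
Putting the phases together as a hedged sketch, reusing the assumed kernelKey/manager names from the sketch in the PR description; desired, dumpKernelBGPRoutes, keyOf, and routeAdd are likewise placeholders:

```go
func (m *manager) reconcileRoutes() {
	// Phase 1: snapshot installed ∩ desired under the lock.
	m.mu.Lock()
	toCheck := make([]kernelKey, 0, len(m.installed))
	for k := range m.installed {
		if m.desired[k] {
			toCheck = append(toCheck, k)
		}
	}
	m.mu.Unlock()

	// Phase 2: dump kernel BGP routes; no lock held across the syscall.
	kernelRoutes, err := dumpKernelBGPRoutes()
	if err != nil {
		return // the real code would log and count this
	}

	// Phase 3: build a hash set of kernel keys, then diff against it.
	inKernel := make(map[kernelKey]struct{}, len(kernelRoutes))
	for _, r := range kernelRoutes {
		inKernel[keyOf(r)] = struct{}{}
	}

	// Phase 4: re-lock briefly per missing route, then RouteAdd.
	for _, k := range toCheck {
		if _, ok := inKernel[k]; !ok {
			_ = m.reinstallIfStillInstalled(k, func() error { return m.routeAdd(k) })
		}
	}
}
```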

CPU cost per reconciliation cycle

| Routes    | Lock hold (step 1) | Netlink dump (step 2) | Map build + diff (step 3) | Total per cycle | Amortized over 30s |
|-----------|--------------------|-----------------------|---------------------------|-----------------|--------------------|
| 100       | ~1 μs              | ~0.5 ms               | ~50 μs                    | ~0.6 ms         | 0.002%             |
| 1,000     | ~5 μs              | ~3 ms                 | ~0.5 ms                   | ~4 ms           | 0.013%             |
| 1,000,000 | ~5 ms              | ~1-2 s                | ~300-500 ms               | ~2-3 s          | ~7-10%             |

Estimation methodology

Lock hold (step 1): map iteration over installed, checking membership in desired and appending matches to a slice. Go map iteration costs roughly 10ns/entry: 100 entries × 10ns = 1μs; 1,000 entries × 10ns = 10μs (the table uses ~5μs as a rough order-of-magnitude figure, since the append path is cheap). At 1M entries, iteration plus slice growth and memory allocation dominates: ~5ms.

Netlink dump (step 2): RouteListFiltered issues a single RTM_GETROUTE netlink dump; the kernel filters by protocol server-side and streams matching routes back. Each route message is ~100-200 bytes of netlink payload. At 100 routes that's ~10-20KB parsed by vishvananda/netlink — well under 1ms. At 1,000 routes, ~100-200KB, ~3ms. At 1M routes, ~100-200MB of netlink data to receive and deserialize — estimated 1-2s based on netlink socket throughput (~100-200MB/s for dump operations).
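
The dump itself, as named in the phase list, is a single vishvananda/netlink call; only the wrapper function name here is assumed:

```go
import (
	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

// dumpKernelBGPRoutes issues one RTM_GETROUTE dump; the kernel filters
// by protocol server-side and streams matching routes back.
func dumpKernelBGPRoutes() ([]netlink.Route, error) {
	return netlink.RouteListFiltered(
		netlink.FAMILY_V4,
		&netlink.Route{Protocol: unix.RTPROT_BGP},
		netlink.RT_FILTER_PROTOCOL,
	)
}
```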

Map build + diff (step 3): For each kernel route, we call IP.To4().String() on 3 fields (dst, nexthop, src) creating heap-allocated strings, then insert into map[kernelKey]struct{}. String conversion + map insert is ~200-500ns/route. At 100 routes: ~50μs. At 1,000: ~500μs. At 1M: ~300-500ms (dominated by allocations and GC pressure from ~3M short-lived strings). The diff itself is O(installed) map lookups at O(1) each — negligible relative to the build.
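
The per-route key build described here would look roughly like the following (keyOf as assumed above; nil checks omitted):

```go
// Three IP-to-string conversions per kernel route, each allocating a
// short-lived string; at 1M routes that is ~3M allocations per cycle.
func keyOf(r netlink.Route) kernelKey {
	return kernelKey{
		table:   r.Table,
		dst:     r.Dst.String(), // destination prefix, e.g. "10.1.2.0/24"
		nexthop: r.Gw.To4().String(),
		src:     r.Src.To4().String(),
	}
}
```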

Amortized CPU: total_per_cycle / 30s. E.g., at 1,000 routes: 4ms / 30,000ms = 0.013%.

Lock contention with HandleRx

The lock is not held during the expensive netlink syscall (step 2). The snapshot in step 1 holds m.mu only during map iteration. HandleRx (the hot path) also holds m.mu for its duration, so these can contend — but at 100-1000 routes, a 1-5μs delay is negligible compared to the 50ms-3s liveness TX intervals. At 1M routes the ~5ms lock hold becomes measurable but is still small relative to liveness timers.

Practical impact on doublezerod CPU usage

Given a ~3% baseline CPU on a modern x86 core, this change adds effectively zero overhead at realistic route counts (low hundreds). The 1M route case is pathological for a doublezerod client and would have other scaling bottlenecks (BGP convergence, session state memory, netlink install throughput) long before reconciliation matters.


Development

Successfully merging this pull request may close these issues.

Route installed by doublezerod removed by unknown process
