doublezerod: add periodic kernel route reconciliation #3672
Add a reconciliation loop to the liveness manager that periodically scans the kernel routing table for missing BGP routes and reinstalls them, mitigating connectivity loss caused by external processes removing routes. Also promote liveness session down logs from DEBUG to INFO for passive/peer-passive modes so operators can see the full up/down lifecycle.
Increment RouteInstallFailures counter when a reconciliation reinstall fails, matching the observability pattern in onSessionUp. Also pre-allocate the toCheck slice.
- Re-check installed state under lock before RouteAdd to prevent resurrecting routes intentionally withdrawn by onSessionDown
- Add SrcIP to kernel route lookup key for tighter matching in multi-interface setups
- Reject negative RouteReconcileInterval in Validate()
- Use named const for reconcile interval flag default
- Log when route reconciliation is enabled at startup
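The re-check guard and the source-aware lookup key described in these commits can be sketched as follows. This is a minimal illustration, not the PR's code: `routeKey`, `manager`, `reinstallMissing`, `withdraw`, and `failAdd` are hypothetical names, the counters are plain integers standing in for Prometheus counters, and the sketch holds the lock across the add call for simplicity, which may differ from what `reconcileRoutes` actually does.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// routeKey identifies a route (hypothetical type). Including Src keeps routes
// that share (Table, Dst, NextHop) but differ in source IP distinct.
type routeKey struct {
	Table   int
	Dst     string
	NextHop string
	Src     string
}

// manager is a hypothetical stand-in for the liveness manager.
type manager struct {
	mu              sync.Mutex
	installed       map[routeKey]bool // routes the manager believes are installed
	reinstalls      int               // stand-in for a reinstall counter
	installFailures int               // stand-in for an install-failure counter
}

// withdraw models a session-down handler removing a route on purpose.
func (m *manager) withdraw(k routeKey) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.installed, k)
}

// reinstallMissing re-adds routes found missing from the kernel. Each route's
// installed state is re-checked under the lock first, so a route withdrawn
// after the snapshot was taken is skipped instead of being resurrected.
func (m *manager) reinstallMissing(missing []routeKey, add func(routeKey) error) {
	for _, k := range missing {
		m.mu.Lock()
		if !m.installed[k] {
			m.mu.Unlock()
			continue // withdrawn since the snapshot: leave it alone
		}
		if err := add(k); err != nil {
			m.installFailures++
		} else {
			m.reinstalls++
		}
		m.mu.Unlock()
	}
}

// failAdd simulates a failing route add.
func failAdd(routeKey) error { return errors.New("simulated add failure") }

func main() {
	a := routeKey{Table: 254, Dst: "10.0.0.0/24", NextHop: "192.0.2.1", Src: "198.51.100.1"}
	b := routeKey{Table: 254, Dst: "10.0.0.0/24", NextHop: "192.0.2.1", Src: "198.51.100.2"}
	m := &manager{installed: map[routeKey]bool{a: true, b: true}}

	missing := []routeKey{a, b} // snapshot says both are missing from the kernel
	m.withdraw(b)               // b is withdrawn before the reinstall runs

	var added []routeKey
	m.reinstallMissing(missing, func(k routeKey) error {
		added = append(added, k)
		return nil
	})
	m.reinstallMissing([]routeKey{a}, failAdd)
	fmt.Println("added:", len(added), "reinstalls:", m.reinstalls, "failures:", m.installFailures)
}
```

Because `routeKey` is a comparable struct, it can be used directly as a map key, and the two routes above stay separate entries even though only their source IPs differ.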
Route Reconciliation Performance Analysis
CPU cost per reconciliation cycle
Estimation methodology
- Lock hold (step 1): Map iteration over
- Netlink dump (step 2):
- Map build + diff (step 3): For each kernel route, we call
- Amortized CPU:
Lock contention with HandleRx
The lock is not held during the expensive netlink syscall (step 2). The snapshot in step 1 holds
Practical impact on doublezerod CPU usage
Given a ~3% baseline CPU on a modern x86 core, this change adds effectively zero overhead at realistic route counts (low hundreds). The 1M route case is pathological for a doublezerod client and would have other scaling bottlenecks (BGP convergence, session state memory, netlink install throughput) long before reconciliation matters.
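The three steps in the analysis (snapshot under lock, netlink dump without the lock, map build and diff) can be sketched as below. This is an illustration under assumptions rather than the PR's implementation: `route`, `manager`, `snapshot`, and `findMissing` are hypothetical names, and the `dump` callback stands in for the real netlink route dump.

```go
package main

import (
	"fmt"
	"sync"
)

// route is a hypothetical comparable route identity.
type route struct {
	Table             int
	Dst, NextHop, Src string
}

// manager is a hypothetical stand-in for the liveness manager.
type manager struct {
	mu        sync.Mutex
	installed map[route]struct{}
}

// snapshot is step 1: copy the installed set while holding the lock.
// The slice is pre-allocated so the lock is held only for one map iteration.
func (m *manager) snapshot() []route {
	m.mu.Lock()
	defer m.mu.Unlock()
	toCheck := make([]route, 0, len(m.installed))
	for r := range m.installed {
		toCheck = append(toCheck, r)
	}
	return toCheck
}

// findMissing runs steps 1 to 3 and returns the installed routes that the
// kernel dump does not contain. dump is called without the lock held.
func (m *manager) findMissing(dump func() ([]route, error)) ([]route, error) {
	toCheck := m.snapshot() // step 1
	if len(toCheck) == 0 {
		return nil, nil
	}
	kernel, err := dump() // step 2: the expensive call, lock released
	if err != nil {
		return nil, err
	}
	present := make(map[route]struct{}, len(kernel)) // step 3: build set, diff
	for _, r := range kernel {
		present[r] = struct{}{}
	}
	var missing []route
	for _, r := range toCheck {
		if _, ok := present[r]; !ok {
			missing = append(missing, r)
		}
	}
	return missing, nil
}

func main() {
	r1 := route{Table: 254, Dst: "10.0.0.0/24", NextHop: "192.0.2.1", Src: "198.51.100.1"}
	r2 := route{Table: 254, Dst: "10.0.1.0/24", NextHop: "192.0.2.1", Src: "198.51.100.1"}
	m := &manager{installed: map[route]struct{}{r1: {}, r2: {}}}

	missing, err := m.findMissing(func() ([]route, error) {
		// The kernel only has r1; r2 was removed by some external process.
		return []route{r1}, nil
	})
	fmt.Println("missing:", missing, "err:", err)
}
```

With this shape, per-cycle work is linear in the number of installed routes plus kernel routes, and other users of the lock wait only for the snapshot copy rather than for the dump.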
Resolves: #3669
Summary of Changes
- Add a periodic route reconciliation loop (interval set by `--route-liveness-reconcile-interval`) that detects BGP routes that should already be installed but are missing, and reinstalls them
- Track reinstalls and failures with the `doublezero_liveness_route_reinstalls_total` and `doublezero_liveness_route_install_failures_total` Prometheus metrics
- Re-check `installed` state under lock before each reinstall so `reconcileRoutes` cannot resurrect a route that `onSessionDown` intentionally withdrew between snapshot and reinstall
- Include the source IP in the kernel route lookup key so routes with the same `(table, dst, nexthop)` but different source IPs are matched independently in multi-interface setups
Diff Breakdown
Bulk of the change is the reconciliation loop and its tests.
Key files
- `client/doublezerod/internal/liveness/manager.go`: `reconcileRoutes()` implementation with TOCTOU guard and src-aware kernel key, config field + validation, goroutine launch, startup log, Debug→Info log level change
- `client/doublezerod/internal/liveness/manager_test.go`: unit tests for route reconciliation (missing route reinstall, present route skip, uninstalled route skip, install failure metric)
- `client/doublezerod/internal/liveness/metrics.go`: `RouteReinstalls` counter and `routeReinstall` helper
- `client/doublezerod/cmd/doublezerod/main.go`: `--route-liveness-reconcile-interval` flag wiring with named default const
Testing Verification
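- Unit tests cover missing route reinstall, present route skip, uninstalled route skip, and the install failure metric on `RouteAdd` error
- Tests use a test double for `Netlinker` to simulate kernel route state; reconciliation ticker set to `time.Hour` in tests to prevent background interference while calling `reconcileRoutes()` directly
- `go vet` and `go build` clean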
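The interval validation and the "long ticker, direct call" testing approach can be sketched like this. All names here (`config`, `runReconciler`) are hypothetical, and treating a zero interval as "reconciliation disabled" is an assumption of this sketch; the PR only states that negative values are rejected.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// config is a hypothetical stand-in for the manager configuration.
type config struct {
	RouteReconcileInterval time.Duration
}

// Validate rejects a negative reconcile interval.
func (c config) Validate() error {
	if c.RouteReconcileInterval < 0 {
		return errors.New("route reconcile interval must not be negative")
	}
	return nil
}

// runReconciler calls reconcile on every tick until ctx is cancelled.
// Assumption: a zero interval means reconciliation is disabled.
func runReconciler(ctx context.Context, cfg config, reconcile func()) {
	if cfg.RouteReconcileInterval <= 0 {
		return
	}
	ticker := time.NewTicker(cfg.RouteReconcileInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			reconcile()
		}
	}
}

func main() {
	var calls atomic.Int64
	reconcile := func() { calls.Add(1) }

	// An hour-long interval keeps the background loop from firing during the
	// test, so only the direct call below is observed.
	cfg := config{RouteReconcileInterval: time.Hour}
	if err := cfg.Validate(); err != nil {
		panic(err)
	}

	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	go func() {
		defer close(done)
		runReconciler(ctx, cfg, reconcile)
	}()

	reconcile() // exercise the reconciliation logic directly
	cancel()
	<-done
	fmt.Println("reconcile calls:", calls.Load())
}
```

Calling the reconcile function directly keeps the test deterministic: the assertion does not depend on timer scheduling, and the loop itself is exercised only for clean startup and shutdown.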