Add exponential backoff to Fleet Desktop server communications #45623
Draft
sharon-fdm wants to merge 3 commits into
Fleet Desktop now backs off exponentially (with jitter) when receiving errors from the Fleet server, preventing request storms that can overwhelm the database. This addresses corrective action #3 from the #44816 postmortem, where expired-token polling without backoff caused a DB outage.

Changes:
- New `orbit/pkg/backoff` package: stateful tracker with exponential doubling, jitter, max cap, and per-path isolation
- Integrated into Desktop's ping loop and `checkToken` retry loop
- 13 unit tests covering all Oracle scenarios from #45553

Closes #45553
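The tracker behavior listed above (doubling, jitter, max cap, per-path isolation, reset on success) can be sketched roughly as follows. This is a minimal illustration, not the code in `orbit/pkg/backoff`: the `Tracker` type, its method names, the symmetric jitter, and the choice to double on the first failure are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// Tracker is a hypothetical stand-in for the PR's backoff tracker.
// It keeps one failure counter per request path, guarded by a mutex.
type Tracker struct {
	mu       sync.Mutex
	base     time.Duration
	max      time.Duration
	jitter   float64 // fraction of the interval, e.g. 0.1 for +/-10%
	failures map[string]int
}

func NewTracker(base, max time.Duration, jitter float64) *Tracker {
	return &Tracker{base: base, max: max, jitter: jitter, failures: make(map[string]int)}
}

// Failure records one more consecutive failure for path and returns
// how long to wait before the next attempt.
func (t *Tracker) Failure(path string) time.Duration {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.failures[path]++
	return t.interval(t.failures[path])
}

// Success clears the failure count for path and returns the base interval.
func (t *Tracker) Success(path string) time.Duration {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.failures, path)
	return t.base
}

// interval doubles the base once per failure, applies jitter, and caps at max.
func (t *Tracker) interval(n int) time.Duration {
	d := t.base
	for i := 0; i < n && d < t.max; i++ {
		d *= 2 // loop stops once d reaches max, so d cannot overflow
	}
	if t.jitter > 0 {
		d += time.Duration((rand.Float64()*2 - 1) * t.jitter * float64(d))
	}
	if d > t.max {
		d = t.max
	}
	return d
}

func main() {
	tr := NewTracker(500*time.Millisecond, 10*time.Second, 0.1)
	for i := 0; i < 6; i++ {
		fmt.Println("desktop path wait:", tr.Failure("/desktop"))
	}
	// A different path has its own counter and starts from the first step.
	fmt.Println("other path wait:", tr.Failure("/other"))
	fmt.Println("after success:", tr.Success("/desktop"))
}
```

Keeping the counter per path means a failing endpoint does not slow down requests to a healthy one, which is one plausible reading of "per-path isolation" in the description.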
This was referenced May 15, 2026
Three new tests that verify backoff behavior with actual time.Ticker instances and wall-clock measurements, matching how Desktop uses the tracker in its polling loop:
- TestTickerIntegration: 4 consecutive failures with measured growing intervals, then success resets to base
- TestTickerIntegrationMaxCap: verifies ticker caps at maxBackoff
- TestMultipleTrackersWithTickers: per-path isolation with real tickers
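For readers unfamiliar with the pattern, a polling loop can apply a backoff interval to a live `time.Ticker` by calling `Reset` after each attempt. The sketch below is only an illustration of that idea; `nextInterval`, `failFirst`, and `poll` are hypothetical helpers, and the actual Desktop loop may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// nextInterval doubles cur, capping at max (hypothetical helper, no jitter).
func nextInterval(cur, max time.Duration) time.Duration {
	if cur >= max/2 {
		return max
	}
	return cur * 2
}

// failFirst returns a ping function that fails its first n calls, then succeeds.
func failFirst(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New("server error")
		}
		return nil
	}
}

// poll waits on a ticker for the given number of ticks, calls ping on each
// tick, and resets the ticker to the backed-off (or base) interval.
// It returns the interval chosen after each tick.
func poll(ping func() error, base, max time.Duration, ticks int) []time.Duration {
	interval := base
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	var chosen []time.Duration
	for i := 0; i < ticks; i++ {
		<-ticker.C
		if err := ping(); err != nil {
			interval = nextInterval(interval, max) // failure: slow down
		} else {
			interval = base // success: return to the normal rate
		}
		ticker.Reset(interval)
		chosen = append(chosen, interval)
	}
	return chosen
}

func main() {
	// Three failures followed by two successes, with short intervals for a quick demo.
	fmt.Println(poll(failFirst(3), 10*time.Millisecond, 40*time.Millisecond, 5))
}
```

Short millisecond intervals keep the demo fast; a wall-clock test in the style described above would additionally measure the time between ticks.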
Three tests that verify backoff against real HTTP servers with togglable error responses, connection-refused scenarios, and measured wall-clock timing:
- TestManualBackoffAgainstHTTPServer: full lifecycle (healthy -> 401 errors -> recovery -> 500 errors -> recovery) with real HTTP round-trips
- TestManualBackoffServerDown: connection-refused backoff
- TestManualBackoffMaxCapWithRealServer: continuous 401s until cap

Also manually tested against a live Fleet server:
- Invalid token -> 401 backoff: 500ms -> 1s -> 2.2s -> 4s -> 8s -> 10s (cap)
- Server killed mid-poll -> seamless transition to network-error backoff
- Recovery -> instant reset to base interval
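A test server with a togglable response status can be built from the standard `net/http/httptest` package. The following is a rough sketch of that setup only; `newToggleServer` and `probe` are hypothetical names, and the PR's tests may wire this up differently.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// newToggleServer starts a TLS test server whose response status can be
// changed while the test is running by storing a new value in the returned
// atomic. (Hypothetical helper for illustration.)
func newToggleServer() (*httptest.Server, *atomic.Int32) {
	status := new(atomic.Int32)
	status.Store(http.StatusOK)
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(int(status.Load()))
	}))
	return srv, status
}

// probe performs one GET using the server's own client, which trusts the
// test certificate, and returns the HTTP status code.
func probe(srv *httptest.Server) (int, error) {
	resp, err := srv.Client().Get(srv.URL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	srv, status := newToggleServer()

	// Walk through the lifecycle: healthy -> 401 -> recovery -> 500 -> recovery.
	for _, code := range []int{200, 401, 200, 500, 200} {
		status.Store(int32(code))
		got, err := probe(srv)
		fmt.Println("want", code, "got", got, "err", err)
	}

	// Closing the server simulates the connection-refused case.
	srv.Close()
	_, err := probe(srv)
	fmt.Println("error after close:", err != nil)
}
```

A backoff test would treat both the non-2xx statuses and the post-close network error as failures, so the same interval growth applies across the transition.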
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #45623 +/- ##
==========================================
+ Coverage 66.73% 66.76% +0.02%
==========================================
Files 2732 2740 +8
Lines 218551 219009 +458
Branches 10840 10840
==========================================
+ Hits 145857 146211 +354
- Misses 59479 59573 +94
- Partials 13215 13225 +10
Closes #45624
Closes #45625
Summary
- `orbit/pkg/backoff` package: stateful exponential backoff tracker with jitter, thread-safe, per-path isolation
- Integrated into Desktop's ping loop and `checkToken` retry loop

Addresses corrective action #3 from the #44816 postmortem: Fleet Desktop kept polling `/device/{token}/desktop` at full rate with expired tokens, overwhelming the DB. With backoff, 250 hosts in this state produce ~0.14 req/s instead of 25 req/s (180x reduction).

Manual testing performed
1. Unit tests (19 total, all pass)
13 logic tests: exponential doubling, cap at max, jitter range, reset on success, per-path isolation, concurrent access, no-give-up, monotonically non-decreasing intervals.
3 real-time ticker tests: actual `time.Ticker` with wall-clock measurements verifying intervals grow during failures and reset on success, matching how Desktop uses the tracker.
3 HTTP server integration tests: real `httptest.TLSServer` toggled between 200/401/500 mid-test, plus connection-refused scenario. Verified full lifecycle:
2. Live Fleet server testing
Started Fleet dev server (`FLEET_MYSQL_PASSWORD=fleet ./build/fleet serve --dev --dev_license`), enrolled orbit, then ran test binary against real `/device/{token}/desktop` endpoint.
Invalid token (simulating #44816 expired-token scenario):
Server killed mid-poll (simulating server outage):
Seamless transition from HTTP errors to network errors -- same backoff behavior.
Recovery after server restart:
3. Build verification
Limitation
Fleet Desktop requires a GUI (systray) for full end-to-end testing with the tray icon. The backoff logic was tested via the test harness above against real Fleet endpoints. A manual QA test on a machine with a display should: kill orbit, watch Desktop logs for "ping failed, backing off" messages with increasing intervals, then restore orbit and verify "exiting backoff" appears.
Test plan
- `go test ./orbit/pkg/backoff/ -v` -- 19 tests pass
- `go build ./orbit/cmd/desktop/` -- compiles clean
- `make lint-go-incremental` -- 0 issues