Add exponential backoff to Fleet Desktop server communications#45623

Draft
sharon-fdm wants to merge 3 commits into main from worktree-agent-backoff

sharon-fdm (Collaborator) commented May 15, 2026

Closes #45624
Closes #45625

Summary

  • Adds orbit/pkg/backoff package: stateful exponential backoff tracker with jitter, thread-safe, per-path isolation
  • Integrates backoff into Fleet Desktop's ping loop and checkToken retry loop
  • On error: the interval doubles with each consecutive failure (10s, 20s, 40s, 80s, ...), capped at 30 minutes
  • On success: resets immediately to normal polling interval
  • Each communication path tracks its own backoff independently

Addresses corrective action #3 from the #44816 postmortem: Fleet Desktop kept polling /device/{token}/desktop at full rate with expired tokens, overwhelming the DB. With backoff, 250 hosts in this state produce ~0.14 req/s once they reach the 30-minute cap, instead of 25 req/s (a 180x reduction).

Manual testing performed

1. Unit tests (19 total, all pass)

go test ./orbit/pkg/backoff/ -v -count=1   # 19 tests, 0 failures
make lint-go-incremental                    # 0 issues

13 logic tests: exponential doubling, cap at max, jitter range, reset on success, per-path isolation, concurrent access, no-give-up, monotonically non-decreasing intervals.

3 real-time ticker tests: actual time.Ticker with wall-clock measurements verifying intervals grow during failures and reset on success, matching how Desktop uses the tracker.

3 HTTP server integration tests: a real httptest TLS server (httptest.NewTLSServer) toggled between 200/401/500 mid-test, plus a connection-refused scenario. Verified full lifecycle:

Phase 1 (healthy):    50ms, 50ms, 50ms
Phase 2 (401 errors): 51ms -> 109ms -> 216ms -> 448ms -> 508ms (cap)
Phase 3 (recovery):   instant reset to 50ms
Phase 4 (500 errors): 106ms -> 217ms -> 428ms
Phase 5 (recovery):   back to 50ms

2. Live Fleet server testing

Started Fleet dev server (FLEET_MYSQL_PASSWORD=fleet ./build/fleet serve --dev --dev_license), enrolled orbit, then ran test binary against real /device/{token}/desktop endpoint.

Invalid token (simulating #44816 expired-token scenario):

Attempt 1: HTTP 401, next=1.013s, failures=1
Attempt 2: HTTP 401, next=2.184s, failures=2
Attempt 3: HTTP 401, next=4.063s, failures=3
Attempt 4: HTTP 401, next=8.086s, failures=4
Attempt 5: HTTP 401, next=10s,    failures=5  (capped)
Attempt 6: HTTP 401, next=10s,    failures=6  (stays capped)

Server killed mid-poll (simulating server outage):

Attempt 1: HTTP 401,      next=1.042s, failures=1
Attempt 2: NETWORK ERROR, next=2.139s, failures=2  <-- server killed here
Attempt 3: NETWORK ERROR, next=4.154s, failures=3
Attempt 4: NETWORK ERROR, next=8s,     failures=4  (capped)

Seamless transition from HTTP errors to network errors -- same backoff behavior.

Recovery after server restart:

After success: interval=500ms, in_backoff=false, failures=0  <-- instant reset

3. Build verification

go build ./orbit/cmd/desktop/   # compiles clean
go build ./orbit/cmd/orbit/     # compiles clean

Limitation

Fleet Desktop requires a GUI (systray) for full end-to-end testing with the tray icon. The backoff logic was tested via the test harness above against real Fleet endpoints. A manual QA test on a machine with a display should: kill orbit, watch Desktop logs for "ping failed, backing off" messages with increasing intervals, then restore orbit and verify "exiting backoff" appears.

Test plan

  • go test ./orbit/pkg/backoff/ -v -- 19 tests pass
  • go build ./orbit/cmd/desktop/ -- compiles clean
  • make lint-go-incremental -- 0 issues
  • Manual: tested against live Fleet server with invalid tokens (401 backoff verified)
  • Manual: tested server-down scenario (network error backoff verified)
  • Manual: tested recovery (instant reset to base interval verified)
  • Manual QA: enroll orbit + desktop on a display machine, kill orbit, verify Desktop logs show increasing backoff intervals, restore orbit, verify reset

Fleet Desktop now backs off exponentially (with jitter) when receiving
errors from the Fleet server, preventing request storms that can
overwhelm the database. This addresses corrective action #3 from the
#44816 postmortem, where expired-token polling without backoff caused
a DB outage.

Changes:
- New `orbit/pkg/backoff` package: stateful tracker with exponential
  doubling, jitter, max cap, and per-path isolation
- Integrated into Desktop's ping loop and checkToken retry loop
- 13 unit tests covering all Oracle scenarios from #45553

Closes #45553

Three new tests that verify backoff behavior with actual time.Ticker
instances and wall-clock measurements, matching how Desktop uses the
tracker in its polling loop:

- TestTickerIntegration: 4 consecutive failures with measured growing
  intervals, then success resets to base
- TestTickerIntegrationMaxCap: verifies ticker caps at maxBackoff
- TestMultipleTrackersWithTickers: per-path isolation with real tickers

Three tests that verify backoff against real HTTP servers with
togglable error responses, connection-refused scenarios, and
measured wall-clock timing:

- TestManualBackoffAgainstHTTPServer: full lifecycle (healthy ->
  401 errors -> recovery -> 500 errors -> recovery) with real
  HTTP round-trips
- TestManualBackoffServerDown: connection-refused backoff
- TestManualBackoffMaxCapWithRealServer: continuous 401s until cap

Also manually tested against a live Fleet server:
- Invalid token -> 401 backoff: 500ms -> 1s -> 2.2s -> 4s -> 8s -> 10s (cap)
- Server killed mid-poll -> seamless transition to network-error backoff
- Recovery -> instant reset to base interval

codecov Bot commented May 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 66.76%. Comparing base (4c29a7a) to head (8383051).
⚠️ Report is 127 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #45623      +/-   ##
==========================================
+ Coverage   66.73%   66.76%   +0.02%     
==========================================
  Files        2732     2740       +8     
  Lines      218551   219009     +458     
  Branches    10840    10840              
==========================================
+ Hits       145857   146211     +354     
- Misses      59479    59573      +94     
- Partials    13215    13225      +10     
Flag Coverage Δ
backend 68.59% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Development

Successfully merging this pull request may close these issues.

  • [Backoff PR2] Apply backoff to Fleet Desktop polling
  • [Backoff PR1] Shared backoff package (orbit/pkg/backoff)
