feat: add retry backoff for connection errors in retryOperation by leshniak · Pull Request #775 · Expensify/react-native-onyx

leshniak · 2026-04-17T11:59:46Z

Details

Add exponential backoff with jitter to OnyxUtils.retryOperation for non-capacity storage errors, framed as an instrumented experiment to determine the right mitigation strategy.

Context: Connection-class errors — Chromium backing store failures (26.3%), WebKit connection drops (19.0%), and closing-database errors (6.4%) — account for 51.7% of all storage failures (investigation). Analysis of 7-day production logs via VictoriaLogs shows:

0% recovery rate across 5 immediate retries — all attempts complete within the same millisecond
100% retry exhaustion — 67,404 out of ~67,400 initial failures exhaust all retries
Users continue writing successfully to other keys in the same session (e.g. one user: 8,920 exhaustions alongside 28,206 successful Onyx ops)
IDB enters a degraded state rather than failing completely — successful writes interleaved with failures in the same request

We have no data on whether introducing a delay (100ms-1600ms) would allow recovery. This PR adds backoff as a low-risk experiment to collect that data.

Changes:

lib/OnyxUtils.ts: Added CONNECTION_ERRORS constants (IDB + SQLite), backoff config (RETRY_BASE_DELAY_MS=100, RETRY_JITTER_FACTOR=0.25), wait()/getRetryDelay() helpers, wired backoff into non-capacity error branch of retryOperation
Backoff schedule: 100ms → 200ms → 400ms → 800ms → 1600ms (±25% jitter, ~3.1s total max)
Capacity errors (QuotaExceeded, disk full) keep immediate retry with eviction — unchanged
Observability for experiment measurement:
- Connection errors log retry attempts with delay duration (Connection error detected, retrying with backoff)
- Successful recovery after backoff is logged with attempt number (Connection error recovered after backoff on attempt N/5)
- Connection error exhaustion gets a distinct log message from generic failures (Connection error exhausted all retries with backoff)

Measurement plan (discussion):

Recovery rate: recovered / detected — core metric, baseline is 0%
Recovery distribution by attempt: which delay tier recoveries happen on most
Timeline: 1 week production data (~67k failures/week volume)
Rollback if: recovery stays ~0%, any regression in Onyx op success rate, or unexpected latency
Success if: recovery rate meaningfully above 0% → tune constants

Next steps based on production data:

If recovery rate at 100-1600ms delays improves → tune constants
If recovery rate remains ~0% → pivot to fail-fast (fewer retries + error propagation) or reconnection strategy (close/reopen IDB before retry)

Related Issues

Expensify/App#87782

Automated Tests

Updated 2 existing retry tests to use fake timers (backoff delays require timer advancement). Added 5 new tests:

should apply exponential backoff delay for non-capacity errors — verifies delay count and exponential growth pattern
should log connection error with backoff delay info — verifies connection-specific log message
should log recovery when connection error succeeds after backoff — verifies recovery log fires on successful retry
should log connection-specific exhaustion message when all retries fail — verifies distinct exhaustion log for connection errors
should NOT apply backoff delay for capacity errors (immediate retry with eviction) — verifies capacity errors remain immediate

All 439 tests pass.

Manual Tests

Verify npm run typecheck passes
Verify npm run lint passes
Verify npm test passes (439/439)
Integrate with Expensify/App and verify storage operations still work correctly on all platforms