
feat: add retry backoff for connection errors in retryOperation #775

Open
leshniak wants to merge 8 commits into Expensify:main from callstack-internal:feat/retry-backoff-connection-errors

Contributor

@leshniak leshniak commented Apr 17, 2026

Details

Add exponential backoff with jitter to OnyxUtils.retryOperation for non-capacity storage errors, framed as an instrumented experiment to determine the right mitigation strategy.

Context: Connection-class errors — Chromium backing store failures (26.3%), WebKit connection drops (19.0%), and closing-database errors (6.4%) — account for 51.7% of all storage failures (investigation). Analysis of 7-day production logs via VictoriaLogs shows:

  • 0% recovery rate across 5 immediate retries — all attempts complete within the same millisecond
  • 100% retry exhaustion — 67,404 out of ~67,400 initial failures exhaust all retries
  • Users continue writing successfully to other keys in the same session (e.g. one user: 8,920 exhaustions alongside 28,206 successful Onyx ops)
  • IDB enters a degraded state rather than failing completely — successful writes interleaved with failures in the same request

We have no data on whether introducing a delay (100ms-1600ms) would allow recovery. This PR adds backoff as a low-risk experiment to collect that data.

Changes:

  • lib/OnyxUtils.ts: Added CONNECTION_ERRORS constants (IDB + SQLite), backoff config (RETRY_BASE_DELAY_MS=100, RETRY_JITTER_FACTOR=0.25), wait()/getRetryDelay() helpers, wired backoff into non-capacity error branch of retryOperation
  • Backoff schedule: 100ms → 200ms → 400ms → 800ms → 1600ms (±25% jitter; ~3.1s nominal total, up to ~3.9s if every delay takes the maximum jitter)
  • Capacity errors (QuotaExceeded, disk full) keep immediate retry with eviction — unchanged
  • Observability for experiment measurement:
    • Connection errors log retry attempts with delay duration (Connection error detected, retrying with backoff)
    • Successful recovery after backoff is logged with attempt number (Connection error recovered after backoff on attempt N/5)
    • Connection error exhaustion gets a distinct log message from generic failures (Connection error exhausted all retries with backoff)
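
The backoff schedule described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the constant names and values come from the description, while the body of getRetryDelay and its injectable random parameter are assumptions made for the example.

```typescript
// Values quoted in the PR description.
const RETRY_BASE_DELAY_MS = 100;
const RETRY_JITTER_FACTOR = 0.25;
const MAX_STORAGE_OPERATION_RETRY_ATTEMPTS = 5;

/**
 * Delay before retry number `attempt` (0-based): base * 2^attempt, scaled by a
 * random multiplier in [1 - jitter, 1 + jitter].
 * ASSUMPTION: the `random` parameter is hypothetical, added so the sketch can be
 * evaluated deterministically.
 */
function getRetryDelay(attempt: number, random: () => number = Math.random): number {
    const exponentialDelay = RETRY_BASE_DELAY_MS * 2 ** attempt;
    const jitterMultiplier = 1 + RETRY_JITTER_FACTOR * (random() * 2 - 1);
    return Math.round(exponentialDelay * jitterMultiplier);
}

// Nominal schedule: random() === 0.5 gives a multiplier of exactly 1 (no jitter).
const nominalSchedule = Array.from({length: MAX_STORAGE_OPERATION_RETRY_ATTEMPTS}, (_, attempt) => getRetryDelay(attempt, () => 0.5));

// Sum of the nominal delays across all five retries.
const nominalTotalMs = nominalSchedule.reduce((sum, delay) => sum + delay, 0);
```

With jitter disabled the five delays sum to 3100 ms. Jitter spreads retries from different clients and tabs so they do not all hit the degraded store at the same instant.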

Measurement plan (discussion):

  • Recovery rate: recovered / detected — core metric, baseline is 0%
  • Recovery distribution by attempt: which delay tier recoveries happen on most
  • Timeline: 1 week production data (~67k failures/week volume)
  • Rollback if: recovery stays ~0%, any regression in Onyx op success rate, or unexpected latency
  • Success if: recovery rate meaningfully above 0% → tune constants
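
As a rough sketch of how the two core metrics could be computed from exported log lines: the message strings are the ones quoted in this description, but summarizeRecovery and its types are hypothetical names. It also assumes that every failing retry chain ends in exactly one "recovered" or one "exhausted" log, so the denominator is recovered + exhausted rather than the per-attempt "detected" logs.

```typescript
interface RecoverySummary {
    recovered: number;
    exhausted: number;
    recoveryRate: number;
    recoveriesByAttempt: Record<number, number>;
}

// Log messages as quoted in the PR description.
const RECOVERED_PATTERN = /Connection error recovered after backoff on attempt (\d+)\/\d+/;
const EXHAUSTED_MESSAGE = 'Connection error exhausted all retries with backoff';

/**
 * ASSUMPTION: each failing chain emits exactly one terminal log line
 * (recovered or exhausted), so chains = recovered + exhausted.
 */
function summarizeRecovery(logLines: string[]): RecoverySummary {
    const recoveriesByAttempt: Record<number, number> = {};
    let recovered = 0;
    let exhausted = 0;

    for (const line of logLines) {
        const match = RECOVERED_PATTERN.exec(line);
        if (match) {
            // Count the recovery and bucket it by the attempt it happened on.
            const attempt = Number(match[1]);
            recoveriesByAttempt[attempt] = (recoveriesByAttempt[attempt] ?? 0) + 1;
            recovered += 1;
        } else if (line.includes(EXHAUSTED_MESSAGE)) {
            exhausted += 1;
        }
    }

    const chains = recovered + exhausted;
    return {
        recovered,
        exhausted,
        recoveryRate: chains === 0 ? 0 : recovered / chains,
        recoveriesByAttempt,
    };
}
```

The recoveriesByAttempt map gives the distribution by delay tier, which is the input needed for tuning the constants if recovery turns out to be meaningfully above 0%.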

Next steps based on production data:

  • If recovery rate at 100-1600ms delays improves → tune constants
  • If recovery rate remains ~0% → pivot to fail-fast (fewer retries + error propagation) or reconnection strategy (close/reopen IDB before retry)

Related Issues

Expensify/App#87782

Automated Tests

Updated 2 existing retry tests to use fake timers (backoff delays require timer advancement). Added 5 new tests:

  • should apply exponential backoff delay for non-capacity errors — verifies delay count and exponential growth pattern
  • should log connection error with backoff delay info — verifies connection-specific log message
  • should log recovery when connection error succeeds after backoff — verifies recovery log fires on successful retry
  • should log connection-specific exhaustion message when all retries fail — verifies distinct exhaustion log for connection errors
  • should NOT apply backoff delay for capacity errors (immediate retry with eviction) — verifies capacity errors remain immediate

All 439 tests pass.

Manual Tests

  1. Verify npm run typecheck passes
  2. Verify npm run lint passes
  3. Verify npm test passes (439/439)
  4. Integrate with Expensify/App and verify storage operations still work correctly on all platforms

Author Checklist

  • I linked the correct issue in the ### Related Issues section above
  • I wrote clear testing steps that cover the changes made in this PR
    • I added steps for local testing in the Tests section
    • I tested this PR with a High Traffic account against the staging or production API to ensure there are no regressions (e.g. long loading states that impact usability).
  • I included screenshots or videos for tests on all platforms
  • I ran the tests on all platforms & verified they passed on:
    • Android / native
    • Android / Chrome
    • iOS / native
    • iOS / Safari
    • MacOS / Chrome / Safari
  • I verified there are no console errors (if there's a console error not related to the PR, report it or open an issue for it to be fixed)
  • I followed proper code patterns (see Reviewing the code)
    • I verified that any callback methods that were added or modified are named for what the method does and never what callback they handle (i.e. toggleReport and not onIconClick)
    • I verified that the left part of a conditional rendering a React component is a boolean and NOT a string, e.g. myBool && <MyComponent />.
    • I verified that comments were added to code that is not self explanatory
    • I verified that any new or modified comments were clear, correct English, and explained why the code was doing something instead of only explaining what the code was doing.
    • I verified proper file naming conventions were followed for any new files or renamed files. All non-platform specific files are named after what they export and are not named index.js. All platform-specific files are named for the platform the code supports as outlined in the README.
    • I verified the JSDocs style guidelines (in STYLE.md) were followed
  • If a new code pattern is added I verified it was agreed to be used by multiple Expensify engineers
  • I followed the guidelines as stated in the Review Guidelines
  • I tested other components that can be impacted by my changes (i.e. if the PR modifies a shared library or component like Avatar, I verified the components using Avatar are working as expected)
  • I verified all code is DRY (the PR doesn't include any logic written more than once, with the exception of tests)
  • I verified any variables that can be defined as constants (ie. in CONST.js or at the top of the file that uses the constant) are defined as such
  • I verified that if a function's arguments changed that all usages have also been updated correctly
  • If the main branch was merged into this PR after a review, I tested again and verified the outcome was still expected according to the Test steps.
  • I have checked off every checkbox in the PR author checklist, including those that don't apply to this PR.

Screenshots/Videos

Android: Native

N/A — library-level change, no UI

Android: mWeb Chrome

N/A — library-level change, no UI

iOS: Native

N/A — library-level change, no UI

iOS: mWeb Safari

N/A — library-level change, no UI

MacOS: Chrome / Safari

N/A — library-level change, no UI

leshniak and others added 4 commits April 16, 2026 22:34
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Errors like lost IDB connections, closing databases, and backing store
failures now wait with exponential backoff (100ms * 2^attempt +/- 25%
jitter) before retrying, giving the DB connection time to recover.

Capacity errors (QuotaExceeded, disk full) keep immediate retry with
eviction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Existing retry tests now use fake timers to handle backoff delays.
New tests verify: exponential delay progression, connection error
logging, and capacity errors remaining immediate.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Recovery and exhaustion logs needed to measure whether backoff actually
helps connection errors recover. Without these, we can only see retry
attempts but not outcomes — making the experiment unmeasurable.

- Log recovery success with attempt number when connection error resolves after backoff
- Classify exhaustion logs: connection errors get distinct message from generic failures
- Extract isConnectionError earlier so both exhaustion and retry paths can use it
- Add 2 new tests for recovery and exhaustion logging

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
leshniak and others added 3 commits May 1, 2026 20:44
- Wrap onyxMethod return with Promise.resolve() to handle undefined returns
  (fixes perf-test TypeError: Cannot read properties of undefined)
- Use early-return guard for non-connection errors (fixes prefer-early-return lint)
- Rebuild API-INTERNAL.md (fixes verify check)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The directive was on the wrong line after restructuring — TS saw it as
unused since the actual type error is inside the .then() callback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@leshniak leshniak marked this pull request as ready for review May 1, 2026 18:52
@leshniak leshniak requested a review from a team as a code owner May 1, 2026 18:52
@melvin-bot melvin-bot Bot requested review from cristipaval and removed request for a team May 1, 2026 18:53

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc0b381ddb


Comment thread lib/OnyxUtils.ts
Comment on lines +856 to +860
Promise.resolve(onyxMethod(defaultParams, nextRetryAttempt)).then(() => {
    if (!isConnectionError) {
        return;
    }
    Logger.logInfo(`Connection error recovered after backoff on attempt ${nextRetryAttempt}/${MAX_STORAGE_OPERATION_RETRY_ATTEMPTS}. onyxMethod: ${onyxMethod.name}.`);

P1: Avoid logging recovery when connection retries still exhaust

The new recovery log is emitted whenever onyxMethod(defaultParams, nextRetryAttempt) resolves, but retryOperation() itself resolves even after exhausting all retries, so a permanently failing connection error will still produce Connection error recovered... before/alongside the exhaustion alert. In practice this creates false-positive recovery telemetry for the experiment and can mislead rollout decisions, because exhausted retry chains are counted as successful recoveries.


Contributor


This sounds valid in the case where we fail multiple times rather than just once?
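
One way to picture the concern in this thread: if the retry chain resolves even after exhaustion, a bare .then() cannot tell recovery from failure. The sketch below is not the PR's code. writeWithRetry and RetryOutcome are hypothetical names, the chain is written as a loop rather than the recursive retryOperation, and backoff delays are omitted. It shows the chain reporting an explicit outcome so the recovery log fires only when a retried write actually succeeded.

```typescript
type RetryOutcome = 'succeeded' | 'exhausted';

const MAX_STORAGE_OPERATION_RETRY_ATTEMPTS = 5;

/**
 * Hypothetical stand-in for a storage write plus its retries. Like the
 * behaviour described in the review comment, it never rejects, but it returns
 * an explicit outcome instead of resolving identically in both cases.
 */
async function writeWithRetry(
    write: () => Promise<void>,
    log: (message: string) => void = console.log,
): Promise<RetryOutcome> {
    // Attempt 0 is the initial write; attempts 1..5 are retries.
    for (let attempt = 0; attempt <= MAX_STORAGE_OPERATION_RETRY_ATTEMPTS; attempt++) {
        let succeeded = false;
        try {
            await write();
            succeeded = true;
        } catch {
            // Swallow the error and fall through to the next attempt (delay omitted).
        }

        if (succeeded) {
            // Only a retry that really succeeded counts as a recovery.
            if (attempt > 0) {
                log(`Connection error recovered after backoff on attempt ${attempt}/${MAX_STORAGE_OPERATION_RETRY_ATTEMPTS}`);
            }
            return 'succeeded';
        }
    }

    log('Connection error exhausted all retries with backoff');
    return 'exhausted';
}
```

With this shape, a permanently failing write produces only the exhaustion message, so exhausted chains are not counted as recoveries in the experiment's telemetry.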

@cristipaval cristipaval requested a review from Julesssss May 1, 2026 20:03
@Julesssss
Contributor

Reviewer Checklist

  • I have verified the author checklist is complete (all boxes are checked off).
  • I verified the correct issue is linked in the ### Fixed Issues section above
  • I verified testing steps are clear and they cover the changes made in this PR
    • I verified the steps for local testing are in the Tests section
    • I verified the steps for Staging and/or Production testing are in the QA steps section
    • I verified the steps cover any possible failure scenarios (i.e. verify an input displays the correct error message if the entered data is not correct)
    • I turned off my network connection and tested it while offline to ensure it matches the expected behavior (i.e. verify the default avatar icon is displayed if app is offline)
  • I checked that screenshots or videos are included for tests on all platforms
  • I included screenshots or videos for tests on all platforms
  • I verified that the composer does not automatically focus or open the keyboard on mobile unless explicitly intended. This includes checking that returning the app from the background does not unexpectedly open the keyboard.
  • I verified tests pass on all platforms & I tested again on:
    • Android: HybridApp
    • Android: mWeb Chrome
    • iOS: HybridApp
    • iOS: mWeb Safari
    • MacOS: Chrome / Safari
  • If there are any errors in the console that are unrelated to this PR, I either fixed them (preferred) or linked to where I reported them in Slack
  • I verified proper code patterns were followed (see Reviewing the code)
    • I verified that any callback methods that were added or modified are named for what the method does and never what callback they handle (i.e. toggleReport and not onIconClick).
    • I verified that comments were added to code that is not self explanatory
    • I verified that any new or modified comments were clear, correct English, and explained "why" the code was doing something instead of only explaining "what" the code was doing.
    • I verified any copy / text shown in the product is localized by adding it to src/languages/* files and using the translation method
    • I verified all numbers, amounts, dates and phone numbers shown in the product are using the localization methods
    • I verified any copy / text that was added to the app is grammatically correct in English. It adheres to proper capitalization guidelines (note: only the first word of header/labels should be capitalized), and is either coming verbatim from figma or has been approved by marketing (in order to get marketing approval, ask the Bug Zero team member to add the Waiting for copy label to the issue)
    • I verified proper file naming conventions were followed for any new files or renamed files. All non-platform specific files are named after what they export and are not named "index.js". All platform-specific files are named for the platform the code supports as outlined in the README.
    • I verified the JSDocs style guidelines (in STYLE.md) were followed
  • If a new code pattern is added I verified it was agreed to be used by multiple Expensify engineers
  • I verified that this PR follows the guidelines as stated in the Review Guidelines
  • I verified other components that can be impacted by these changes have been tested, and I retested again (i.e. if the PR modifies a shared library or component like Avatar, I verified the components using Avatar have been tested & I retested again)
  • I verified all code is DRY (the PR doesn't include any logic written more than once, with the exception of tests)
  • I verified any variables that can be defined as constants (ie. in CONST.ts or at the top of the file that uses the constant) are defined as such
  • If a new component is created I verified that:
    • A similar component doesn't exist in the codebase
    • All props are defined accurately and each prop has a /** comment above it */
    • The file is named correctly
    • The component has a clear name that is non-ambiguous and the purpose of the component can be inferred from the name alone
    • The only data being stored in the state is data necessary for rendering and nothing else
    • For Class Components, any internal methods passed to components event handlers are bound to this properly so there are no scoping issues (i.e. for onClick={this.submit} the method this.submit should be bound to this in the constructor)
    • Any internal methods bound to this are necessary to be bound (i.e. avoid this.submit = this.submit.bind(this); if this.submit is never passed to a component event handler like onClick)
    • All JSX used for rendering exists in the render method
    • The component has the minimum amount of code necessary for its purpose, and it is broken down into smaller components in order to separate concerns and functions
  • If any new file was added I verified that:
    • The file has a description of what it does and/or why is needed at the top of the file if the code is not self explanatory
  • If a new CSS style is added I verified that:
    • A similar style doesn't already exist
    • The style can't be created with an existing StyleUtils function (i.e. StyleUtils.getBackgroundAndBorderStyle(theme.componentBG)
  • If the PR modifies code that runs when editing or sending messages, I tested and verified there is no unexpected behavior for all supported markdown - URLs, single line code, code blocks, quotes, headings, bold, strikethrough, and italic.
  • If the PR modifies a generic component, I tested and verified that those changes do not break usages of that component in the rest of the App (i.e. if a shared library or component like Avatar is modified, I verified that Avatar is working as expected in all cases)
  • If the PR modifies a component related to any of the existing Storybook stories, I tested and verified all stories for that component are still working as expected.
  • If the PR modifies a component or page that can be accessed by a direct deeplink, I verified that the code functions as expected when the deeplink is used - from a logged in and logged out account.
  • If the PR modifies the UI (e.g. new buttons, new UI components, changing the padding/spacing/sizing, moving components, etc) or modifies the form input styles:
    • I verified that all the inputs inside a form are aligned with each other.
    • I added Design label and/or tagged @Expensify/design so the design team can review the changes.
  • If a new page is added, I verified it's using the ScrollView component to make it scrollable when more elements are added to the page.
  • For any bug fix or new feature in this PR, I verified that sufficient unit tests are included to prevent regressions in this flow.
  • If the main branch was merged into this PR after a review, I tested again and verified the outcome was still expected according to the Test steps.
  • I have checked off every checkbox in the PR reviewer checklist, including those that don't apply to this PR.

Screenshots/Videos

Android: HybridApp
Android: mWeb Chrome
iOS: HybridApp
iOS: mWeb Safari
MacOS: Chrome / Safari

Contributor

@Julesssss Julesssss left a comment


Mostly looking good. One bot comment to resolve.

  • Could you re-add the author checklist? I think it's missing some fields.
  • Also not sure what is going on here: Duration deviation of 200.86 ms (1535365.98%
