Skip to content

fix(client): await_ready retries on transient errors instead of bailing#147

Merged
githubrobbi merged 3 commits intomainfrom
fix/await-ready-retry-on-other-errors
May 8, 2026
Merged

fix(client): await_ready retries on transient errors instead of bailing#147
githubrobbi merged 3 commits intomainfrom
fix/await-ready-retry-on-other-errors

Conversation

@githubrobbi
Copy link
Copy Markdown
Collaborator

Summary

Aligns the sync UffsClientSync::await_ready with its async sibling: any non-Ready, non-success outcome from a status poll (Loading status, I/O error, connection closed, RPC timeout, transient protocol error) now keeps polling until the caller-supplied timeout deadline elapses.

Root cause

crates/uffs-client/src/connect_sync.rs::await_ready matched Err(other) => return Err(other), so a single transient error from any in-flight status RPC aborted readiness probing immediately. The async sibling at crates/uffs-client/src/connect.rs::await_ready has always classified non-Ready, non-success outcomes as PollOutcome::OtherError and continued the loop — the sync path was the outlier.

Why now — 2026-05-07 Phase 7 soak finding

The Phase 7 24-h soak attempt failed with:

Error: Daemon did not become ready in time
  Caused by: request timed out

even though the captured daemon.log showed the daemon up and IPC-listening 1.3 s after spawn. The very first status RPC during the Windows AF_UNIX socket-bind window can hit its per-RPC deadline and surface as ClientError::Timeout, which the pre-fix code returned immediately — aborting probing while the daemon was healthy and one poll away from Ready.

PR #146 already added a soak-harness-level workaround (idempotent attach + race-tolerant spawn). This PR fixes the underlying CLI bug so direct uffs daemon start invocations no longer race.

Regression test

await_ready_retries_on_protocol_error_until_deadline (crates/uffs-client/src/connect_sync_tests.rs) feeds a JSON-RPC error response to the very first poll, surfaced as ClientError::Protocol.

  • Pre-fix: test fails in <1 ms with Err(Protocol("transient mid-handshake error"))
  • Post-fix: test takes the full 120 ms deadline and returns Err(ClientError::Timeout)

The 0.332 s observed test wall-clock confirms the loop actually polled.

Local validation

  • cargo nextest run -p uffs-client173/173 passed, including the new test and all four existing await_ready_* regression pins
  • just lint-fast — fmt-check, file-size, typos, reuse, lint-ci, lint-prod, lint-tests
  • just check-windowscargo xwin check --workspace --all-targets --all-features

Compliance audit (mandatory rules)

  1. No suppression hacks — no allow / cfg / ignore added; the fix is one match arm. File-size policy on connect_sync.rs was avoided by trimming an inline comment (the regression test's docstring carries the rationale), not by adding a file_size_exceptions.txt entry.
  2. Surgical, correct fixes — single match arm changed; mirrors the async sibling's existing PollOutcome::OtherError classification.
  3. Preserve behavior & contractsawait_ready still returns Err(ClientError::Timeout) on deadline expiry; only the fast-fail-on-transient-error edge case changes (and that path was the bug).
  4. Improve tests, don't dodge them — new regression test pins the contract; all existing await_ready_* tests still pass.

…ling

The sync `UffsClientSync::await_ready` matched `Err(other) => return Err(other)`, so a single transient error from any `status` poll aborted readiness probing immediately.

The async sibling at `crates/uffs-client/src/connect.rs::await_ready` has always classified non-Ready, non-success outcomes as `PollOutcome::OtherError` and continued the loop. This patch aligns the sync path with that behaviour: every non-Ready outcome — Loading/Refreshing status, I/O error, connection closed, RPC timeout, or transient `Protocol` error from a partial-response read — keeps polling until the caller-supplied `timeout` deadline elapses.

## Why now

The 2026-05-07 Phase 7 24-h soak attempt failed with `Daemon did not become ready in time / request timed out` even though the captured `daemon.log` showed the daemon up and IPC-listening 1.3 s after spawn. Root cause: the very first `status` RPC during the Windows `AF_UNIX` socket-bind window can hit its per-RPC deadline and surface as `ClientError::Timeout`, which the pre-fix code returned immediately. Post-fix the loop tolerates that transient and waits for the daemon to finish coming up.

## Regression test

`await_ready_retries_on_protocol_error_until_deadline` feeds a JSON-RPC error response to the very first poll, which surfaces as `ClientError::Protocol`. Pre-fix the loop would have returned that error in <1 ms; post-fix the test takes ~120 ms (the deadline) and returns `ClientError::Timeout` — confirming the loop polled rather than fail-fast.

Mac gates green: `cargo nextest -p uffs-client` (173/173), `just lint-fast`, `just check-windows`.
@githubrobbi githubrobbi enabled auto-merge (squash) May 7, 2026 23:14
@githubrobbi githubrobbi merged commit 2b016db into main May 8, 2026
26 checks passed
@githubrobbi githubrobbi deleted the fix/await-ready-retry-on-other-errors branch May 8, 2026 00:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant