Skip to content

fix(rivetkit): add lifecycle error retry and gateway HTTP routing#4666

Draft
NathanFlurry wants to merge 1 commit into04-14-chore_engine_publish_engine_bases_in_cifrom
04-14-fix_rivetkit_add_lifecycle_error_retry_and_gateway_http_routing
Draft

fix(rivetkit): add lifecycle error retry and gateway HTTP routing#4666
NathanFlurry wants to merge 1 commit into04-14-chore_engine_publish_engine_bases_in_cifrom
04-14-fix_rivetkit_add_lifecycle_error_retry_and_gateway_http_routing

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 15, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude bot commented Apr 15, 2026

Code Review: fix(rivetkit): add lifecycle error retry and gateway HTTP routing

Overview

This PR introduces two main changes:

  1. Lifecycle error retry: A new lifecycle-errors.ts module classifies actor restart/stop signals and wraps action calls in retryOnLifecycleBoundary, eliminating hacky waitFor sleep workarounds from tests.
  2. Gateway HTTP routing fix: Ensures HTTP requests are proxied under the /request/ prefix consistently across the gateway, inline client driver, engine actor driver, and remote engine client.

The WebSocket listener ordering fix (attach close/error before onOpen) is a correct and welcome correctness fix.


Positive Aspects

  • The new ActorRestarting error (503, retryable: true, structured metadata) is a clean API improvement over string-matching.
  • Removing the waitFor sleep hacks from tests is the right call — lifecycle retry makes them unnecessary.
  • TODO comments consistently reference RVT-6193, making the legacy paths easy to find and clean up.
  • walkErrorChain provides a principled way to inspect nested error causes.

Issues and Suggestions

1. URL normalization logic is duplicated across 4 locations

The /request/ prefix normalization appears independently in:

  • gateway.ts:buildActorRawHttpProxyUrl
  • test-inline-client-driver.ts (inline, lines 219-226)
  • actor-driver.ts:#normalizeEnvoyHttpRequestForActorRouter
  • engine-client/mod.ts:rawHttpProxyPath

This should be extracted to a shared utility to keep the normalization rule in one place.

2. #normalizeEnvoyHttpRequestForActorRouter hardcoded allowlist is fragile (actor-driver.ts:1252-1278)

The method uses an explicit allowlist of known paths (/action/, /queue/, /websocket/, /inspector/, etc.) to decide whether to skip the /request/ prefix. Any new router endpoint that does not start with a known prefix will silently get double-prefixed. Consider:

  • Flipping to a denylist approach (paths that must NOT be prefixed)
  • Using a marker header set by the caller to indicate the request is already routed correctly
  • Documenting this list clearly with a comment pointing to router-endpoints.ts

3. actor-conn.ts catch block swallows non-abort errors as "aborted" (actor-conn.ts:428-435)

if (
  err instanceof AbortError ||
  (err as Error).name === 'AbortError' ||
  !this.#shouldRetryConnectionOpenError(err)
) {
  logger().info({ msg: 'connection retry aborted' });
}

A non-abort, non-retryable error will hit this branch and be logged only as "connection retry aborted" with the error detail dropped. Consider logging at warn or error level with the actual error when it is not a true abort.

4. lifecycle-errors.ts transport error string matches need source pointers

The three exact string matches ("WebSocket connection closed during shutdown", "envoy shut down", "envoy shutting down") are tagged TODO/RVT-6193 which is appropriate. However, if these strings change in the runner, retries will silently stop working. Add a comment pointing to the source location of each string so they stay easy to keep in sync.

5. No unit tests for lifecycle-errors.ts

The classification logic has significant branches: structured vs. legacy, actor vs. transport, request_retry vs. reconnect_only, the database-stop exclusion, and error-chain walking. Unit tests covering the main classification paths would provide a safety net for future changes.

6. Agent notes committed to repo

.agent/notes/04-14-break-this-up-pr-plan.md and .agent/notes/driver-test-suite-priority-checklist.md appear to be working notes from the development session. Consider removing these before marking the PR ready, or confirm they belong in the final commit.

7. PR description is blank

The checklist is entirely unchecked and there is no test description. Should be filled out before marking ready (Bug fix + New feature for ActorRestarting).


Minor Notes

  • The isActorStoppingConnectionError helper removed from actor-workflow.ts is cleanly replaced by retryOnLifecycleBoundary.
  • The waitFor removal from test files is correct given lifecycle retry now handles restarts, but verify the tests still exercise real sleep/wake transitions rather than skipping them.
  • retryOnLifecycleBoundary defaults to 5 attempts with delays up to 200ms (~575ms total). If actor restart takes longer, callers may see false failures. Consider documenting or increasing the default.

Summary

The core approach is sound. Structured lifecycle errors + client-side retry is the right model, and the gateway routing fix is correct. Main concerns before marking ready: deduplicate the URL normalization logic, add unit tests for lifecycle-errors.ts, and clean up the agent notes from the commit. The legacy string-match paths are acceptable given the TODO tracking.

@github-actions
Copy link
Copy Markdown
Contributor

Preview packages published to npm

Install with:

npm install rivetkit@pr-4666

All packages published as 0.0.0-pr.4666.c672e53 with tag pr-4666.

Engine binary is shipped via @rivetkit/engine-cli on linux-x64-musl, linux-arm64-musl, darwin-x64, and darwin-arm64. Windows users should use the release installer or set RIVET_ENGINE_BINARY.

Docker images:

docker pull rivetdev/engine:slim-c672e53
docker pull rivetdev/engine:full-c672e53
Individual packages
npm install rivetkit@pr-4666
npm install @rivetkit/react@pr-4666
npm install @rivetkit/rivetkit-native@pr-4666
npm install @rivetkit/workflow-engine@pr-4666

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant