-
Notifications
You must be signed in to change notification settings - Fork 248
Make resumeHook() resilient to transient hook_received event write failures #1834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| --- | ||
| "@workflow/core": minor | ||
| "workflow": minor | ||
| "@workflow/world": minor | ||
| --- | ||
|
|
||
| Make `resumeHook()` resilient to transient `hook_received` event write failures (429/5xx) by carrying the payload on the queue message for the runtime to materialize. Returned `Hook` gets a new `resilientResume: true` flag when this fallback path is taken. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,5 +1,6 @@ | ||
| import { | ||
| EntityConflictError, | ||
| HookNotFoundError, | ||
| RUN_ERROR_CODES, | ||
| RunExpiredError, | ||
| WorkflowRuntimeError, | ||
|
|
@@ -125,6 +126,7 @@ export function workflowEntrypoint( | |
| traceCarrier: traceContext, | ||
| requestedAt, | ||
| runInput, | ||
| hookInput, | ||
| } = WorkflowInvokePayloadSchema.parse(message_); | ||
| const { requestId } = metadata; | ||
| // Extract the workflow name from the topic name | ||
|
|
@@ -450,6 +452,85 @@ export function workflowEntrypoint( | |
| } | ||
| } | ||
|
|
||
| // --- Resilient resume: materialize missing hook_received --- | ||
| // When `resumeHook()` fires its hook_received event write and | ||
| // queue dispatch in parallel, the event write may fail with | ||
| // a transient 429/5xx while the queue dispatch succeeds. | ||
| // In that case `hookInput` is present on the queue payload, | ||
| // carrying the dehydrated payload + a client-minted | ||
| // idempotency key (`resumeId`). If no existing hook_received | ||
| // event already carries that `resumeId`, we materialize one | ||
| // here so the workflow replay sees the payload. Mirrors | ||
| // `start()`'s resilient path for run_created → run_started. | ||
| if (hookInput) { | ||
| const alreadyMaterialized = events.some( | ||
| (e) => | ||
| e.eventType === 'hook_received' && | ||
| e.correlationId === hookInput.hookId && | ||
| (e.eventData as { resumeId?: string } | undefined) | ||
| ?.resumeId === hookInput.resumeId | ||
| ); | ||
| if (!alreadyMaterialized) { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd usually say this isn't safe (TOCTOU race): We should pass However, since events.create -> queue is in sequence, and we only send |
||
| try { | ||
| const result = await world.events.create( | ||
| runId, | ||
| { | ||
| eventType: 'hook_received', | ||
| specVersion: | ||
| workflowRun.specVersion ?? SPEC_VERSION_CURRENT, | ||
| correlationId: hookInput.hookId, | ||
| eventData: { | ||
| payload: hookInput.payload as any, | ||
| resumeId: hookInput.resumeId, | ||
| }, | ||
| }, | ||
| { requestId } | ||
| ); | ||
| if (result.event) { | ||
| events.push(result.event); | ||
| } | ||
| runtimeLogger.warn( | ||
| 'Materialized hook_received event from queue payload (resilient resume)', | ||
| { | ||
| workflowRunId: runId, | ||
| hookId: hookInput.hookId, | ||
| resumeId: hookInput.resumeId, | ||
| } | ||
| ); | ||
| span?.setAttributes({ | ||
| ...Attribute.HookResilientResumeMaterialized(true), | ||
| }); | ||
| } catch (err) { | ||
| if (EntityConflictError.is(err)) { | ||
| // Another queue delivery already materialized this | ||
| // hook_received event — safe to ignore. | ||
| runtimeLogger.info( | ||
| 'Hook resilient-resume materialization skipped (already exists)', | ||
| { | ||
| workflowRunId: runId, | ||
| hookId: hookInput.hookId, | ||
| resumeId: hookInput.resumeId, | ||
| } | ||
| ); | ||
| } else if (HookNotFoundError.is(err)) { | ||
| // The hook was disposed between resumeHook() and | ||
| // this queue delivery. Drop the resume — there is | ||
| // no active awaiter to deliver it to. | ||
| runtimeLogger.warn( | ||
| 'Hook was disposed before resilient resume could materialize — dropping payload', | ||
| { | ||
| workflowRunId: runId, | ||
| hookId: hookInput.hookId, | ||
| resumeId: hookInput.resumeId, | ||
| } | ||
| ); | ||
| } else { | ||
| throw err; | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // Resolve the encryption key for this run's deployment | ||
| const rawKey = | ||
| await world.getEncryptionKeyForRun?.(workflowRun); | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment explains resilient resume in terms of
resumeHook()writing the event and queueing “in parallel”, butresumeHook()is now sequential specifically to avoid the duplicate-materialization race. Please update this block to reflect the current ordering and the actual condition forhookInputbeing present (only when the direct write failed with a retryable error).