[core] V2: unify wait+step queue dispatch in suspension processing#1925
VaguelySerious wants to merge 5 commits into `main`
Conversation
Fix `Promise.race(step, sleep)` semantics in V2 mixed suspensions without losing inline step execution.

Inline `await executeStep(...)` blocks the V2 handler for the full step duration, but `wait_completed` events are only created on the *next* loop iteration's "complete elapsed waits" pass. So if the sleep is shorter than the step, replay always picked the step because the `wait_completed` event hadn't been written yet — `sleepWinsRaceWorkflow` returned `'step'` instead of `'sleep'`.

Fix: when a suspension contains both an owned inline step and at least one pending wait, queue a delayed self-message with `delaySeconds = suspensionResult.timeoutSeconds` *before* starting inline execution. The queued continuation fires in a separate function invocation while the step is still running. That parallel invocation's "complete elapsed waits" pass writes `wait_completed`, replay observes the elapsed wait, and `Promise.race` resolves with the sleep correctly. The original (still-running) inline invocation finishes its step, sees `run_completed` on the next loop iteration, and exits.

This preserves inline-step execution speed for the step-wins case: the step finishes inline and the workflow returns directly. The eagerly queued wait continuation fires after the step has won and simply no-ops on the terminal run.

Test plan:
- New e2e tests `sleepWinsRaceWorkflow` and `stepWinsRaceWorkflow` exercising `Promise.race` between a step function and `sleep()`, in both directions.
- Verified locally against the `nextjs-turbopack` workbench: both pass. The event log confirms `wait_completed` is created at t≈1s after `wait_created` (1s sleep) instead of at t≈11s after the inline step finishes.
- Eager-processing changelog updated with a "Mixed Suspensions" section describing the pre-scheduled wait approach and its rationale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
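The shape of the failing workflow, as a minimal self-contained sketch. `sleep` and `slowStep` below are local stand-ins for the SDK's primitives (the real test races a 1s sleep against a 10s step); durations are scaled down for illustration.

```typescript
// Stand-in for the SDK's sleep() helper.
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Stand-in for the step function (the real test's step runs ~10s).
async function slowStep(): Promise<string> {
  await sleep(50);
  return "step";
}

async function sleepWinsRaceWorkflow(): Promise<string> {
  // The shorter sleep must win the race. Before this PR, inline step
  // execution blocked the handler for the full step duration, so replay
  // resolved the race with "step" instead.
  return Promise.race([slowStep(), sleep(5).then(() => "sleep")]);
}
```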
🦋 Changeset detected. Latest commit: 3f67059. The changes in this PR will be included in the next version bump. This PR includes changesets to release 20 packages.
Conflicts:
- `docs/content/docs/changelog/eager-processing.mdx` — took main's "Mixed Suspensions" section as the base. Will be rewritten in the unification commits that follow.
- `packages/core/src/runtime.ts` — auto-merged. Option A's `inlineStep === undefined` when `timeoutSeconds !== undefined` check is now in place from main, which makes Option B's eager-wait-queue branch unreachable. Will collapse the two into a unified queue dispatch in the next commits.
The local queue's `queue()` enqueue path ignored the `delaySeconds`
option entirely — every message was delivered immediately, regardless
of the requested delay. VQS-side queues (used by world-vercel and
world-postgres) honor delaySeconds at the broker, so this brings
world-local in line with production semantics.
The runtime needs this to land before the wait-as-continuation
unification in the next commit: that change starts queueing wait
timers as fresh delayed continuations instead of returning
`{ timeoutSeconds }`. Without delaySeconds support, those wait
continuations would fire instantly in dev and trigger spurious
replays.
Sleep happens outside the queue's worker semaphore so a delayed
message doesn't tie up a worker slot during its delay window — other
immediate messages are free to dispatch in parallel.
New tests in queue.test.ts cover:
- delaySeconds > 0 → setTimeout called with the right ms value
- delaySeconds === 0 → no setTimeout (immediate dispatch)
- delaySeconds omitted → no setTimeout (immediate dispatch)
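A sketch of the delay branch under these assumptions — the class, method, and field names are illustrative, not world-local's actual internals:

```typescript
// Illustrative local queue: the timer runs OUTSIDE any worker semaphore,
// so a delayed message never occupies a worker slot during its delay
// window, and immediate messages keep dispatching in parallel.
type Message = { body: unknown };

class LocalQueue {
  private ready: Message[] = [];

  queue(body: unknown, opts: { delaySeconds?: number } = {}): void {
    const { delaySeconds } = opts;
    if (delaySeconds !== undefined && delaySeconds > 0) {
      // delaySeconds > 0: deliver after the requested delay.
      setTimeout(() => this.ready.push({ body }), delaySeconds * 1000);
    } else {
      // delaySeconds of 0 or omitted: immediate dispatch, no timer.
      this.ready.push({ body });
    }
  }

  // Drain whatever is currently deliverable (stand-in for worker dispatch).
  drain(): Message[] {
    return this.ready.splice(0);
  }
}
```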
Replace the asymmetric "steps go to the queue, waits become a
{ timeoutSeconds } return value" pattern with a single Promise.all
batch that queues every pending operation we are not running inline.
Before this change, suspension processing had three branches that
all needed to keep the wait/step asymmetry consistent:
- pendingSteps.length === 0 returned { timeoutSeconds }
- inlineStep + waits eagerly queued a delayed self-message AND set
inlineStep to undefined (Option A) AND returned { timeoutSeconds }
- inlineStep retry path returned { timeoutSeconds } if there were waits
After this change, every suspension goes through one path:

```
for each non-inline pendingStep: queue a stepId message
if timeoutSeconds is defined:    queue a delayed continuation
await Promise.all(dispatches)
if there is no inlineStep:       return
await executeStep(inlineStep)
```
Behaviorally, this restores inline step execution even when the
suspension also has a wait (Option A's carve-out is no longer
necessary): the wait timer fires in a separate function invocation
on the queue, in parallel with the inline step. If the sleep wins
the race, that parallel invocation observes wait_completed via the
"complete elapsed waits" pass and finishes the run; if the step
wins, the wait continuation fires later and no-ops on the terminal
run via the existing terminal-event check.
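Sketched in TypeScript under stated assumptions — the message shapes, `queueClient`, `executeStep`, and `dispatchSuspension` below are illustrative stand-ins, not the runtime's real API — the unified path looks like:

```typescript
interface Step { stepId: string }
interface Suspension {
  pendingSteps: Step[];
  inlineStep?: Step;        // at most one owned step runs inline
  timeoutSeconds?: number;  // earliest pending wait, if any
}
interface QueueClient {
  send(msg: object, opts?: { delaySeconds?: number }): Promise<void>;
}

async function dispatchSuspension(
  s: Suspension,
  queueClient: QueueClient,
  executeStep: (step: Step) => Promise<void>,
  traceCarrier: Record<string, string>, // serialized once, shared by all step messages
): Promise<void> {
  const dispatches: Promise<void>[] = [];

  // Every pending step we are NOT running inline becomes a queue message.
  for (const step of s.pendingSteps) {
    if (step === s.inlineStep) continue;
    dispatches.push(queueClient.send({ kind: "step", stepId: step.stepId, traceCarrier }));
  }

  // The wait timer becomes a delayed continuation instead of a
  // `{ timeoutSeconds }` return value.
  if (s.timeoutSeconds !== undefined) {
    dispatches.push(queueClient.send({ kind: "continue" }, { delaySeconds: s.timeoutSeconds }));
  }

  // One batch; then at most one owned step runs inline.
  await Promise.all(dispatches);
  if (s.inlineStep) await executeStep(s.inlineStep);
}
```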
Other cleanups:
- The inline-step retry path no longer needs to forward
suspensionResult.timeoutSeconds — the wait timer was already
enqueued as part of the unified dispatch above.
- A dead post-step `if (timeoutSeconds && pendingSteps.length === 1)`
block (just a comment, no body) is removed; the loop's
"complete elapsed waits" pass handles the same case correctly.
- Step queueing now uses a shared `traceCarrier` rather than
re-serializing per step.
Retry/throttle and hook-conflict paths still return { timeoutSeconds }
since their semantics are "redeliver THIS message after a delay"
rather than "schedule a fresh wait timer." Those can be unified in
a follow-up.
Test plan:
- New e2e tests `sleepWinsRaceWorkflow` and `stepWinsRaceWorkflow`
pass against the `nextjs-turbopack` workbench.
- Event log inspection confirms wait_completed fires at t≈1s (after
wait_created at t≈0s) for the sleep-wins case, and that the inline
step runs only once (no duplicate step_started events that the
earlier eager-queue approach produced in dev).
- All 842 @workflow/core unit tests pass.
- All 346 @workflow/world-local unit tests pass (with the
delaySeconds support added in the previous commit).
Requires the world-local delaySeconds fix in the prior commit;
without it, wait continuations would fire instantly in dev and the
parallel replay would re-enter handleSuspension before the wait
elapsed (recoverable via existing redelivery, but inefficient).
Update the "Mixed Suspensions" section in eager-processing.mdx to describe the unified parallel-dispatch model:
- All non-inline pendingSteps are queued with stepId
- The wait timer (if any) is queued as a delayed continuation
- All dispatched in one Promise.all batch
- One owned step is then inline-executed (if any)

The doc previously described Option A (the carve-out where waits forced all steps to be queued); the unified model removes that carve-out and explains why the wait continuation works in parallel with the inline step. Also notes the dependency on world-local's new delaySeconds support (landed earlier in the same PR series).

Changeset bumps both @workflow/core and @workflow/world-local since both packages have user-observable behavior changes.
🧪 E2E Test Results: ❌ Some tests failed

Summary

❌ Failed Tests — 🐘 Local Postgres (11 failed):
- astro-stable (1 failed)
- express-stable (1 failed)
- fastify-stable (1 failed)
- hono-stable (1 failed)
- nextjs-turbopack-stable-lazy-discovery-disabled (1 failed)
- nextjs-webpack-canary (1 failed)
- nextjs-webpack-stable-lazy-discovery-disabled (1 failed)
- nextjs-webpack-stable-lazy-discovery-enabled (1 failed)
- nuxt-stable (1 failed)
- sveltekit-stable (1 failed)
- vite-stable (1 failed)

📋 Other (1 failed):
- e2e-local-postgres-tanstack-start- (1 failed)

Details by Category:
- ✅ ▲ Vercel Production
- ✅ 💻 Local Development
- ✅ 📦 Local Production
- ❌ 🐘 Local Postgres
- ✅ 🪟 Windows
- ❌ 📋 Other

❌ Some E2E test jobs failed:
Check the workflow run for details.
📊 Benchmark Results

[Benchmark tables omitted: each scenario reported 💻 Local Development and ▲ Production (Vercel) numbers with a 🔍 Nitro observability link.]

Scenarios: workflow with no steps; workflow with 1, 10, 25, and 50 sequential steps; Promise.all with 10, 25, and 50 concurrent steps; Promise.race with 10, 25, and 50 concurrent steps; workflow with 10, 25, and 50 sequential data payload steps (10KB); workflow with 10, 25, and 50 concurrent data payload steps (10KB); stream benchmarks including TTFB metrics (workflow with stream, stream pipeline with 5 transform steps (1MB), 10 parallel streams (1MB each), fan-out fan-in 10 streams (1MB each)).

Summary: Fastest Framework by World and Fastest World by Framework (winner determined by most benchmark wins).

❌ Some benchmark jobs failed:
Check the workflow run for details.
Summary
Replaces the prior eager-wait-queue patch (and supersedes #1924's carve-out — see history below) with a single unified queue dispatch in V2 suspension processing.
V2 had two different mechanisms for "do something later":
- Steps: queued as messages carrying `stepId` + `stepName` and an idempotency key.
- Waits: returned as `{ timeoutSeconds }`; the queue redelivered the same message after that delay.

The asymmetry was load-bearing in three branches of suspension processing and made `Promise.race(step, sleep)` semantically incorrect under inline step execution. This PR drops the asymmetry: every pending operation we are not running inline becomes an outbound queue message, dispatched in one `Promise.all` batch.

Inline step execution is restored even when there's a pending wait — the wait timer fires in a separate function invocation, in parallel with the inline step. If the sleep wins, that parallel invocation observes `wait_completed` via the "complete elapsed waits" pass and finishes the run; if the step wins, the wait continuation fires later and no-ops on the terminal run.

Commits
1. `[world-local] Honor delaySeconds before message delivery` — world-local's queue ignored `delaySeconds`, so wait continuations would fire instantly in dev. Now matches the broker behavior of `world-vercel` and `world-postgres`. Three new unit tests for the delay branch.
2. `[core] V2: unify wait+step queue dispatch in suspension processing` — the actual unification. `runtime.ts` is 50 lines shorter, and suspension processing has one path instead of three.
3. `[docs] V2 unified suspension dispatch + changeset` — the `eager-processing.mdx` "Mixed Suspensions" section rewritten to describe the unified model. Changeset.

Out of scope (kept as `{ timeoutSeconds }`)

The retry/throttle and hook-conflict paths still return `{ timeoutSeconds }` because their semantics are "redeliver THIS message after a delay" rather than "schedule a fresh wait timer." Unifying those would reset the message attempt counter and require a different runaway-loop guard. Tracked as a follow-up.

Reproducer (failing test from #1916)
`sleepWinsRaceWorkflow`: a 1s sleep raced against a 10s step. Expected `'sleep'` to win.

Event log on the failing run, before any fix:
1. `step_created` + `wait_created`
2. `step_started`
3. `step_completed`
4. `wait_completed`
5. `run_completed` → returned `'step'`

Event log after this PR:
1. `step_created` + `wait_created`
2. `step_started`
3. `wait_completed`
4. `run_completed` → returns `'sleep'`
5. `step_completed`
The 10s inline step continues to run in the background — `world-local`'s `step_completed` write on a terminal run is allowed (writes are not gated on run state for step lifecycle events; the orphan event has no observable effect). On `world-vercel`/`world-postgres`, the same thing happens — the step body runs to completion in its inline invocation, but `step_completed` either succeeds against the terminal run or is no-oped, depending on the world implementation.
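The terminal-run check the late wait continuation relies on can be illustrated as follows. The function names are hypothetical; `run_completed` is terminal per the logs above, and treating a run-failure event as terminal too is an assumption of this sketch.

```typescript
// A late wait continuation reads the event log and exits without
// replaying once the run has reached a terminal event.
const TERMINAL_EVENT_TYPES = new Set(["run_completed", "run_failed"]);

interface RunEvent { type: string }

function isRunTerminal(events: RunEvent[]): boolean {
  return events.some((e) => TERMINAL_EVENT_TYPES.has(e.type));
}

// Step-wins case: the continuation fires after run_completed and no-ops.
function handleWaitContinuation(events: RunEvent[]): "noop" | "replay" {
  return isRunTerminal(events) ? "noop" : "replay";
}
```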
- `sleepWinsRaceWorkflow` and `stepWinsRaceWorkflow` (the failing test from #1916, plus the symmetric stepWins case) pass against the `nextjs-turbopack` workbench locally.
- `wait_completed` fires at t≈1s after `wait_created` (not after the inline step), and `step_started` appears exactly once per run (the eager-queue prototype produced duplicates in dev).
- `@workflow/core` unit tests pass.
- `@workflow/world-local` unit tests pass (3 new for `delaySeconds`).

History
🤖 Generated with Claude Code