fix(prover-node): await in-flight jobs and guard world-state on stop #23338

Draft
AztecBot wants to merge 1 commit into merge-train/spartan from claudebox/fix-prover-node-stop-segfault

@AztecBot
Collaborator

Summary

Fixes the teardown SEGFAULT in CI log 92cbf66b564931cf (full analysis: original gist). The e2e_fees/fee_settings test body itself passed — the process died in afterAll because an EpochProvingJob.run() body was still calling into the native world-state addon after ProverNode.stop() returned.

Applies fixes #1 and #3 from that analysis. Fix #2 was skipped per request.

Fix #1 — ProverNode.stop() awaits in-flight jobs

yarn-project/prover-node/src/prover-node.ts

  • startProof used void this.runJob(job) (fire-and-forget) and stop() only awaited job.stop() for each tracked job. job.stop() only waits on the internal runPromise (set partway through run()), not the runJob wrapper's post-run work (tryUploadEpochFailure, createProvingJob on reorg) — both touch world-state.
  • Worse, stop() called this.prover.stop() first, which only cancelled in-flight proving requests; the orchestrator's asyncPool body kept creating forks and inserting L1→L2 messages.
  • Now track runJob promises in a Set (runJobPromises) via a small trackRunJob helper, and have stop() signal job.stop() and await both the jobs and the runJob wrappers before stopping the prover/publisher/etc.
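The tracking pattern described above can be sketched roughly as follows. This is a minimal illustration and not the code in prover-node.ts: `JobSketch`, `ProverNodeSketch`, and the `events` array are hypothetical stand-ins, and the real `runJob` wrapper does more post-run work than the single placeholder step shown here.

```typescript
// Hypothetical stand-in for an epoch proving job; the real EpochProvingJob API may differ.
interface JobSketch {
  run(): Promise<void>;
  stop(): Promise<void>;
}

class ProverNodeSketch {
  private readonly jobs = new Set<JobSketch>();
  // Holds one promise per runJob wrapper, covering run() plus any post-run work.
  private readonly runJobPromises = new Set<Promise<void>>();
  // Records the order of shutdown-relevant steps so it can be inspected.
  public readonly events: string[] = [];

  startProof(job: JobSketch): void {
    this.jobs.add(job);
    // Previously this would have been `void this.runJob(job)`.
    this.trackRunJob(this.runJob(job));
  }

  private trackRunJob(promise: Promise<void>): void {
    // Swallow rejections here so stop() never throws on a failed job,
    // and drop the entry once the wrapper has settled.
    const tracked: Promise<void> = promise
      .catch(() => undefined)
      .then(() => {
        this.runJobPromises.delete(tracked);
      });
    this.runJobPromises.add(tracked);
  }

  private async runJob(job: JobSketch): Promise<void> {
    try {
      await job.run();
    } finally {
      // Placeholder for post-run work that may still touch world-state.
      this.events.push("post-run work done");
      this.jobs.delete(job);
    }
  }

  async stop(): Promise<void> {
    // Signal every tracked job, then wait for the wrappers to finish too.
    await Promise.all(Array.from(this.jobs, job => job.stop()));
    await Promise.all(Array.from(this.runJobPromises));
    // Only now would the prover, publisher, and other services be stopped.
    this.events.push("prover stopped");
  }
}
```

In this sketch, `stop()` cannot reach the "prover stopped" step while any wrapper is still in its `finally` block, which is the ordering the fix relies on.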

Fix #3 — Defensive shutdown guard in NativeWorldState

yarn-project/world-state/src/native/native_world_state_instance.ts

  • close() previously only drained the canonical (forkId=0) queue — in-flight per-fork-queue calls could race with the native CLOSE and segfault. The existing assert.equal(this.open, true, ...) was inside the queue's execute callback, so it didn't prevent new calls from being enqueued and only produced AssertionError-style failures.
  • Add an early check at the top of call(): if !this.open, throw "Native world state is closed; cannot call <MSG>" before any queue lookup.
  • close() now drains every non-canonical per-fork queue (Promise.all(queue.stop())) before sending CLOSE on the canonical queue.

Worst case during shutdown is now a recognisable JS error, never a SIGSEGV.
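A rough sketch of the guard and the drain order, under stated assumptions: `SerialQueueSketch` and `WorldStateSketch` are hypothetical stand-ins for the real queue and NativeWorldState classes, the native call is simulated with a timer, and flipping `open` to false at the start of `close()` is an assumption of this sketch, not something the description above specifies.

```typescript
// Minimal serial queue; a stand-in for whatever queue the real class uses.
class SerialQueueSketch {
  private tail: Promise<unknown> = Promise.resolve();

  put<T>(fn: () => Promise<T>): Promise<T> {
    const result = this.tail.then(fn);
    // Keep the chain usable even if fn rejects.
    this.tail = result.catch(() => undefined);
    return result;
  }

  // Resolves once everything enqueued so far has finished.
  stop(): Promise<void> {
    return this.tail.then(() => undefined);
  }
}

class WorldStateSketch {
  private open = true;
  private readonly queues = new Map<number, SerialQueueSketch>();
  // Records the order in which simulated native messages complete.
  public readonly log: string[] = [];

  private queueFor(forkId: number): SerialQueueSketch {
    let queue = this.queues.get(forkId);
    if (!queue) {
      queue = new SerialQueueSketch();
      this.queues.set(forkId, queue);
    }
    return queue;
  }

  async call(msg: string, forkId = 0): Promise<string> {
    // Early guard: fail with a plain JS error before any queue lookup.
    if (!this.open) {
      throw new Error(`Native world state is closed; cannot call ${msg}`);
    }
    return this.queueFor(forkId).put(async () => {
      // Simulated native call.
      await new Promise(resolve => setTimeout(resolve, 5));
      this.log.push(`${msg}@${forkId}`);
      return msg;
    });
  }

  async close(): Promise<void> {
    if (!this.open) {
      return;
    }
    // Assumption: reject new calls as soon as close() begins.
    this.open = false;
    // Drain every non-canonical (forkId !== 0) queue first...
    const forkDrains = Array.from(this.queues.entries())
      .filter(([forkId]) => forkId !== 0)
      .map(([, queue]) => queue.stop());
    await Promise.all(forkDrains);
    // ...then send CLOSE on the canonical queue.
    await this.queueFor(0).put(async () => {
      this.log.push("CLOSE");
    });
  }
}
```

With this ordering, a call already sitting in a fork queue completes before CLOSE is sent, and a call issued after `close()` rejects with the "closed" error instead of reaching the native layer.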

Tests

  • prover-node.test.ts: new test "awaits in-flight epoch jobs before stop resolves" blocks job.run() via promiseWithResolvers, calls stop(), and asserts that it does NOT resolve until run() resolves.
  • native_world_state.test.ts: new test "rejects calls issued after close with a JS error rather than crashing" closes a service and asserts that a follow-up fork() rejects with /closed/i.
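The "hold run() open, call stop(), check that stop() is still pending" pattern from the first test can be sketched without any project helpers. Everything below is illustrative: `promiseWithResolvers` is a local stand-in for the helper the test uses, `isPending` is a hypothetical utility, and the inline async function only imitates a `stop()` that waits for `run()`.

```typescript
// Local stand-in for a promiseWithResolvers helper.
function promiseWithResolvers<T>() {
  let resolve!: (value: T) => void;
  let reject!: (reason: unknown) => void;
  const promise = new Promise<T>((res, rej) => {
    resolve = res;
    reject = rej;
  });
  return { promise, resolve, reject };
}

// Hypothetical helper: true if the promise has not settled within `ms` milliseconds.
async function isPending(promise: Promise<unknown>, ms = 20): Promise<boolean> {
  const marker = Symbol("pending");
  const timeout = new Promise<symbol>(resolve => setTimeout(() => resolve(marker), ms));
  const settled = promise.then(
    () => undefined,
    () => undefined,
  );
  const winner = await Promise.race([settled, timeout]);
  return winner === marker;
}

async function demoStopWaitsForRun() {
  const run = promiseWithResolvers<void>();
  // Imitates a stop() that awaits the in-flight run() before resolving.
  const stop = (async () => {
    await run.promise;
  })();
  const pendingBeforeRelease = await isPending(stop);
  run.resolve();
  const pendingAfterRelease = await isPending(stop);
  return { pendingBeforeRelease, pendingAfterRelease };
}
```

Here `demoStopWaitsForRun()` reports that the imitation `stop()` is pending before `run` is released and settled afterwards. A real test would make the same two checks against ProverNode.stop().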

Out of scope

  • C++-side destruction guard / refcount inside the native addon itself (defence-in-depth beyond the JS wrapper).
  • Fix #2 (cancelJobsOnStop: true for in-process simulated prover-nodes).

Details: https://gist.github.com/AztecBot/296c5eee3acd34793098a8304b93f383

Test plan

  • yarn workspace @aztec/prover-node test src/prover-node.test.ts is green in CI (new test: "awaits in-flight epoch jobs before stop resolves").
  • yarn workspace @aztec/world-state test src/native/native_world_state.test.ts is green in CI (new test: "rejects calls issued after close…").
  • No regressions in e2e_fees / e2e teardown.

ClaudeBox log: https://claudebox.work/s/f403ab089cf18034?run=1

@AztecBot added labels on May 16, 2026: ci-draft (Run CI on draft PRs), claudebox (Owned by claudebox; it can push to this PR)
@ludamad force-pushed the merge-train/spartan branch from 4af2626 to db4ec58 on May 16, 2026 19:07
@AztecBot force-pushed the claudebox/fix-prover-node-stop-segfault branch from 0e7aecb to 8089661 on May 16, 2026 19:45