fix(prover-node): await in-flight jobs and guard world-state on stop #23338
Draft
AztecBot wants to merge 1 commit into
Summary
Fixes the teardown SEGFAULT in CI log `92cbf66b564931cf` (full analysis: original gist). The `e2e_fees/fee_settings` test body itself passed; the process died in `afterAll` because an `EpochProvingJob.run()` body was still calling into the native world-state addon after `ProverNode.stop()` returned.

Applies fixes #1 and #3 from that analysis. Fix #2 was skipped per request.
Fix #1: `ProverNode.stop()` awaits in-flight jobs

`yarn-project/prover-node/src/prover-node.ts`

- `startProof` used `void this.runJob(job)` (fire-and-forget), and `stop()` only awaited `job.stop()` for each tracked job. `job.stop()` only waits on the internal `runPromise` (set partway through `run()`), not on the `runJob` wrapper's post-run work (`tryUploadEpochFailure`, `createProvingJob` on reorg), both of which touch world-state.
- `stop()` called `this.prover.stop()` first, which only cancelled in-flight proving requests; the orchestrator's `asyncPool` body kept creating forks and inserting L1→L2 messages.
- Track `runJob` promises in a `Set` (`runJobPromises`) via a small `trackRunJob` helper, and have `stop()` signal `job.stop()` and await both the jobs and the `runJob` wrappers before stopping the prover/publisher/etc.

Fix #3: Defensive shutdown guard in `NativeWorldState`

`yarn-project/world-state/src/native/native_world_state_instance.ts`

- `close()` previously only drained the canonical (`forkId=0`) queue, so in-flight per-fork-queue calls could race with the native CLOSE and segfault. The existing `assert.equal(this.open, true, ...)` was inside the queue's execute callback, so it didn't prevent new calls from being enqueued and only produced AssertionError-style failures.
- `call()` early check: if `!this.open`, throw `Native world state is closed; cannot call <MSG>` before any queue lookup.
- `close()` now drains every non-canonical per-fork queue (`Promise.all(queue.stop())`) before sending CLOSE on the canonical queue.

Worst case during shutdown is now a recognisable JS error, never a SIGSEGV.
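The Fix #3 guard can be illustrated with a minimal, self-contained sketch. It is not the code in `native_world_state_instance.ts`: `SerialQueue`, `GuardedWorldState`, the `native` stub and the `log` array are hypothetical stand-ins. Flipping `open` to `false` before draining is also an assumption about ordering. Only the error message and the "drain per-fork queues, then CLOSE on the canonical queue" order come from the description above.

```typescript
// Hypothetical serial queue; stands in for the real per-fork queue class.
class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve();
  private stopped = false;

  // Run `fn` once everything enqueued before it has settled.
  put<T>(fn: () => Promise<T>): Promise<T> {
    if (this.stopped) {
      return Promise.reject(new Error('Queue is stopped'));
    }
    const result = this.tail.then(fn);
    // Keep the chain usable even if `fn` rejects.
    this.tail = result.catch(() => undefined);
    return result;
  }

  // Refuse new work, then wait for work that is already enqueued.
  async stop(): Promise<void> {
    this.stopped = true;
    await this.tail;
  }
}

const CANONICAL_FORK_ID = 0;

// Hypothetical model of the guarded instance; `native` fakes the addon call.
class GuardedWorldState {
  private open = true;
  private queues = new Map<number, SerialQueue>();
  public readonly log: string[] = [];

  async call(msg: string, forkId: number = CANONICAL_FORK_ID): Promise<string> {
    // Early check, before any queue lookup: fail with a plain JS error.
    if (!this.open) {
      throw new Error(`Native world state is closed; cannot call ${msg}`);
    }
    let queue = this.queues.get(forkId);
    if (!queue) {
      queue = new SerialQueue();
      this.queues.set(forkId, queue);
    }
    return queue.put(() => this.native(msg, forkId));
  }

  async close(): Promise<void> {
    if (!this.open) {
      return;
    }
    // Assumption: stop accepting calls first, so nothing new is enqueued while draining.
    this.open = false;

    // Drain every non-canonical per-fork queue.
    const forkQueues: SerialQueue[] = [];
    this.queues.forEach((queue, forkId) => {
      if (forkId !== CANONICAL_FORK_ID) {
        forkQueues.push(queue);
      }
    });
    await Promise.all(forkQueues.map(queue => queue.stop()));

    // Only then send CLOSE on the canonical queue.
    const canonical = this.queues.get(CANONICAL_FORK_ID) ?? new SerialQueue();
    await canonical.put(() => this.native('CLOSE', CANONICAL_FORK_ID));
    await canonical.stop();
  }

  // Stand-in for the native addon: records the order in which messages arrive.
  private async native(msg: string, forkId: number): Promise<string> {
    await Promise.resolve();
    const entry = `${msg}@${forkId}`;
    this.log.push(entry);
    return entry;
  }
}
```

In this model, a call issued on fork 7 just before `close()` still completes and is logged ahead of `CLOSE`, while any call issued after `close()` rejects with the "closed" error instead of reaching the native stub.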
Tests
- `prover-node.test.ts`: new `awaits in-flight epoch jobs before stop resolves` blocks `job.run()` via `promiseWithResolvers`, calls `stop()`, and asserts it does NOT resolve until `run` resolves.
- `native_world_state.test.ts`: new `rejects calls issued after close with a JS error rather than crashing` closes a service and asserts a follow-up `fork()` rejects with `/closed/i`.

Out of scope

- `cancelJobsOnStop: true` for in-process simulated prover-nodes.

Details: https://gist.github.com/AztecBot/296c5eee3acd34793098a8304b93f383
Test plan
- `yarn workspace @aztec/prover-node test src/prover-node.test.ts` green in CI (new `awaits in-flight epoch jobs before stop resolves`).
- `yarn workspace @aztec/world-state test src/native/native_world_state.test.ts` green in CI (new `rejects calls issued after close…`).
- `e2e_fees/` e2e teardown.

ClaudeBox log: https://claudebox.work/s/f403ab089cf18034?run=1
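For reviewers who want the behaviour the first test checks in miniature, the following is a rough sketch of the Fix #1 pattern. It is not the actual `ProverNode`: `MiniProverNode`, `Job`, `deferred`, `makeBlockedJob` and the `events` array are hypothetical. `runJobPromises` and `trackRunJob` mirror the names in the description. The real test uses `promiseWithResolvers`; a hand-rolled `deferred` helper is substituted here so the sketch has no dependencies.

```typescript
// Hypothetical minimal job interface; the real EpochProvingJob is much richer.
interface Job {
  run(): Promise<void>;
  stop(): Promise<void>;
}

class MiniProverNode {
  private jobs = new Set<Job>();
  private runJobPromises = new Set<Promise<void>>();
  public readonly events: string[] = [];

  startProof(job: Job): void {
    this.jobs.add(job);
    // Instead of `void this.runJob(job)`, keep a handle on the wrapper promise.
    this.trackRunJob(this.runJob(job));
  }

  private trackRunJob(promise: Promise<void>): void {
    this.runJobPromises.add(promise);
    const untrack = () => {
      this.runJobPromises.delete(promise);
    };
    // Remove the entry whether the wrapper resolves or rejects.
    void promise.then(untrack, untrack);
  }

  private async runJob(job: Job): Promise<void> {
    try {
      await job.run();
    } catch {
      this.events.push('run failed');
    } finally {
      // Stands in for post-run work that stop() must also wait for.
      this.events.push('post-run work done');
      this.jobs.delete(job);
    }
  }

  async stop(): Promise<void> {
    const jobs: Job[] = [];
    this.jobs.forEach(job => jobs.push(job));
    const wrappers: Promise<void>[] = [];
    this.runJobPromises.forEach(promise => wrappers.push(promise));

    // Signal every job, then wait for the jobs and their runJob wrappers.
    await Promise.all(jobs.map(job => job.stop()));
    await Promise.all(wrappers);

    // Only now would the prover, publisher, etc. be stopped.
    this.events.push('prover stopped');
  }
}

// Hand-rolled deferred; a stand-in for a promiseWithResolvers-style helper.
function deferred(): { promise: Promise<void>; resolve: () => void } {
  let resolve: () => void = () => {};
  const promise = new Promise<void>(res => {
    resolve = () => res();
  });
  return { promise, resolve };
}

// A job whose run() stays pending until `release` is called.
function makeBlockedJob(): { job: Job; release: () => void } {
  const gate = deferred();
  const job: Job = {
    run: () => gate.promise,
    // Resolves immediately, like a mock, so only the tracked wrapper keeps stop() pending.
    stop: () => Promise.resolve(),
  };
  return { job, release: gate.resolve };
}
```

With a blocked job started through `startProof`, `stop()` in this sketch stays pending until `release()` is called. The post-run work is then recorded before the simulated prover shutdown.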