Skip to content

fix(prover-client): drop late stale-epoch reports + race-proof epoch teardown#23348

Draft
AztecBot wants to merge 1 commit into
merge-train/spartanfrom
claudebox/fix-merge-train-spartan-23344
Draft

fix(prover-client): drop late stale-epoch reports + race-proof epoch teardown#23348
AztecBot wants to merge 1 commit into
merge-train/spartanfrom
claudebox/fix-merge-train-spartan-23344

Conversation

@AztecBot
Copy link
Copy Markdown
Collaborator

Summary

Re-fix the flaky ProvingBroker > Retries > does not retry if job is stale (persisted variant) that caused PR #23344 to be dequeued from the merge queue (failing run, CI log key 64a972aafaa40dd0).

.test_patterns.yml previously suppressed this exact failure; #23260 removed the suppression after cleaning up in-memory state, but didn't address the underlying race between the broker's epoch cleanup and the LMDB-backed batched writes, so the flake recurred. The failure surfaces as Error('Store is closed') thrown out of BatchQueue.execProcessor → KVBrokerDatabase.commitWrites → SingleEpochDatabase.batchWrite → store.transactionAsync.

Fix

Two small, independent defenses:

  1. proving_broker.ts — in both #reportProvingJobError and #reportProvingJobSuccess, after the existing !item early-return, also drop the report when this.isJobStale(item). A late report for a job whose epoch has been (or is being) cleaned up never reaches the database.
  2. proving_broker_database/persisted.ts:
    • Track deletedEpochs: Set<number> so commitWrites short-circuits for epoch directories that are being torn down.
    • deleteAllProvingJobsOlderThanEpoch adds to deletedEpochs first, then removes from the epochs map before awaiting db.delete() (which synchronously marks the store closed and then awaits rm). This closes the window where a racing commitWrites could fetch a still-mapped, already-closed store.
    • Catch a residual Error('Store is closed') inside commitWrites as a benign drop. Any other error still propagates.

Verification

yarn workspace @aztec/prover-client test src/proving_broker/proving_broker.test.ts — 5 consecutive runs, 102/102 tests passing on each, including both in-memory and persisted variants of does not retry if job is stale.

Full analysis (failure timeline, the two races, why both defenses are needed): https://gist.github.com/AztecBot/612364332ec5a062bcf88ded9177334e

Test plan

  • Persisted-variant does not retry if job is stale passes deterministically in CI under merge-queue-heavy concurrency
  • No regressions in the rest of proving_broker.test.ts (retries, timeouts, database management blocks) — verified locally
  • Merge-queue run on this PR completes cleanly

Created by claudebox — group: slackbot

ClaudeBox log: https://claudebox.work/s/731f628c79243769?run=1

…teardown

Re-fix the flaky 'ProvingBroker > Retries > does not retry if job is stale'
test (persisted variant) that dequeued PR #23344 from the merge queue. The
failure surfaces as Error('Store is closed') from BatchQueue.execProcessor
-> KVBrokerDatabase.commitWrites -> SingleEpochDatabase.batchWrite ->
store.transactionAsync, triggered by a race between the broker's epoch
cleanupPass and the persisted DB's in-flight batched writes.

Two small, independent defenses:

- proving_broker.ts: in #reportProvingJobError and #reportProvingJobSuccess,
  after the existing !item early-return, also drop the report when
  isJobStale(item). A late report for a job whose epoch has been (or is
  being) cleaned up never reaches the database.

- proving_broker_database/persisted.ts: track deletedEpochs so commitWrites
  short-circuits for torn-down epoch directories; reorder
  deleteAllProvingJobsOlderThanEpoch to remove from the epochs map before
  await db.delete(); catch a residual 'Store is closed' inside commitWrites
  as a benign drop.

Verified locally: 5 consecutive runs of proving_broker.test.ts pass, 102/102
tests each, including both in-memory and persisted variants of the previously
flaky 'does not retry if job is stale'.
@AztecBot AztecBot added ci-draft Run CI on draft PRs. claudebox Owned by claudebox. it can push to this PR. labels May 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-draft Run CI on draft PRs. claudebox Owned by claudebox. it can push to this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant