
fix(yarn-project): retry sqlite3mc vendor download on transient errors #23304

Draft

AztecBot wants to merge 26 commits into merge-train/spartan from claudebox/fix-merge-train-spartan-23296

Conversation

@AztecBot
Collaborator

Problem

yarn-project/sqlite3mc-wasm/scripts/vendor.sh uses curl -fsSL with no retry logic. A single DNS failure against release-assets.githubusercontent.com (the redirected CDN host for GitHub release downloads) is enough to abort compile_all and break the entire CI run.

Most recent occurrence: merge-train/spartan run on commit 0291cc27e4 (PR #23296) — http://ci.aztec-labs.com/1778833995527371

==> Downloading sqlite3mc-2.2.4-sqlite-3.50.4-wasm.zip
curl: (6) Could not resolve host: release-assets.githubusercontent.com

The failure is unrelated to the code changes in #23296 — the script is hit on any fresh runner that doesn't have the vendored files cached.

Fix

Add retry flags to the curl call:

  • --retry 5 — up to 5 retries
  • --retry-delay 2 — 2s pause between retries
  • --retry-all-errors — retries on DNS/connect failures (curl error 6), which --retry alone skips

Requires curl ≥ 7.71.0 (CI image ships curl 8.5.0).
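For reference, a minimal sketch of the amended download line (the output and URL variable names are illustrative, not vendor.sh's actual ones):

```sh
# Sketch of the retry-enabled download in vendor.sh; $URL and $OUT are
# illustrative names. --retry-all-errors is what makes DNS/connect
# failures (curl exit 6) retryable; --retry alone would skip them.
curl -fsSL --retry 5 --retry-delay 2 --retry-all-errors -o "$OUT" "$URL"
```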

Verification

Ran bash yarn-project/sqlite3mc-wasm/scripts/vendor.sh ensure from a fresh state — download, SHA256 verification, extraction, and SHA256SUMS regeneration all succeeded. Full ./bootstrap.sh ci was not rerun locally; the original failure was a non-deterministic DNS event that doesn't reproduce.

Full analysis: https://gist.github.com/AztecBot/b9b0513711b45ba68e3a262fbf11792b

ClaudeBox log: https://claudebox.work/s/43a854d921f36a1a?run=1

fcarreiro and others added 26 commits May 13, 2026 17:31
…al pipeline (#23245)

## Summary

- Removes `FastTxCollection` as a separate class and absorbs all its
logic directly into `TxCollection`
- Replaces the old parallel file-store delay with a single sequential
pipeline: node RPC → reqresp → file store, where each phase blocks on
the previous (cancellation-aware)
- File store collection is now driven by `IRequestTracker` — the same
synchronization primitive used by node and reqresp paths. The tracker is
the single source of truth for "is this tx still missing?" and "is this
request still alive?"
- `FileStoreTxCollection` simplified: dropped
`start()`/`stop()`/persistent worker pool/`wakeSignal`.
`startCollecting(requestTracker, context)` returns `Promise<void>`,
spins up its own per-call worker pool, and workers self-terminate when
the tracker is cancelled (all-fetched / deadline / external)

## Collection flow inside `collectFast`

1. Start node RPC collection in the background
2. Wait `txCollectionFastNodesTimeoutBeforeReqRespMs` — interruptible by
cancellation **or by node exhaustion** (so when no nodes are configured,
reqresp starts immediately)
3. Start reqresp in the background (parallel with nodes)
4. Wait `txCollectionFileStoreFastDelayMs` — interruptible by
cancellation or reqresp completion
5. Start file store collection in the background (its workers
self-terminate)
6. `Promise.allSettled` on node + reqresp + file store

`txCollectionFileStoreFastDelayMs` description updated to reflect it is
now anchored to reqresp start, not collection start.
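The interruptible waits in steps 2 and 4 amount to racing a timer against a set of wake signals. A minimal sketch of that pattern, with illustrative names (the real pipeline's signals are the tracker's cancellation and phase-completion promises):

```ts
// Sketch: sleep for `ms`, but wake early if any of the given signals fires.
// Conceptually what steps 2 and 4 above do; names are illustrative.
async function interruptibleDelay(ms: number, ...wakeSignals: Promise<unknown>[]): Promise<void> {
  let timer: NodeJS.Timeout | undefined;
  const sleep = new Promise<void>(resolve => {
    timer = setTimeout(resolve, ms);
  });
  try {
    await Promise.race([sleep, ...wakeSignals]);
  } finally {
    if (timer) clearTimeout(timer); // don't leak the timer if a signal won the race
  }
}

// e.g. step 4: wait the file-store delay, but start early if reqresp
// completes or the request is cancelled (both promises assumed for illustration):
// await interruptibleDelay(fileStoreFastDelayMs, reqrespDone, cancelled);
```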

## File store / tracker integration

- `FileStoreTxCollection.startCollecting` no longer takes `(txHashes,
context, deadline)`; it takes `(requestTracker, context)` and reads the
missing txs + deadline from the tracker
- Workers check `requestTracker.isMissing(hash)` each scan — if the tx
was found via another path (node/reqresp/gossipsub), the entry is
dropped without an extra fetch
- Workers race their backoff sleeps against
`requestTracker.cancellationToken` — cancelling a request (deadline,
`stopCollectingForBlocksUpTo/After`, or `stop()`) propagates to file
store workers immediately
- Removed `foundTxs`/`clearPending` plumbing on `FileStoreTxCollection`
— the tracker handles both implicitly
- `startCollecting` yields once after building its entry set, so a
synchronous follow-up call (e.g. `markFetched` in tests, or the
gossipsub-found path in production) lands before workers begin scanning
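A hedged sketch of the worker scan loop those bullets describe (`isMissing` and `cancellationToken` are named above; everything else is illustrative):

```ts
// Sketch: a file-store worker that re-checks the tracker on every scan and
// races its backoff sleep against cancellation. Illustrative only.
async function workerLoop(
  tracker: { isMissing(hash: string): boolean; cancellationToken: Promise<void> },
  hashes: string[],
  tryFetch: (hash: string) => Promise<boolean>,
  backoffMs: number,
): Promise<void> {
  let cancelled = false;
  void tracker.cancellationToken.then(() => (cancelled = true));
  let pending = [...hashes];
  while (pending.length > 0 && !cancelled) {
    // Entries found via another path (node/reqresp/gossipsub) are dropped
    // without an extra fetch.
    pending = pending.filter(h => tracker.isMissing(h));
    for (const hash of pending) {
      if (cancelled) break;
      if (tracker.isMissing(hash)) await tryFetch(hash);
    }
    // Race the backoff sleep against cancellation so teardown is immediate.
    await Promise.race([new Promise(r => setTimeout(r, backoffMs)), tracker.cancellationToken]);
  }
}
```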

## Tests

- `tx_collection.test.ts`: collapsed the `TestFastTxCollection`
subclass; all accesses go directly through `TxCollection`. Added "starts
reqresp immediately when no nodes are configured" covering the
node-exhaustion shortcut
- `file_store_tx_collection.test.ts`: rewritten for the new shape — no
`start()`/`stop()`, lifecycle driven by the tracker (cancel to terminate
workers). New "workers exit when tracker is cancelled" covers the
per-call worker-pool teardown

Closes
https://linear.app/aztec-labs/issue/A-933/tx-collection-dont-retrieve-transactions-that-have-already-been
via new synchronization with the request tracker.
…ims (#23165)

## Context

`SequencerPublisher` simulates each enqueued L1 action individually at
enqueue time, then sends them bundled through Multicall3. The `propose`
checkpoint action is validated at enqueue and send time (the latter via
a `preCheck` mechanism), but in isolation and relying on overrides.
There is no simulation of the multicall payload before sending it, so a
reverting tx is most likely not caught.

This refactor:

- Replaces the per-request `preCheck` mechanism with a **single
bundle-level `eth_simulateV1`** of the assembled `aggregate3` payload,
run right before send. If any entry reverts in sim it is dropped from
the bundle, the reduced bundle is re-simulated to get an honest
`gasUsed`, and the survivors are sent. Extracted to a
`SequencerBundleSimulator`.
- Drops the entire propose simulate at enqueue (`simulateProposeTx`,
`validateCheckpointForSubmission`). The bundle simulate covers it.
- Adds a new pre-broadcast `validateBlockHeader` call (calling
`validateHeaderWithAttestations` with empty attestations +
`ignoreSignatures: true`) that catches header-level bugs before we
gossip the proposal to peers. Emits a new `header-validation-failed`
event on failure.
- Drops every per-action simulate at enqueue (governance signal **and**
slashing votes/executes). Bundle simulate at send time is the single
decision point for every per-action revert. `simulateAndEnqueueRequest`
is deleted. We were enqueuing votes even if the simulation failed, after
all.
- Rewrites `sendRequestsAt` so it takes an L2 `SlotNumber`, derives the
timestamp for the start of that slot, and sleeps until one L1 slot
before that boundary, so we can land on the first L1 slot of the target
L2 slot.
- Centralises `SimulationOverridesPlan` construction into a single
`buildCheckpointSimulationOverridesPlan` helper. The plan **always**
pins both `pending` and `proven` chain tips (to the pipelined parent /
invalidation target, or to the current snapshot when neither applies),
so `STFLib.canPruneAtTime` cannot reintroduce a phantom prune during
simulation.
- Makes `SimulationOverridesBuilder.merge` undefined-safe: explicit
`undefined` fields in an incoming plan no longer erase previously-set
values. `withPendingTempCheckpointLogFields` now accepts a partial
subset of fields.
- Moves the payload-empty cache onto `GovernanceProposerContract` next
to its concern. Only `isPayloadEmpty=false` is cached (a CREATE2
redeploy could go empty → populated).
- Drops the old Multicall3 revert-recovery and per-request-resim
machinery, since with `allowFailure: true` the top-level multicall is
expected to land successfully. `Multicall3.forward` now throws
`MulticallForwarderRevertedError` if the receipt reports a reverted
status; the publisher does **not** rotate to a new publisher on that
error (on-chain failure, not a send failure). Adds `Multicall3.hasCode`
helper and a `simulateAggregate3` entrypoint used by the bundle
simulator.
- `L1TxUtils.sendTransaction` fails fast if `txTimeoutAt` has already
elapsed when called. `SequencerPublisher.forwardWithPublisherRotation`
re-checks the deadline at the head of each rotation iteration so it
doesn't keep cycling through publishers after the L2 slot's submission
window has closed.
- Sequencer escape-hatch (`voteInSlotWithoutSyncing`) and
full-escape-hatch (`voteOnSlotWithEscapeHatch`) vote-only paths now
submit via `sendRequestsAt(slot)` rather than `sendRequests()`, so the
bundle-simulate `block.timestamp` override matches the slot the EIP-712
vote signatures were generated for.

The intended outcome is a publisher with one explicit re-validation
point (the bundle simulate), measurable bundle gas (from the bundle
simulate's `gasUsed`), and dead/duplicated state-override plumbing
removed.

## Resulting simulations after this refactor

The full list of simulation / gas-estimation steps that remain in a
pipelined proposer slot, in execution order.

### Pre-build, in `Sequencer.doWork`

1. **`publisher.canProposeAt`** — rollup view call simulated with the
centralised override plan. Cheap pre-check gate before any block-build
work.
2. **`publisher.simulateInvalidateCheckpoint`** (conditional) — runs
**only** if `syncedTo.pendingChainValidationStatus.valid === false` AND
`!syncedTo.hasProposedCheckpoint`. Simulates the invalidate call against
the rollup. Result becomes the `invalidateCheckpoint` package passed
into `CheckpointProposalJob`. The previous code called this even when
there's a proposed parent and discarded the result; this refactor adds
the `!hasProposedCheckpoint` gate so we skip the wasted RPC.

### Per-slot, in `CheckpointProposalJob.proposeCheckpoint`

3. **CheckpointVoter votes** — `CheckpointVoter.enqueueVotes()` runs at
the top of `execute()`, returning two promises that are awaited in
parallel with block-build. It enqueues two kinds of votes via the
publisher, **neither of which simulates at enqueue time** after this
refactor:
- **`enqueueGovernanceCastSignal`** — does an `isPayloadEmpty`
pre-flight check (now on `GovernanceProposerContract`), then enqueues.
No `eth_simulateV1`.
- **`enqueueSlashingActions`** (one call per slashing action, type
`vote-offenses` or `execute-slash`) — builds the request and enqueues.
No `eth_simulateV1`.

Real reverts on any of these are caught by the bundle simulate at send
time, which drops the failing entry and proceeds with the survivors.
4. **`publisher.validateBlockHeader` (NEW: pre-broadcast)** — replaces
the old `simulateProposeTx`-at-enqueue. Calls
`validateHeaderWithAttestations` with empty attestations and
`ignoreSignatures: true` so the rollup runs the header checks (archive
match, slot match, timestamp, mana-min-fee, …) without needing real
attestations. Runs **before** we gossip the proposal to peers. If it
fails, abort the slot — log an error, emit `header-validation-failed`,
don't broadcast, don't enqueue.
5. **`prepareProposeTx → validateBlobs estimateGas`** — kept as the
blob-commitment **consistency check** (detects locally-built commitments
not matching the blob sidecars). Returns `blobEvaluationGas`, which we
stash on the propose `RequestWithExpiry` for use by the bundle gasLimit
later. The simulate-step that previously paired with this
(`simulateProposeTx`) is removed.

### Background pipeline, in
`waitForAttestationsAndEnqueueSubmissionAsync`

6. **`publisher.simulateInvalidateCheckpoint` (conditional)** — runs
**only** in the fallback path where attestation collection failed AND
the pending chain turned out to be invalid. Triggered from
`CheckpointProposalJob.enqueueInvalidation`. This is the second, late
trigger for invalidation simulation — distinct from step 2's pre-build
trigger.

### Send time, in `sendRequestsAt(targetSlot)`

7. **Bundle simulate (NEW)** — single `eth_simulateV1` of the assembled
`aggregate3` payload, with `block.timestamp` overridden to the start of
`targetSlot`, and state overrides = `[disableBlobCheck]` iff `propose`
is in the bundle and `[]` otherwise. Per-entry result decoded from the
returned `Result[]`. This is the **only** post-pipeline-sleep
re-validation; it replaces the per-request `preCheck` mechanism
entirely.
8. **Bundle re-simulate (NEW, conditional)** — runs **only** when step 7
dropped at least one entry. Re-runs the bundle simulate on the reduced
payload to get an honest `gasUsed`, and applies the same per-entry
decode so additional drops are caught. If the re-simulate falls back
(node doesn't support `eth_simulateV1`), the publisher sends the
**first-pass survivors only** with `MAX_L1_TX_LIMIT`; the entries that
the first pass already proved would revert stay dropped and are reported
as failed actions.
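A minimal sketch of this two-pass flow, assuming a `simulate` helper that wraps `eth_simulateV1` and reports per-entry success (all types and names are illustrative):

```ts
// Sketch of the drop-and-resimulate logic in steps 7-8. Illustrative shapes.
interface Entry { description: string /* plus calldata, target, ... */ }
type Simulate = (bundle: Entry[]) => Promise<{ success: boolean; gasUsed: bigint }[]>;

async function simulateBundle(entries: Entry[], simulate: Simulate) {
  const first = await simulate(entries);
  const survivors = entries.filter((_, i) => first[i].success);
  const dropped = entries.filter((_, i) => !first[i].success);
  if (dropped.length === 0) {
    return { survivors, perEntry: first, dropped }; // first-pass gasUsed is already honest
  }
  // Re-simulate the reduced bundle so gasUsed reflects what will actually be
  // sent; any additional drops found here are also reported as failed actions.
  const second = await simulate(survivors);
  return {
    survivors: survivors.filter((_, i) => second[i].success),
    perEntry: second,
    dropped: [...dropped, ...survivors.filter((_, i) => !second[i].success)],
  };
}
```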

### Post-send

No diagnostic-only simulate paths remain. `Multicall3.forward` throws
`MulticallForwarderRevertedError` on a reverted receipt and re-throws on
a send error; per-request revert resimulation has been removed.

## Known caveats

- **`sendRequestsAt` early lead**: sleeps until `startOfTargetSlot -
ethereumSlotDuration` to maximise inclusion in the first L1 block of the
L2 slot. There is a known correctness risk: a tx mined in the L1 block
immediately preceding the L2-slot boundary would revert via
`ProposeLib.validateHeader`'s `slot ==
block.timestamp.slotFromTimestamp()` check. In practice the prior L1
block is usually already committed before this send wakes; if this proves
unreliable in production, tune the lead down, especially in tests.
- **`validateBlockHeader` pre-broadcast coverage**: covers the
`validateHeader` checks (archive, slot/timestamp, mana-min-fee, …) and
the empty-attestation path of `validateHeaderWithAttestations`, but does
NOT cover proposer-signature verification, inbox consumption
(`Rollup__InvalidInHash`), or `header.inHash` match. Those still execute
inside the full `propose` and are caught by the bundle simulate at send
time. The cost of a rare miss is one wasted broadcast.
- **Top-level `aggregate3` revert diagnostics removed**: the previous
`Multicall3.forward` code decoded receipt-reverted reasons via
`tryGetErrorFromRevertedTx` and did a per-request resim on send-throw.
Both paths are gone. With `allowFailure: true` and `Multicall3.hasCode`
covering the no-bytecode case, a reverted forwarder receipt is genuinely
unexpected (OOG, forwarder bug). The throw of
`MulticallForwarderRevertedError` is the only diagnostic surface —
operators will need the transaction hash from the log to investigate.
…ses (#23249)

## Motivation

The `RevertCode` and `TxExecutionResult` types each carried three
deprecated aliases (`APP_LOGIC_REVERTED`, `TEARDOWN_REVERTED`,
`BOTH_REVERTED`) that all collapse to the same `REVERTED` value. Keeping
them around adds noise, requires `no-duplicate-enum-values` eslint
suppressions, and lets new code keep reaching for the old names.

## Approach

Removed the deprecated members from both enums and rewrote every call
site to use `REVERTED` directly. Tests, fixtures, and a stale doc
reference were updated to match.

## Changes

- **stdlib**: Drop deprecated
`APP_LOGIC_REVERTED`/`TEARDOWN_REVERTED`/`BOTH_REVERTED` from
`RevertCode` and `TxExecutionResult`.
- **simulator, pxe, aztec.js, end-to-end (tests)**: Replace remaining
references with `REVERTED`.
- **simulator/docs**: Update a stale `APP_LOGIC_REVERTED` reference in
the public-tx-simulation doc.
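For context, the pre-change shape was roughly the following (a sketch, not the exact source):

```ts
// Before (sketch): deprecated aliases collapsing onto one value, each needing
// an eslint suppression for no-duplicate-enum-values.
enum TxResultBefore {
  OK = 0,
  REVERTED = 1,
  /** @deprecated use REVERTED */ APP_LOGIC_REVERTED = 1, // eslint-disable-line no-duplicate-enum-values
  /** @deprecated use REVERTED */ TEARDOWN_REVERTED = 1, // eslint-disable-line no-duplicate-enum-values
  /** @deprecated use REVERTED */ BOTH_REVERTED = 1, // eslint-disable-line no-duplicate-enum-values
}

// After (sketch): a single canonical member; call sites use REVERTED directly.
enum TxResultAfter {
  OK = 0,
  REVERTED = 1,
}
```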
…23259)

Addresses a config-timing race in
`epochs_invalidate_block.parallel.test.ts > "proposer invalidates
multiple checkpoints"` that caused intermittent CI failures with
`expect(validCount).toBeLessThan(quorum)` (e.g. 5/6 attestations when
quorum=5).

## The race

The test reads `currentSlot` via `monitor.run()` right after waiting for
the first checkpoint to land — that read can land anywhere within the
current L2 slot, including near its end. It then computes `badSlot1 =
currentSlot + 2` and races to push malicious config
(`skipCollectingAttestations: true`, …) to that slot's proposer via
`await node.setConfig({...})`.

`CheckpointProposalJob` is constructed with `this.config` passed by
reference (`sequencer-client/src/sequencer/sequencer.ts:559`), and
`Sequencer.updateConfig` reassigns `this.config = merge(...)` rather
than mutating, so a job built before `setConfig` lands keeps the old
config object. Under proposer pipelining
(`PROPOSER_PIPELINING_SLOT_OFFSET = 1`,
`epoch-cache/src/epoch_cache.ts:26`), the job for `badSlot1` is built
during the last L1 slot of L2 slot `badSlot1 - 1`. With 32s L2 slots and
8s L1 slots, that's ~24s into the previous L2 slot — so if `currentSlot`
was read late, badSlot1's proposer can snapshot the old config before
our `setConfig` round-trip completes.

## Fix

- Wait for an L2 slot boundary (`monitor.waitUntilNextL2Slot()`) before
reading `currentSlot`, so we start from the beginning of a slot rather
than wherever we happened to land.
- Bump the gap from `+2/+3` to `+3/+4` for a second slot of margin.

Cost is up to one additional L2 slot of test runtime in the worst case;
the existing 8-slot wait window for both checkpoints still fits.
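A sketch of the adjusted timing (helper names and signatures are assumptions):

```ts
// Sketch: align to a slot boundary before reading the current slot, then
// leave an extra slot of margin before the first malicious slot.
await monitor.waitUntilNextL2Slot();        // start from the top of a slot
const currentSlot = await getCurrentSlot(); // illustrative helper
const badSlot1 = currentSlot + 3n;          // was currentSlot + 2n
const badSlot2 = currentSlot + 4n;          // was currentSlot + 3n
await node.setConfig({ skipCollectingAttestations: true /* ... */ });
```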
`sendBatchRequest` became unused after removing the slow tx flow and the
old tx reqresp method.

This PR removes `sendBatchRequest` and cleans up code that became unused as a result.

It does NOT remove subprotocol validator registration/etc from reqresp.
This might be done in a follow-up depending on how
https://linear.app/aztec-labs/issue/A-1014/block-txs-reqresp-validator-validaterequestedblocktxs-is-never-invoked
becomes solved.
ProposalTxCollector doesn't exist anymore.

Clean up unused files.

Rename the bench that now only tests `BatchTxRequester`.
Remove the single-checkpoint-proposal map in favour of the "by hash"
variant.
Checks that inHash, archive, and sig ctx match. Should catch errors
during construction.
…23257)

Addresses [Phil's review
comment](#23165 (comment))
on #23165: uses the injected `DateProvider` instead of `new Date()` for
the pre-gas-estimation timeout check in `L1TxUtils.sendTransaction`, so
tests can drive the clock.
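A sketch of the change, with an assumed `DateProvider` shape:

```ts
// Sketch: fail fast when the timeout has already elapsed, using an injected
// clock so tests can drive time. The DateProvider shape here is an assumption.
interface DateProvider {
  now(): number; // epoch milliseconds
}

function checkTimeoutBeforeGasEstimation(dateProvider: DateProvider, txTimeoutAt?: Date): void {
  if (txTimeoutAt !== undefined && dateProvider.now() >= txTimeoutAt.getTime()) {
    throw new Error('Tx timeout elapsed before gas estimation'); // was compared against new Date()
  }
}
```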
## Motivation

The local network sandbox (`aztec start --local-network`) historically
ran without proposer pipelining, so the compose-routed e2e suite
(`src/composed/*`, `src/guides/*`, cli-wallet flows, docs examples,
playground) never exercised the pipelined sequencer path. Turning
pipelining on revealed that each L2 slot took a full real-time slot (~72
s) before the L1 multicall fired, blowing up sandbox boot from ~30 s to
~5 min, because the existing `AnvilTestWatcher` triggers don't fire in
the pipelined-publish window.

## Approach

First commit flips `SEQ_ENABLE_PROPOSER_PIPELINING=true` on the three
sandbox-test compose envs so every compose-routed test runs through the
pipelined path. Second commit teaches `AnvilTestWatcher` about the
proposer's target slot by hooking the sequencer's `block-proposed` event
in `createLocalNetwork`; when the proposer has built a block destined
for a slot beyond L1, the watcher warps L1 (and, via `cheatcodes.warp`,
the injected date provider) forward, waking the pipelined publisher's
`sendRequestsAt` sleep and the upstream
`waitForValidParentCheckpointOnL1` wait. `block-proposed` is used rather
than the cleaner `state-changed → PUBLISHING_CHECKPOINT` because the
latter only fires *after* `waitForValidParentCheckpointOnL1` unblocks —
which is what we are trying to break — so it would be circular.

## Changes

- **yarn-project/end-to-end, docs/examples/ts, playground (compose)**:
add `SEQ_ENABLE_PROPOSER_PIPELINING=true` to the `local-network` /
`aztec` service env so every compose-routed sandbox test runs pipelined.
- **yarn-project/aztec (`AnvilTestWatcher`)**: new
`setProposedTargetSlot` setter and a `warpTimeIfNeeded` branch (gated on
`isLocalNetwork`) that warps L1 to the target slot's timestamp when it's
ahead of L1.
- **yarn-project/aztec (`createLocalNetwork`)**: subscribe to the
sequencer's `block-proposed` event and forward `slot` to the watcher.

Verified locally: sandbox boot drops from ~5 min back to ~27 s under
pipelining, and `e2e_local_network_example.test.ts` (both tests) passes
in ~33 s.
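A sketch of the new wiring (event payload and signatures are assumptions):

```ts
// Sketch: forward the proposer's target slot from the block-proposed event
// to the watcher. Payload shape is an assumption.
sequencer.on('block-proposed', ({ slot }: { slot: bigint }) => {
  watcher.setProposedTargetSlot(slot);
});

// Watcher side (sketch): warp L1 to the target slot's timestamp when the
// proposer is ahead of L1. Gated on isLocalNetwork in the real change.
async function warpTimeIfNeeded(
  targetSlot: bigint | undefined,
  l1Timestamp: bigint,
  timestampOfSlot: (slot: bigint) => bigint,
  warp: (ts: bigint) => Promise<void>, // cheatcodes.warp also advances the injected date provider
): Promise<void> {
  if (targetSlot === undefined) return;
  const targetTs = timestampOfSlot(targetSlot);
  if (targetTs > l1Timestamp) await warp(targetTs);
}
```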
… pipelining (#23302)

## Summary

Fixes the `e2e_p2p_broadcasted_invalid_block_proposal_slash` failure
that has been blocking the `merge-train/spartan` train (run
https://github.com/AztecProtocol/aztec-packages/actions/runs/25896899879,
test log http://ci.aztec-labs.com/2bf4e2cd2d9e7944).

The test creates the malicious proposer first (auto-starting its
sequencer) and only later creates the honest nodes and waits for P2P
mesh. Under `enableProposerPipelining: true` (turned on for this test by
#23070), the malicious proposer is selected for the very next slot,
builds + broadcasts the invalid proposal one slot ahead, and lands the
broadcast before the honest validators have joined the mesh. They then
reject it at the gossipsub `checkpoint_proposal_validator` with
`Penalizing peer for invalid slot number` (since their target slot has
already moved past), so the `state_mismatch` slashing path never runs.
The malicious sequencer then gets stuck on the failed publish (`Awaiting
pending L1 payload submission`) and never proposes again before the test
times out on `awaitOffenseDetected`.

This is the same race that #23070 fixed in
`duplicate_proposal_slash.test.ts`; the same pattern is applied here:

- Create both the invalid proposer and the honest nodes with
`dontStartSequencer: true`.
- After P2P mesh connectivity + committee formation, use
`advanceToEpochBeforeProposer` to land one epoch before an epoch where
the invalid proposer is scheduled.
- Start all sequencers, then `advanceToEpoch(targetEpoch, { offset:
-AZTEC_SLOT_DURATION })` so the malicious slot fires while every node is
online and at the same wall-clock slot.
- After `awaitOffenseDetected` on one node, poll `getSlashOffenses`
across **all** nodes for `BROADCASTED_INVALID_BLOCK_PROPOSAL` — under
pipelining a given receiver may have already advanced past the build
slot when the proposal arrives, so we need to catch whichever node was
still in the build slot.

The on-chain slash assertion (`rollup.listenToSlash`) is preserved
unchanged.
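As a sketch, the reworked choreography (helper names are the ones cited above; signatures are assumed):

```ts
// Sketch of the test ordering fix; helper names from the description,
// exact signatures assumed.
const invalidProposer = await createNode({ dontStartSequencer: true });
const honestNodes = await createNodes({ dontStartSequencer: true });
await waitForP2pMesh([invalidProposer, ...honestNodes]); // illustrative helper
await advanceToEpochBeforeProposer(invalidProposer);     // land one epoch early
await startAllSequencers([invalidProposer, ...honestNodes]);
await advanceToEpoch(targetEpoch, { offset: -AZTEC_SLOT_DURATION });
// Then: awaitOffenseDetected on one node, and poll getSlashOffenses across
// all nodes for BROADCASTED_INVALID_BLOCK_PROPOSAL.
```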

Full failure analysis:
https://gist.github.com/AztecBot/39b69c1117f419145938ccd2c198f8e9

## Test plan

- CI: `e2e_p2p_broadcasted_invalid_block_proposal_slash` passes on
`merge-train/spartan`.
- Local `./bootstrap.sh ci` / `fast` / `build` are not runnable in this
container (no Docker socket and `$HOME` not writable for the container
UID — `yarn install` fails on `corepack` mkdir, parallel-bootstrap can't
create `~/.parallel`). Fix is a direct port of a pattern already
shipping green on `next` via the sibling
`duplicate_proposal_slash.test.ts`.


ClaudeBox log: https://claudebox.work/s/06a4929a1971beaf?run=1
Prevents the archiver from reporting invalid L2 tips by querying all
chain tips within a db transaction. Moves the responsibility of
assembling the tip data to the block store itself to minimize the number
of queries to the db. Clamps proven and finalized tips such that an
incorrect L1 sync still results in finalized <= proven <= checkpoints.
And adds explicit assertions that tips are ordered.

Also adds a guard in the tips store that prevents from deleting block
hashes that are still alive by a given chain tip, instead of assuming
that the finalized chain tip is always the oldest one.

This should catch errors where the block stream breaks due to a
finalized chain tip running ahead of a proven chain tip.

Note that this PR does NOT enforce ordering at the L2Tips struct itself,
since consumers (i.e. the ones that report the "local" chain tips) may
break this contract (see A-1061). This PR is a simpler alternative to
#22964. Fixes A-1018.
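The clamping invariant, as a sketch (names illustrative):

```ts
// Sketch: clamp tips so finalized <= proven <= checkpoints holds even when an
// incorrect L1 sync reports them out of order, then assert the ordering.
function clampTips(checkpoints: bigint, rawProven: bigint, rawFinalized: bigint) {
  const proven = rawProven < checkpoints ? rawProven : checkpoints;
  const finalized = rawFinalized < proven ? rawFinalized : proven;
  if (!(finalized <= proven && proven <= checkpoints)) {
    throw new Error('Chain tips out of order'); // explicit ordering assertion
  }
  return { checkpoints, proven, finalized };
}
```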
…pelining (#23296)

## Problem

`Sequencer.tryVoteWhenEscapeHatchOpen` constructed `CheckpointVoter`
with the wall-clock `slot` and called `publisher.sendRequestsAt(slot)`.
Under proposer pipelining we are the elected proposer for `slot + 1`
(`targetSlot`), and the multicall is expected to mine in `targetSlot`.
`EmpireBase.sol::_internalSignal`:

- Verifies the EIP-712 digest against the **mining-slot** signature
- Checks `msg.sender == getCurrentProposer()` for the **mining slot**

Both fail under pipelining because we're the proposer for `targetSlot`,
not `slot`. The multicall reverts silently inside Multicall3 and every
governance/slashing entry is dropped.

## Fix

Thread `targetSlot` through `tryVoteWhenEscapeHatchOpen` and use it for
both:

- `CheckpointVoter` (binds the EIP-712 signature to `targetSlot`)
- `publisher.sendRequestsAt(targetSlot)` (delays submission so the tx
mines in `targetSlot`)

This mirrors `tryVoteWhenSyncFails` and `CheckpointProposalJob.execute`,
which already use `targetSlot` correctly. When pipelining is disabled
`targetSlot == slot` (from
`epochCache.getTargetEpochAndSlotInNextL1Slot()`), so `sendRequestsAt`
resolves with no extra sleep and the legacy behaviour is preserved.
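A sketch of the threading fix (the epoch-cache call is named above; its return shape and the voter construction are assumptions):

```ts
// Sketch: bind both the vote signature and the send schedule to the
// pipelined target slot. Construction details are illustrative.
const { slot, targetSlot } = await epochCache.getTargetEpochAndSlotInNextL1Slot();
const voter = new CheckpointVoter(/* ..., */ targetSlot); // EIP-712 sig bound to targetSlot
await publisher.sendRequestsAt(targetSlot);               // was: sendRequestsAt(slot)
// With pipelining disabled, targetSlot === slot, so behaviour is unchanged.
```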

## Showcase

Re-enables `e2e_sequencer/escape_hatch_vote_only.test.ts` with
`enableProposerPipelining: true` and `inboxLag: 2`. The test asserts
`finalStats.votes >= slotsPassed` over the escape-hatch window — this
assertion fails without the fix because no votes ever land.

Test-side adjustments for the pipelined timing model:

- Move event listener attachment to **after** the warp into the
escape-hatch epoch. Checkpoint proposals in flight at warp time fail
their L1 propose tx and are setup-warp artifacts, not vote-only window
failures.
- Snapshot `slotAtMeasurement` for the vote-count lower bound, then wait
for the L1 slot to advance two more so the trailing vote (signed in
build slot N for target slot N+1) has time to mine before counting.
## Motivation

Under proposer pipelining, the checkpoint job opens a world-state fork
with `closeDelayMs: 12_000`. If a pending-chain unwind or historical
prune destroys that fork on the C++ side before the delay fires,
`DELETE_FORK` rejects with `"Fork not found"`, producing a stray warn
log and leaking the per-fork queue entry in the JS instance — one dead
entry per affected pipelined slot.

## Approach

Make `close()` idempotent via an in-flight `closePromise`, and treat
`"Fork not found"` as benign on close (same precedent as the existing
`"Native instance is closed"` suppression — fork IDs are monotonic and
never reused). Also wrap the per-fork queue cleanup in `try/finally` in
both the native and IPC instances so the JS-side queue map cannot
outlive the native fork on error.
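A sketch of the idempotent-close pattern described here (class and method shapes are assumed):

```ts
// Sketch: cache the in-flight close and swallow benign errors. Fork IDs are
// monotonic and never reused, so "Fork not found" on close means the fork
// was already destroyed by an unwind/prune and can be ignored.
class ForkFacadeSketch {
  private closePromise?: Promise<void>;

  constructor(private readonly doClose: () => Promise<void>) {}

  close(): Promise<void> {
    this.closePromise ??= this.doClose().catch((err: unknown) => {
      const msg = String(err);
      if (!msg.includes('Fork not found') && !msg.includes('Native instance is closed')) {
        throw err; // anything else is a real failure
      }
    });
    return this.closePromise;
  }
}
```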

## Changes

- **world-state**: `MerkleTreesForkFacade.close()` is now idempotent and
swallows `"Fork not found"`; per-fork queue cleanup in
`NativeWorldStateInstance` and `IpcWorldStateInstance` moved to
`finally`.
- **world-state (tests)**: Regression test that disposes a
`closeDelayMs` fork, triggers an unwind that destroys it on the C++
side, and asserts no warn is logged and the queue entry is cleaned up.

Fixes A-1055
@AztecBot added the ci-draft (Run CI on draft PRs) and claudebox (Owned by claudebox; it can push to this PR) labels on May 15, 2026.
@ludamad force-pushed the merge-train/spartan branch from 4af2626 to db4ec58 on May 16, 2026 at 19:07.