rpc/jsonrpc, node/ethconfig: enable FcuBackgroundCommit by default#21293
rpc/jsonrpc, node/ethconfig: enable FcuBackgroundCommit by default#21293yperbasis wants to merge 11 commits into
Conversation
Routes head-sensitive RPC reads (ReadCurrentHeader, ReadHeadHeaderHash, GetLatestBlockNumber) through the SharedDomains overlay via existing Filters.WithOverlay/WithTemporalOverlay, so the FCU response can return before the MDBX commit lands without RPC consumers seeing stale chain heads. Coherent cache for remote rpcdaemon already receives pre-commit StateChanges and serves the matching state-version root. See #21008 for the motivating regression (~+80ms p50 FCU latency + multi-second burst tails on main vs release/3.4). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Enables FcuBackgroundCommit by default and updates JSON-RPC handlers to read head-sensitive data through the SharedDomains overlay so RPC responses can reflect the new FCU head before the MDBX commit/fsync lands.
Changes:
- Flip
ethconfig.Defaults.FcuBackgroundCommittotrueand update the rationale comment to reflect current overlay + notification/coherent-cache wiring. - Wrap a set of head-sensitive
rawdb.Read*/rpchelper.Get{Latest,Safe,Finalized}BlockNumbercall sites withfilters.WithOverlay/WithTemporalOverlay. - Minor call-site refactors (introducing
overlayTxlocals) to reuse overlay-backed transactions.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| rpc/jsonrpc/txpool_api.go | Read current header via overlay for txpool content endpoints. |
| rpc/jsonrpc/trace_filtering.go | Use overlay-backed head hash/number when ToBlock is omitted. |
| rpc/jsonrpc/parity_api.go | Read current block number via overlay for parity storage-key listing. |
| rpc/jsonrpc/overlay_api.go | Use overlay-backed latest block number when capping ranges. |
| rpc/jsonrpc/graphql_api.go | Resolve “latest” block number via overlay for GraphQL API. |
| rpc/jsonrpc/eth_txs.go | Read current header via overlay for pending-tx responses. |
| rpc/jsonrpc/eth_system.go | Use temporal overlay for gas oracle backend and overlay for head header reads (base fee/blob base fee). |
| rpc/jsonrpc/eth_simulation.go | Compare requested block to overlay-backed latest head to avoid “future block” false positives. |
| rpc/jsonrpc/eth_receipts.go | Use overlay-backed latest block number in max-uint32 capping path. |
| rpc/jsonrpc/eth_call.go | Use overlay-backed latest block number when building proof/witness preconditions. |
| rpc/jsonrpc/eth_block.go | Use overlay-backed latest block number for transaction-count-by-number guard. |
| rpc/jsonrpc/erigon_receipts.go | Use overlay-backed latest block number for log range defaults. |
| rpc/jsonrpc/erigon_block.go | Read current header via overlay for timestamp-based block lookup. |
| rpc/jsonrpc/debug_execution_witness.go | Use overlay-backed latest block number when building expected post-state. |
| rpc/jsonrpc/debug_api.go | Use overlay-backed latest block number for debug_setHead baseline. |
| rpc/jsonrpc/bor_api_impl.go | Use overlay-backed latest header/number in several Bor RPCs. |
| node/ethconfig/config.go | Default FcuBackgroundCommit to true and expand explanatory comment. |
Comments suppressed due to low confidence (4)
rpc/jsonrpc/parity_api.go:80
bnis read through the overlay, but subsequent reads (_txNumReader.Minandtx.RangeAsOf) still use the original temporal tx. During a background-commit window this can leaveparity_listStorageKeysquerying state/txnums from the committed DB while targeting the overlay head number, producing stale or inconsistent results. Consider wrapping the temporal tx once (e.g., viafilters.WithTemporalOverlay) and using that wrapped tx consistently for the latest-state reader, txnum lookup, and theRangeAsOfscan.
bn := rawdb.ReadCurrentBlockNumber(api.filters.WithOverlay(tx))
minTxNum, err := api._txNumReader.Min(ctx, tx, *bn)
if err != nil {
return nil, err
}
rpc/jsonrpc/trace_filtering.go:351
toBlockis now derived from an overlay-backed head hash/number, but the rest of the method still passes the originaldbtxintofilterV3, which computes txnum bounds and scans indexes against the committed DB view. This means the requested range can reference an overlay head while the underlying trace scan is still anchored to the pre-commit state (and the block-range limit check uses the overlay height). Consider using a temporal overlay tx (filters.WithTemporalOverlay(dbtx)) and passing that through tofilterV3(and any txnum/index reads) so the head number and the scanned data come from the same view.
if req.ToBlock == nil {
overlayTx := api.filters.WithOverlay(dbtx)
headNumber, err := api._blockReader.HeaderNumber(ctx, overlayTx, rawdb.ReadHeadHeaderHash(overlayTx))
if err != nil {
return err
}
toBlock = *headNumber
rpc/jsonrpc/bor_api_impl.go:111
latestBlockNumis resolved using an overlay-backed tx, but the subsequentHeaderByNumbercall still uses the originaltx. During background commit this can cause the latest header lookup to miss the overlay head (returningerrUnknownBlockeven though the overlay has the header). Consider reusing the same overlay-wrapped tx for theHeaderByNumber/HeaderByHashreads in the “latest” path.
//nolint:nestif
if blockNrOrHash == nil {
latestBlockNum, err2 := rpchelper.GetLatestBlockNumber(api.filters.WithOverlay(tx))
if err2 != nil {
return accounts.NilAddress, err2
}
header, err = api._blockReader.HeaderByNumber(ctx, tx, latestBlockNum)
} else {
rpc/jsonrpc/eth_block.go:383
- This method now uses an overlay-backed tx to compute
latestBlockNumber, but it still reads the block body/tx count through the originaltxlater in the function. In a background-commit window,blockNumcan be the overlay head while_blockReader.Bodyon the committed tx returns nil, so the RPC may still returnnullforlatest. Consider using the same overlay-wrapped view for the subsequent body read when serving head-sensitive queries.
latestBlockNumber, err := rpchelper.GetLatestBlockNumber(api.filters.WithOverlay(tx))
if err != nil {
return nil, err
}
if blockNum > latestBlockNumber {
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The skip dated to PR #20195 when bg-commit had a documented "FCU N+1 reads stale state from DB" race. The actual race is in the tester helper, not the FCU: insertPoSBlocks waits on pre-commit state-change events from the gRPC stream, then InsertChain opens an roTx and reads ReadHeader/HeadBlockHash before the bg goroutine has committed. Fix in InsertChain itself by calling ExecModule.WaitIdle (acquires then releases the FCU semaphore — the bg goroutine releases it only after commit). With foreground commit the semaphore is already free, so WaitIdle is an instant no-op. FCU sequencing was never the issue: forkchoice.go:158's TryAcquire + retryBusy-on-Busy serialises consecutive FCUs against the prior bg goroutine. The test passes under -race. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…BlockHash When erigon runs in import mode under FcuBackgroundCommit=true, the post-UpdateForkChoice write at the end of import_cmd.go::InsertChain races the bg-commit goroutine: it can land WriteHeadBlockHash(lvh) in MDBX before the bg goroutine flushes the overlay containing Headers / BlockHash entries for the same block. Result on next startup: ReadHeadBlockHash returns the hash, HeaderNumber(hash) returns nil, BlockReader.CurrentBlock dereferences the nil and panics (freezeblocks/block_reader.go:1409). This made hive ethereum/rpc-compat fail at runner startup of the second erigon process. The existing wait on the state-change stream only ensures the dispatcher fired (events are dispatched pre-commit from the overlay), not that MDBX has flushed. Add ExecutionModule().WaitIdle() before the Update — WaitIdle acquires/releases the FCU semaphore, blocking until the bg goroutine completes its flush+commit. No-op when bg-commit is off. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Manually dispatched the three CI workflows that were skipped because the PR is in draft state:
Why I want them: this PR flips |
Copilot review caught a class of bugs in the FcuBackgroundCommit migrations where the head lookup was overlay-wrapped but the dependent read on the same tx was not. During the ~50ms bg-commit window the head returns N+1 (from the SD overlay) while MDBX is still at N, so the dependent read references a head the rest of its view can't see. - bor_api_impl.go (GetAuthor): pass the same overlayTx to _blockReader.HeaderByNumber. Previously latestBlockNum was N+1 but HeaderByNumber read MDBX for N+1's canonical hash → nil → errUnknownBlock. - eth_block.go (GetBlockTransactionCountByNumber): pass overlayTx to _blockReader.Body. Without this, blockNum can be N+1 while Body returns nil from committed MDBX, dropping the RPC to (nil, nil). - parity_api.go (parity_listStorageKeys): revert to plain tx. There is no good overlay-aware version: the block overlay exposes table writes (TxNums included) but not the SD domain mem batch, so a RangeAsOf over kv.StorageDomain would still bypass the pending storage writes for an overlay-derived block number, yielding an inconsistent view. - trace_filtering.go: revert the toBlock-via-overlay change. filterV3 scans receipts/logs against dbtx; routing only the toBlock through the overlay leaves the upper bound past what the scan can see. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sweep of the remaining migrated callsites for the same head-vs-data inconsistency Copilot flagged in the first review pass: the head lookup was routed through the SD overlay but the downstream reads (log scans, witness/proof builds, setHead, blockWithSenders, binary search) still operate on the plain committed tx, so during the bg-commit window the function references a head its own follow-up reads cannot see. Reverted to plain tx (consistent committed view) in: - eth_call.go (GetProof / CreateAccessList witness paths) - overlay_api.go and erigon_receipts.go and eth_receipts.go (log range upper bounds — getLogsV3 scans against plain tx; eth_getLogs additionally guards on GetLatestExecutedBlockNumber(tx) which would otherwise fire "node is still syncing" for an overlay-derived upper bound) - eth_simulation.go (guard followed by blockWithSenders on plain tx) - debug_api.go (debug_setHead guard; SetHead acts on committed DB) - debug_execution_witness.go (latestBlock gates a branch that reads txnums + seeks commitment on plain tx) - erigon_block.go (GetBlockByTimestamp binary search uses HeaderByNumber on plain tx) Kept overlay-aware where the function uses the result in-memory and does no dependent DB read, or where the dependent read was extended in the previous commit (bor_api_impl.go GetAuthor, eth_block.go GetBlockTransactionCountByNumber): eth_system.go BlockNumber/GasPrice/BaseFee/BlobBaseFee (in-memory) and the GasPriceOracleBackend tx (wrapped once at construction so all b.tx reads are overlay-aware), bor_api_impl.go header-only paths, txpool_api.go Content/ContentFrom, eth_txs.go pool fallback, graphql_api.go GetLatestBlockNumber, trace_filtering.go was already reverted in the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Re-dispatched the three skipped workflows against HEAD
|
Pre-existing latent panic: rawdb.ReadCurrentBlockNumber returns *uint64 which is nil when HeadHeaderHash isn't set (fresh DB / pre-head state), and the *bn dereference on the next line would crash the RPC handler. Surface a user-facing error instead. Flagged by Copilot review of the nearby diff in 79ce7d9. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eth_getLogs, trace_filter.Filter, and overlay_api.getBeginEnd previously resolved the implicit `latest` reference through nil filters (plain tx, returns N-1 during the bg-commit window) but routed user-passed ToBlock/FromBlock="latest" through api.filters (overlay-aware, returns N). The resulting `end > latest` guard would false-positive errBlockRangeIntoFuture for ~50ms after every FCU. Pass nil filters consistently in the explicit-tag resolution so the upper bound, the lower bound, and the underlying getLogsV3/filterV3 scan all agree on the committed view. Pre-existing in foreground-commit mode too, but only observable during the brief commit window when the CL hadn't yet received the FCU response; becomes user-facing with FcuBackgroundCommit=true defaulting on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ByTimestamp and debug_setHead Two callsites previously reverted to plain tx in 79ce7d9 only touch block tables (no SD-temporal reads), so overlay-aware reads are fully consistent with their dependent reads. - erigon_block.GetBlockByTimestamp: wrap tx once at the top; route ReadCurrentHeader, the inner _blockReader.HeaderByNumber binary-search probe, and the three buildBlockResponse callsites through overlayTx. - debug_setHead: read currentHead through the overlay so debug_setHead(N) during the bg-commit window doesn't false-positive "block N is in the future" for what is effectively a no-op rollback. The SD-temporal reverts (eth_call/getProof, log scans, simulation, commitment seek, parity_listStorageKeys) still need the SD-aware temporal view tracked in #21314. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mmit semantics and known limitation Document the trade-offs of routing head reads through the SD block overlay during the bg-commit window: - node/ethconfig/config.go: expand the FcuBackgroundCommit doc block to cover the FCU semaphore sequencing, embedded vs. remote rpcdaemon semantics, and the eth_call(latest) header-vs-state lag (~50ms). Link to #21314 for the proper SD-aware view follow-up. - rpc/jsonrpc/eth_call.go: note that BlockOverlayTemporalTx wraps table reads but delegates temporal methods to roTx; link to #21314. - rpc/jsonrpc/eth_simulation.go: comment was wrong — blockWithSenders auto-wraps. The real reason latest stays on plain tx is NewSharedDomains ties the simulator to the plain tx's domain state. - rpc/jsonrpc/eth_system.go: explain why GasPriceOracleBackend.Fork opens a non-overlay-wrapped tx (downstream helpers re-wrap internally). - execution/execmodule/exec_module_test.go: replace stale "subsequent blocks may fail validation" comment on TestNotificationDispatchBackgroundCommit — the semaphore serializes FCUs so N+1 always reads N's committed state. - cmd/utils/flags.go + docs (configuring-erigon.mdx, llms-full.txt): update --fcu.background.commit usage and default; flag a concise rpcdaemon caveat in the description. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ay safe-close invariant and assert memStore backing RPC readers acquire views via Filters.WithOverlay / WithTemporalOverlay / SharedDomains.BlockOverlayTemporalTx that share memTx with the published SD's BlockOverlay. The FCU bg-commit goroutine closes that SD while those readers are still iterating, and the only thing keeping the pattern safe today is that the BlockOverlay is backed by membatchwithdb.NewMemoryBatch — a pure-Go memStore whose Rollback/Close are no-ops on the in-memory data (memory_store.go), and that views hold their own caller-supplied backing tx. Both properties were load-bearing but undocumented; flagged in the FcuBackgroundCommit=true review as a latent risk. Make the invariant explicit: - MemoryMutation.Rollback (memory_mutation.go): comment explaining the no-op semantics for memStore-backed batches and the consequence for concurrent read views. - MemoryMutation.NewReadView (memory_mutation.go): concurrency note pointing to newReadViewMut. - MemoryMutation.newReadViewMut (memory_mutation.go): runtime panic if m.memTx is not *memStore. Catches the case where someone swaps NewMemoryBatch for NewMemoryBatchMDBX (which DOES invalidate cursors on Rollback) and forgets that the safe-concurrent-close property vanishes. The panic message points at the required follow-up (refcount/drain) for any MDBX-backed overlay with concurrent readers. - SharedDomains.Close (domain_shared.go): comment walking through each of sd.mem / sd.blockOverlay / sd.sdCtx and why concurrent RPC views remain safe under each close path. - Filters.WithOverlay / WithTemporalOverlay (filters.go): public-API documentation of the safe-concurrent-close property and the standard caller-tx-lifecycle constraint that still applies. No behaviour change for the supported (memStore-backed) path; the panic fires only if an unsupported backing store is wired in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 27 out of 27 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (4)
rpc/jsonrpc/bor_api_impl.go:115
- When
blockNrOrHash.Number()is a tag (e.g. "latest", "pending", "safe", "finalized"), converting it touint64yields a huge number and results in an incorrect lookup. This code path should resolve tags explicitly (e.g. viarpchelper.GetBlockNumberor by handlingrpc.LatestBlockNumberetc.) before callingHeaderByNumber.
header, err = api._blockReader.HeaderByNumber(ctx, overlayTx, latestBlockNum)
} else {
if blockNr, ok := blockNrOrHash.Number(); ok {
header, err = api._blockReader.HeaderByNumber(ctx, tx, uint64(blockNr))
} else {
rpc/jsonrpc/eth_receipts.go:162
- Calling
rpchelper.GetBlockNumber(..., nil)can panic ifcrit.ToBlockis thependingtag:_GetBlockNumberdereferencesfiltersin thePendingBlockNumbercase (filters.LastPendingBlock()). Consider explicitly rejectingpendinghere (committed-only scan), or ensure pending resolution is handled safely.
if toBlock > 0 {
end = uint64(toBlock)
} else {
blockNum := rpc.BlockNumber(toBlock)
end, _, _, err = rpchelper.GetBlockNumber(ctx, rpc.BlockNumberOrHashWithNumber(blockNum), tx, api._blockReader, nil)
rpc/jsonrpc/overlay_api.go:624
rpchelper.GetBlockNumber(..., nil)can panic on thependingtag because_GetBlockNumbercallsfilters.LastPendingBlock()without a nil check. Ifcrit.ToBlockcan bepending, either validate/reject it up front or route pending resolution through a non-nil filters implementation that doesn't enable overlay-basedlatest/safe/finalized.
toBlock := crit.ToBlock.Int64()
if toBlock > 0 {
end = uint64(toBlock)
} else {
blockNum := rpc.BlockNumber(toBlock)
end, _, _, err = rpchelper.GetBlockNumber(ctx, rpc.BlockNumberOrHashWithNumber(blockNum), tx, api._blockReader, nil)
if err != nil {
rpc/jsonrpc/trace_filtering.go:357
rpchelper.GetBlockNumber(..., nil)can panic ifreq.ToBlockispending, because_GetBlockNumberdereferencesfiltersin thePendingBlockNumbercase (filters.LastPendingBlock()). Consider rejectingpendinghere or ensure pending resolution is handled safely without a nil filters pointer.
}
toBlock = *headNumber
} else {
toBlock, _, _, err = rpchelper.GetBlockNumber(ctx, *req.ToBlock, dbtx, api._blockReader, nil)
if err != nil {
85d168b routed eth_getLogs / erigon_getLogs / overlay.* / trace_filter explicit-tag resolution through rpchelper.GetBlockNumber with nil filters, so the upper bound stays on the same committed view that the underlying scan uses. But _GetBlockNumber's PendingBlockNumber branch dereferenced filters.LastPendingBlock() without a nil check — a latent crash that was unreachable while every caller passed non-nil filters. The pre-existing nil-filters call at eth_receipts.go:127 used the LatestExecutedBlockNumber tag and never hit this branch; my new nil-filters calls forward user-supplied tags, so a user passing FromBlock/ToBlock="pending" would now panic. Fall back to plainStateBlockNumber when filters is nil — same as the existing path when filters is set but holds no pending block — which is the right semantics for the log-range scan callsites (they can only see committed data anyway). Caught by the Copilot reviewer on the latest PR push (29db691... review of ab060ab). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mh0lt
left a comment
There was a problem hiding this comment.
I have a PR which should land tomorrow which removes the additional semaphore we have arounf the commit in background mode which make the saync commit slower for fcu. When I did this I found a few rpc methods still where not reading the right overlay.
You may want to wait for my PR before merging this. I'll add a pr no once I raise it.
Summary
Enables
FcuBackgroundCommitby default so the FCU response returns to the consensus client before the MDBX flush+commit lands. The commit runs in a background goroutine; the FCU semaphore still serializes successive FCUs, so FCU N+1 always reads FCU N's committed state.Motivated by #21008 — local profiling on mainnet at tip vs
release/3.4showed p50 FCU latency regressed by ~+80 ms (mean ~+90 ms, p99 ~+160 ms) with wider burst tails. Background commit decouples the FCU response from the MDBX fsync so the CL gets ACK on the overlay-published head rather than the disk-flushed head.The regression itself was caused by #19961 moving TxLookup / Senders / Finish into the synchronous FCU path via
PipelineExecutor.RunLoop. Background commit masks this by hiding the fsync from the FCU response; under disk-busy bursts those stages still execute on the FCU path ande.semaphorewill block the next FCU on the still-running prior bg goroutine. The root-cause fix (move those stages back to a backgroundStageLoopequivalent torelease/3.4's shape, per #21008 comment) is out of scope here.Changes
Flag flip —
node/ethconfig/config.go,cmd/utils/flags.go, docs. FlipFcuBackgroundCommitdefault totrue. Theconfig.godoc block covers the FCU semaphore sequencing, embedded vs. remote rpcdaemon semantics, and the known limitation below. Flag usage and the configuring-erigon docs carry a one-line caveat (remote rpcdaemon 'latest' lags ~50 ms, stays consistent).Head-sensitive RPC reads — split by what the dependent reads can see:
Overlay-aware (head + dependent block-table reads both visible in the published SD's
BlockOverlay):bor_api_impl.go(GetSnapshot, GetAuthor, GetSigners, getLatestBlockNum, GetSnapshotProposer, GetSnapshotProposerSequence),eth_block.goGetBlockTransactionCountByNumber,eth_system.goGasPrice/BaseFee/BlobBaseFee/MaxPriorityFeePerGas/FeeHistory (including theGasPriceOracleBackendtx, wrapped once at construction),eth_txs.goGetTransactionByHash pool fallback,graphql_api.goGetLatestBlockNumber,txpool_api.goContent/ContentFrom,debug_api.goSetHead,erigon_block.goGetBlockByTimestamp.Plain tx (dependent reads use SD-temporal data — domain/history/inverted-index — which the block overlay does not expose):
eth_call.gogetProof/getWitness,eth_simulation.goSimulateV1,eth_receipts.go/erigon_receipts.goGetLogs,overlay_api.gogetBeginEnd,trace_filtering.goFilter,debug_execution_witness.gobuildExpectedPostState,parity_api.goListStorageKeys (with a nil guard onReadCurrentBlockNumber). For these, resolvinglatestthrough the overlay would land on N while the dependent reads can only reach N-1 — silently wrong, not just lagging. Block-tag resolution in these files passesnilfilters torpchelper.GetBlockNumberconsistently so all bounds, guards, and scans agree on the committed view.rpc/jsonrpc/eth_call.goCall— already usedBlockOverlayTemporalTxfrom #20195. Now carries a comment documenting that block-table reads see overlay-N during the bg-commit window but state reads (GetLatest,GetAsOf, etc.) evaluate against N-1, because the SD's domain mem batch is not exposed by the overlay. The proper SD-aware temporal view follow-up is tracked in #21314.Overlay safe-close invariant —
db/kv/membatchwithdb/memory_mutation.go,db/state/execctx/domain_shared.go,rpc/rpchelper/filters.go. The pattern "RPC reader acquires an overlay view, then the FCU bg-commit goroutine closes the published SD while the reader is still iterating" is safe today only because theBlockOverlayis backed byNewMemoryBatch(pure-GomemStore—Rollback/Closeare no-ops on the in-memory data) and views hold their own caller-supplied backing tx. Both properties were load-bearing but undocumented.MemoryMutation.newReadViewMutnow panics ifm.memTxis not*memStore. Catches the case where a future change swaps inNewMemoryBatchMDBX(which does invalidate cursors onRollback) without adding refcount/drain logic.MemoryMutation.Rollback,MemoryMutation.NewReadView,SharedDomains.Close,Filters.WithOverlay, andFilters.WithTemporalOverlaycarry comments explaining the invariant and the consequence if it is relaxed.Bg-commit ↔ post-FCU RW tx race fix —
cmd/utils/app/import_cmd.go,execution/execmodule/execmoduletester/exec_module_tester.go. Both opened a fresh RW tx (toWriteHeadBlockHashor to read committed state) immediately afterUpdateForkChoicereturned. Under bg-commit that races the bg goroutine writing the header bytes —HeadBlockHashcould land in MDBX before the header itself, crashing the next startup inBlockReader.CurrentBlock. Both now callExecModule.WaitIdleto drain the FCU semaphore first. (Theimport_cmdrace was caught by a real Hive failure during PR testing.)Test unskip —
execution/execmodule/exec_module_test.go.TestNotificationDispatchBackgroundCommitpreviously skipped with the message "FCU N+1 reads stale state from DB". The ExecModule semaphore actually serializes successive FCUs (the bg goroutine releases only afterFlush+Commit), so the race described in the skip cannot happen. Unskipped; comment rewritten.Safety
FCU sequencing.
forkchoice.go:158TryAcquire(1)one.semaphore. In the bg path,shouldReleaseSema=falseso the outer defer doesn't release; the bg goroutine hasdefer e.semaphore.Release(1)(line 658) which fires LIFO afterrunPostForkchoicereturns (commit complete) and afterPublishOverlay(nil).AssembleBlockandValidateChainuse the same semaphore, soengine_newPayload/engine_getPayloadfrom the CL serialize behind a pending bg commit — they'll seeExecutionStatusBusyand retry, never a torn state.Embedded rpcdaemon. Overlay-aware paths read FCU N's block-table writes (canonical hashes, headers, bodies, TDs, stage progress, TxNums, forkchoice markers — all written via the
BlockOverlayatforkchoice.go:247tx = currentContext.BlockOverlay()). Pre-commit: readers see N via overlay. Post-commit (afterPublishOverlay(nil)): readers cascade to MDBX which now has N. No stale window.Remote rpcdaemon.
dispatchNotificationsFromOverlaysends theStateChangeBatchover gRPC pre-commit;kvcache.Coherent.OnNewBlockadvances. The remote tx (gRPC remotedb) readsPlainStateVersionfrom committed MDBX, andCoherent.View(tx)keys by that — so during the bg-commit window remote reads resolve to the matching root for the PSV they observe.PlainStateVersionis incremented inside the commit batch (TemporalMemBatch.Flushatdb/state/temporal_mem_batch.go:787), so PSV moves atomically with the rest of the state. Remote consumers see a consistent (lagging) state, never divergent.Known limitation
eth_call(latest),eth_getBalance(latest),eth_getStorageAt(latest),eth_getCode(latest): during the bg-commit window the header resolves to N (via overlay tables) but state reads evaluate against N-1 because the SD's domain mem batch is not exposed by the block overlay. Bounded staleness (~50 ms), not corruption. The proper SD-aware temporal view that would close this gap is tracked in #21314.Test plan
make lintclean (multiple passes, non-deterministic per CLAUDE.md)make erigon integrationbuildsmake test-shortonrpc/jsonrpc/,execution/execmodule/,db/kv/membatchwithdb/,db/state/execctx/,rpc/rpchelper/TestNotificationDispatchBackgroundCommit(unskipped) passesrelease/3.4levelseth_call(latest)lag matches the documented limitation🤖 Generated with Claude Code