State Cache Consolidation (PR #1 of the perf stack) by mh0lt · Pull Request #21380 · erigontech/erigon

mh0lt · 2026-05-23T18:47:15Z

This PR consolidates all state cache usage across all modes — sync, tip tracking, integration and testing. The intention is that after this the state cache is either on and completely trustworthy, or off to measure performance differences — not off because it unexpectedly breaks things.

Important

Do not merge until this has been run against both parallel and serial CI. The whole point of the consolidation is that the cache behaves identically across exec modes; that has to be CI-confirmed, not just locally, before this lands.

The 74 commits group into three layers that build on each other.

1. `BranchCache` — single aggregator-scope commitment cache

Aggregator-lifetime cache with pinned root + bounded LRU tail (31fdf57893, 2849165e1e, 1593f0e8c0).
Wired into the trie read + encoder write paths (8a2c56513e).
Concurrency contract documented (1593f0e8c0); divergence detector + per-block fingerprint log for verification (f87937e738, 8d6d19a3af, 87f12be644).
Behind the sd.mem chain so unwinds and fork-validations see consistent state (6989427ad9, 9ef3b9c6a0).
Explicit lifetime APIs — pin-aware Invalidate, GetWithOrigin, Clear — for the paths that need to reset cache state (6ea7ff8bb5).

2. Trunk preload pipeline

Recursive PreloadContractTrunk enumeration over the patricia trie (9b7c02aece).
Pinned tier on the cache that survives normal LRU eviction (4410704e7a).
Adaptive controller for promotion / extension / demotion of contract trunks (be2616e6c2, b082c93a4d).
Parallel file-only wave-BFS preload with per-wave budget cap (3e0154f245, 4c9968a437).
Resumable parallel preload + adaptive parallel-mode wiring (9db0d5955b).
MDBX-resident branch prefetch (phase 1) (eb0c0c5803).

3. Cache lifecycle fixes — the "trustworthy or off" guarantee

SD.Unwind invalidates stale BranchCache entries (9ef3b9c6a0).
BranchCache cleared on SetHead unwind (b2a9d649aa).
ValidateChain SD chained to canonical generation for head-extending payloads (733dad5385, 3065d89b9a).
Trie checkpoint key kept out of BranchCache (ae9b3f3ebb).
ResetExec now clears the aggregator's BranchCache (Parallel-exec: route the per-tx writeset through one faithful path; remove normalizeWriteSet/calcState; carry serial-finalize signals on ExecutionResult #21138 fix — see below) (f151829dca).

BUG #21138 — parallel-exec from-0 wrong trie root

This was the last known failing test on the branch (TestFromZero_GenesisAllocPreservedAfterResetReExec/parallel), self-labelled in the test as BUG #21138. PR #21017 explicitly listed it as an open follow-up.

Root cause

ResetExec wipes the commitment DB table but the aggregator's in-memory branchCache was holding entries that still referenced the just-deleted trie nodes. A subsequent from-0 re-exec served those stale entries when computing block 0's commitment, producing a wrong trie root and dropping every genesis-allocated balance that no later block touched. Mainnet manifestation: block 46147, address 0xA1E4380A3B1f749673E270229993eE55F35663b4.

Serial exec happened to skirt the issue; parallel exec hit it directly. The 8 genesis writes flow identically through worker → normalize → apply in both serial and parallel — the divergence is purely cache state surviving ResetExec.

Fix

21-line addition to ResetExec in execution/stagedsync/rawdbreset/reset_stages.go: after the DB-wipe block, open a BeginFilesRo to reach AggregatorRoTx.BranchCache() and call bc.Clear(). The method existed for exactly this purpose ("Use on Reset / fork-validation paths to ensure stale entries from one trie root are not served against a different root"). No caching disabled, no parallel-exec workaround — root cause addressed at the natural pairing site, in line with the PR's "trustworthy or off" intent.

Verification

=== RUN   TestFromZero_GenesisAllocPreservedAfterResetReExec
=== RUN   TestFromZero_GenesisAllocPreservedAfterResetReExec/serial
=== RUN   TestFromZero_GenesisAllocPreservedAfterResetReExec/parallel
--- PASS: TestFromZero_GenesisAllocPreservedAfterResetReExec (0.06s)
    --- PASS: TestFromZero_GenesisAllocPreservedAfterResetReExec/serial (0.04s)
    --- PASS: TestFromZero_GenesisAllocPreservedAfterResetReExec/parallel (0.02s)

Broader test surface green: db/state/..., execution/stagedsync/..., execution/execmodule/..., rpc/jsonrpc/....

State

Rebased onto current origin/main (50 commits behind → 0).
make erigon, make integration, make lint all green (0 lint issues).
Previously-failing tests on this branch (TestAggregator_RebuildCommitmentBasedOnFiles, TestSetHeadCanonicalCleanup, TestFromZero_GenesisAllocPreservedAfterResetReExec) all PASS.
PR ci: matrix-test serial vs parallel exec across the test workflows #21017's parallel-exec correctness commits (1fca63d787, 2fa4c21fc5, 572cb24590) were verified already absorbed in this stack (empty cherry-picks).

🤖 Generated with Claude Code

Pure refactor — behavior preserved. Splits the encoded-branch→cells decode logic out of HexPatriciaHashed.unfoldBranchNode into a free function DecodeBranchInto so the same code is consumed by: - unfoldBranchNode (existing trie unfold path) - future cache populators (decoded-payload BranchCache) - future parallel pre-unfold orchestrator (Stage E) Today these would each have to re-derive the encoded-branch parsing logic. Centralising it ensures one decoder, one set of edge cases, one place to fix any bug in the on-disk format handling. DecodeBranchInto is intentionally PURE — it does not call deriveHashedKeys. Trie callers (which need hashed keys for the fold state machine) follow the decode with their own keccak loop. Cache callers can skip the derive step entirely until the cell is consumed by the trie. Tests: - TestDecodeBranchInto_RoundTrip: BranchEncoder.EncodeBranch produces bytes that DecodeBranchInto recovers cell-for-cell. Property test that keeps the canonical decoder consistent with the canonical encoder. - TestDecodeBranchInto_DeletedFlag: touchMap/afterMap convention with the deleted parameter. - TestDecodeBranchInto_TruncatedInput: clean errors on truncated input (no panic). Plus the existing commitment test suite (incl. trie-mismatch tests in TestBranchData_*) all pass without modification, confirming the refactor preserves unfoldBranchNode's behavior. This is the foundation for the next refactors in the representation-reduction track (see agentspecs/trie-data-pipeline-complexity-tax.md): subsequent PRs will introduce a decoded-payload cache that reads through this same decoder, and will lift unfoldKeyPath as a per-key traversal primitive that the warmer + future Stage E both consume. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Introduces a new BranchCache type, distinct from WarmupCache and designed for the longer-lived caching the cross-block persistence work needs (step 7 of the representation-reduction sequence). Distinguishing characteristics vs WarmupCache: - Bounded LRU tail with configurable capacity (vs WarmupCache's unbounded map). Suitable for caches that outlive a single Process without unbounded memory growth. - Single pinned slot for the root branch (compact prefix [0x00]). Root never evicts. Atomic-pointer load/store on the hot read path, no lock involved. - dirty-flag + PutIfClean invariants — same semantics as the invariants added to WarmupCache in the previous commit. Lets cross-block writers race safely with fold updates. - Lazy GetDecoded — same lazy-decode pattern as WarmupCache's GetBranchDecoded; cells populated on first decoded-read and cached for subsequent reads. NOT yet wired into the trie's read or write paths. This commit just adds the type, with tests. The trie integration (where this cache plugs into branchFromCacheOrDB and the encoder's PutBranch) is the discussion point at the step 6 boundary — see the conversation captured at this point in the representation-reduction sequence. Today the cache is intended to be ephemeral (per-Process, constructed alongside the trie, dies with it). Step 7 lifts the lifetime to the aggTx level for cross-block persistence; the cache shape (bounded LRU + pinned root + dirty-flag) is in place ahead of that. Tests: - TestBranchCache_RootPinning: root branch lands in pinned slot, deep branches land in LRU tail; per-tier hit counters update independently. - TestBranchCache_RootSurvivesEvictionPressure: root persists when tail is overfilled past capacity. - TestBranchCache_DirtyFlag: PutIfClean refuses dirty entry, unconditional Put replaces and clears dirty. - TestBranchCache_GetDecoded: lazy-decode round-trip with BranchEncoder; cells pointer reused across reads. - TestBranchCache_Invalidate: removes from both tiers. - TestBranchCache_Clear: empties both tiers, resets stats. - TestBranchCache_Stats: deterministic format with per-tier counts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Doc-only addition to BranchCache's package-level comment, capturing the caller invariants the cache assumes and the conditions under which the existing concurrent trie satisfies them. Three caller invariants: 1. Single writer per prefix at any moment. 2. Mark-dirty-then-Put discipline for racing writers. 3. Decoded cells from GetDecoded are read-only (alias entry storage). The current ConcurrentPatriciaHashed satisfies all three by construction: - Mounts partition by first nibble (disjoint prefix spaces, no cross-mount writes to the same key). - Root branch written by single sequential fold post-Wait. - Mount→root grid roll-up is rootMu-protected (in-memory grid only, separate from cache writes). Doc explicitly flags that any future parallel fold redesign (Stage F in agentspecs/stage-e-pre-unfold-design.md) MUST preserve these invariants — particularly the single-writer-per-prefix one, which breaks if parent branches are written incrementally as children complete in parallel. The required coordination layer goes at the orchestrator (per-parent atomic counter; only the last-decrementer writes the parent), NOT inside the cache. The cache's existing primitives (atomic dirty flag, thread-safe LRU, atomic root pointer) are sufficient for that orchestrator to build on. Motivation: Stage F is likely deferred because the bench data shows fold isn't the bottleneck for the canonical SSTORE-bloat workload. But adding the constraint to the cache later (after caching is in production) is much harder than documenting it now — correctness regressions from a missed coordination layer can hide for many blocks. Documenting the contract on the cache itself ensures any engineer touching parallel fold sees it. No code change. Doc-only. All tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Step 7a of the representation-reduction sequence: per-Process integration of the BranchCache type added in the previous commit. Plumbing: - Trie interface gains SetBranchCache(*BranchCache). HexPatriciaHashed implementation propagates to its branchEncoder. ConcurrentPatriciaHashed implementation propagates the SAME instance to root + all 16 mounts (sharing one cache is correct under the concurrency contract — mounts partition prefix space by first nibble, so cross-mount writes target distinct keys). - InitializeTrieAndUpdates constructs a new BranchCache(default) per trie instance and attaches it. Lifetime today = trie lifetime = per-Process. Future cross-block persistence work (step 7b) lifts this to aggTx scope by constructing the cache one layer up and passing it in. Read path (HPH.branchFromCacheOrDB): L1 WarmupCache (existing) → L2 BranchCache (new) → L3 ctx.Branch. L3 hits with non-empty result populate L2 so subsequent reads hit L2 within the cache's lifetime. L1 stays first because warmup workers may have pre-fetched with prefix-walk-derived freshness. Write path (BranchEncoder.CollectUpdate): - MarkDirty(prefix) BEFORE encode work — protects against concurrent warmup-style writers racing into PutIfClean during the encode (race documented in the cache's Concurrency Contract). - Put(prefixCopy, updateCopy) AFTER ctx.PutBranch succeeds — replaces the dirty entry with fresh canonical bytes. Single writer per prefix per fold step (current sequential fold + first-nibble mount partitioning) means no race on this Put. Lifecycle: HPH.Reset clears the BranchCache when called from the root trie (gated by !hph.mounted). Mounted subtries share the root's cache, so a mount calling Clear would dump entries the root still wants. Carries the invariant from PR #19954 commit 1612d56. Today's expected performance impact: minimal. Per-Process lifetime means cache is empty at Process start, so first reads always miss. The cache helps only branches that are read multiple times within ONE Process — uncommon in current code paths. Step 7b is where the real perf swing comes from (cross-block persistence so block N reads hit branches written by block N-1). This commit is the safe stepping stone: it validates the wire-up end-to-end (read path + write path + concurrency contract + lifecycle) without changing perf characteristics. Bench should match Run I baseline (7.16 mgas/s on canonical SSTORE-bloat block). All commitment tests pass, lint clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously the BranchCache was constructed inside InitializeTrieAndUpdates, giving it per-SharedDomains lifetime. SharedDomains is reconstructed for every batch / many tx boundaries — verified with logs in the prototype (921 fresh BranchCache constructions per bench run) — so the cache started cold every batch and never delivered the cross-block hits the design is about. Per-Process scope kept cache-cleanup correctness simple but defeated the whole point of caching. Place the cache on the commitment Domain struct (one cache per aggregator, matching the pattern on add_execution_context_with_caches where each domain owns its valueCache). ConfigureDomains attaches the cache once after domains are initialised — idempotent, lifetime = aggregator lifetime. AggregatorRoTx exposes BranchCache() returning the commitment domain's cache, so SharedDomains' construction path can fetch it without forcing db/state/execctx to import db/state (db/state already imports execctx via squeeze.go, so the reverse import would create a cycle). The placeholder commitment.BranchCacheProvider interface lets the SD construction path use a duck-typed type assertion on tx.AggTx() any. Plumb the cache through NewSharedDomainsCommitmentContext into InitializeTrieAndUpdates as an explicit parameter; nil falls back to a fresh per-init cache so test helpers without an aggregator still get a valid cache. Reset behaviour: HexPatriciaHashed.Reset no longer calls Clear on the cache. Aggregator-scope persistence requires the cache to survive Reset between commitment calculations. Callers that genuinely need to invalidate the cache (unwind, fork validation) now call ClearBranchCache explicitly. The bench is forward-only so this is a safe change for measurement; an explicit unwind clear path can land in a follow-up commit when needed.

Adds two debug-gated diagnostics for cross-block-cache-lifetime investigations. Both off by default — meant to be flipped on when a correctness regression surfaces and you need to localise *where in the fold path* the cache started lying, instead of waiting for a downstream trie-root-mismatch many blocks later. 1. BRANCH_CACHE_VERIFY: branchFromCacheOrDB cross-checks every L2 (BranchCache) hit against ctx.Branch and increments a divergence counter when bytes disagree. Logs the prefix and both byte forms so the first divergent read shows up directly in the erigon log. BranchCache.VerifyDivergences() exposes the count for assertion in tests. 2. BRANCH_CACHE_FINGERPRINT: SharedDomainsCommitmentContext.ComputeCommitment emits a "[cache-fp]" log at end of every compute with (block, root, cache fingerprint, divergence count). Two builds running the same workload can be diffed offline ("first block at which their fp's differ") to nail down the first block where lifecycle invariants diverged. Fingerprint is an order-independent FNV-1a fold over (key-hash, data-hash) pairs across both root and tail tiers. Adds maphash.LRU.Range so the Fingerprint can iterate the LRU tail without touching recency (Peek under the hood). The wrapper otherwise discards original byte-keys on insert; mixing by hash instead of key is correctness-equivalent up to hash collision, which is acceptable at the working-set sizes here. Motivation: today we just chased a wrong-root regression to commit d673052 (aggregator-scope cache lifetime). Took two full bench cycles (~12 min) to localise. With the divergence detector running, the first divergent read would surface in the erigon log within seconds. With fingerprint logging in two builds (one good, one regressed), the diverging block boundary would be a single grep across two log files.

The deferred-encoding path (CollectDeferredUpdate + ApplyDeferredUpdates) parallelises EncodeBranch + merge work at apply time. The cache correctness consequence: between collect and apply, sd.mem holds the old state while the cache is also unchanged. When apply finally fires (end of Process or duplicate-prefix flush mid-Process), sd.mem advances but the cache isn't touched — CollectDeferredUpdate doesn't have a cache hook (and wiring one breaks TestSharedDomain_RepeatedUnwindAcrossStepBoundary + TestCustomTraceReceiptDomain because it violates an implicit trie-during-Process cache stability invariant). The result on the FV bench: cache stays at the very-first-Process's read-side L3 fallback bytes for prefixes the trie writes, while ctx.Branch advances to each block's actual state. Divergence on every hot prefix (root, root-zone children) starting from block 2. Use CollectUpdate (inline) instead. CollectUpdate writes sd.mem + WarmupCache + BranchCache atomically at fold time via PutBranch — cache mirrors sd.mem at every write, the trie sees its own writes consistently, and the cross-Process cache state matches what post-FCU MDBX commit produced. Loses the encoder's parallel-encoding optimisation, but bench profile is I/O-bound, not encode-CPU-bound, so the trade is favourable. History (ETL) writes are still inline via DomainPut. Splitting that ("sd.mem inline, history queued for flush at FCU") is the proper architectural answer to defer the slow disk work without touching the sd.mem invariant — tracked as a follow-up.

Bisection helper: force branchFromCacheOrDB to skip the L2 BranchCache read path entirely so every read goes via ctx.Branch (sd.mem → MDBX). Cache writes still fire so verify-mode can keep comparing cache vs canonical. Flipping the env at runtime distinguishes "cache holds bad data" (bench passes further with cache reads off) from "deeper compute bug" (same failure regardless). Used in the 2026-05-06 investigation to confirm the cache was actively corrupting block 13's compute on the canonical SSTORE-bloat bench: with cache reads on, wrong-trie-root at block 13. With reads off, blocks 13-16 produce the correct roots and the bench advances to a separate failure at block 17 (unrelated pre-existing bug). Default off; gate via env DISABLE_BRANCH_CACHE_READS=true.

When verifyBranchCache=true and a cache hit disagrees with ctx.Branch, sample sd.mem, sd.parent.mem, and tx-direct (MDBX) for the same prefix and dump all layers in the divergence log line. Comparing those four byte sequences against the cached and canonical bytes pinpoints which state layer holds the bytes the cache disagrees with — the rewriter we need to identify before fixing the canonical-store-divergence bugs the cache currently exposes. Decision matrix (read off the log line): - cache != tx, sd.mem == cache → in-memory writer is fresh, MDBX is stale (commit-timing issue). - cache != tx, tx == ctx.Branch, sd.mem != cache, parent.mem != cache → MDBX has been rewritten by something outside the CollectUpdate write path (collation, file build, squeeze). - cache != ctx.Branch, parent.mem matches ctx.Branch but sd.mem doesn't → parent merge is the source. - cache != ctx.Branch, all of sd.mem / parent.mem / tx == ctx.Branch → cache itself was populated incorrectly (write-side bug). Three changes: 1. SharedDomains.ProbeReadLayers (db/state/execctx/domain_shared.go): public method that samples sd.mem, sd.parent.mem (private field accessed from the same package), and tx.GetLatest. Read-only; copies bytes so callers can hold them past tx lifetime. 2. TrieContext (execution/commitment/commitmentdb/commitment_context.go): add probeSd + probeTx fields populated at trieContext() construction; expose ProbeStateLayers method that delegates to sd.ProbeReadLayers. The local `sd` interface gets the ProbeReadLayers method too so the duck-typed reference can call it without an import cycle to execctx. 3. branchFromCacheOrDB log site (execution/commitment/hex_patricia_hashed.go): on divergence with verifyBranchCache, type-assert ctx for the probe interface and append sd_mem / parent_mem / mdbx fields to the log line. Existing field shape preserved for log parsers; new fields are additive. Pre-existing test failures (TestSharedDomain_RepeatedUnwindAcrossStepBoundary, TestValidateChainAndUpdateForkChoiceWithSideForksThatGoBackAndForwardInHeight) are unchanged — they were failing on the stack before this probe landed and are part of what the divergence work is meant to localise.

Per-write provenance for divergence-detection diagnostics. When a divergence fires (cache hit disagrees with ctx.Branch), we now log which write site produced the cached bytes and when, so we can correlate the bad write against the FCU / build / step timeline. Tag fields added to branchCacheEntry: - origin short label of the write site (e.g. "CollectUpdate", "L3-fallback-read") - writeSeq monotonic counter per BranchCache instance - writeTimeNanos unix nanos at write time Put / PutIfClean signatures take an origin string. Two writers updated: - BranchEncoder.CollectUpdate → "CollectUpdate" - branchFromCacheOrDB L3-fallback Put → "L3-fallback-read" GetWithOrigin returns bytes plus the metadata; uses a non-counting peek so it can be called alongside Get without double-counting hits. The divergence-detection log site at branchFromCacheOrDB now appends cache_origin / cache_seq / cache_t_ns fields. Combine with the existing sd_mem / parent_mem / mdbx fields to localise both who wrote the stale bytes and which layer the canonical value lives in. Pre-existing test failures (TestValidateChainAndUpdateForkChoiceWithSideForksThatGoBackAndForwardInHeight) are unchanged from previous commits.

BranchCache previously sat in front of the sd.mem -> parent.mem -> MDBX read chain (consulted in branchFromCacheOrDB before ctx.Branch). The shared aggregator-scope cache was written from CollectUpdate by every SD running commitment compute, including fork-validator SDs whose writes never reach MDBX. Origin-tagged probe (run-step7b-probe-sdid) showed five distinct SD pointers writing the same prefix to a single cache entry, so any reader whose lineage didn't match the most-recent writer saw bytes that disagreed with MDBX -> wrong trie root from block 13 onward in the canonical SSTORE-bloat fork bench. Layering after this change: Read: sd.mem -> sd.parent.mem -> branchCache -> aggTx (MDBX) Write: sd.mem only (DomainPut path) Flush: sd.mem -> MDBX, then branchCache.Clear() The cache now mirrors MDBX-flushed bytes only. Writers' in-flight bytes live in sd.mem above; cache hits below sd.mem are always equivalent to reading MDBX, so cross-SD pollution is impossible by construction. Cache fills lazily on the MDBX-read path inside sd.GetLatest, and clear-on-flush prevents pre-flush bytes from coexisting with new MDBX state. Per-key invalidation is a follow-up (PR2). BranchCache entries gain a step field so Get returns (data, step, ok) matching the aggTx contract. Without this, sd.GetLatest's cache hit returned step=0 and CheckDataAvailable rejected the boot SeekCommitment with "commitment state out of date". Removed: - cache.Put from CollectUpdate (commitment.go) - cache.Put + divergence detection from branchFromCacheOrDB (hex_patricia_hashed.go); now just calls ctx.Branch - L3-fallback Put (cache fills via sd.GetLatest now) Validated on canonical cold bench (run-step8b): first FCU VALID, all payloads through end VALID, 0 cache divergences, 0 wrong-root errors. Prior probe bench had 23 divergences and INVALID payloads from the fail block onward. Probe scaffolding (SiteIdentity, ProbeStateLayers, divergence counters) left in place for now; can be stripped in a cleanup follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ache state Update the BranchCache type comment to reflect the architectural state after the WarmupCache consolidation (steps 2a-2c, 3, 4): - BranchCache is the single branch cache (WarmupCache deleted) - Aggregator-scope lifetime, plumbed via BranchCacheProvider - Passive store: cache itself never reaches into underlying state - Branch warmer is branch-scoped; no leaf-data prefetch - Block-processing trie walker takes Updates from executor + memoization for siblings — no prefetch needed - Witness / proof generation walker drives its own state reads; if that path turns out to be cold-bound it indicates a need for separate account/storage caches (treat as separate concern with different scope/lifetime/invalidation; do not regrow the branch warmer to cover it) - disk_sto / disk_acc counters on cache-fp log surface any unexpected fall-through to ctx.Account / ctx.Storage as a signal of memoization gap or missing walker-side prefetch Doc-only; no behaviour change.

Adds a third cache tier between the root-pin slot and the LRU tail: a per-prefix pinned map fed by PinEntry. Pinned entries: - Never evict (no LRU pressure on this tier). - Are checked in lookup before the LRU tail, after the root slot. - Survive Put/SD.Flush updates: a Put for a pinned prefix updates the pinned entry in place rather than displacing it. Cross-block correctness via the existing dirty-flag invalidation discipline is preserved — the new bytes land in the same pinned slot. - Carry the same metadata as Put-tier entries (step, origin, writeSeq, writeTimeNanos) so divergence-detection and Stats treat them uniformly. Sized by the preload policy. Intended consumer is the storage trunk preload for big contracts (the 'storage root trunk cache for big accounts' direction): per-contract trunk branches at depth 65-70 get pinned at SD/cache creation, persist for the cache's lifetime. PinEntry is the public API; pinnedHits/pinnedMisses atomic counters are the new stats. PinnedCount() exposes current size for observability. Step 2 of the storage-trunk pin prototype.

Adds a function to pre-pin commitment branches for a given contract's storage subtree. Walks the trie depth-by-depth from depth 64 (storage subtree root) down to maxDepth: reads each branch via the supplied CommitmentReader, pins it via BranchCache.PinEntry, decodes the child bitmap, and recurses only into children that actually exist (no blind 16-way probing). contractHash is keccak256(address); the trunk lives at the prefix corresponding to the first 64 nibbles of the storage path. For a dense storage subtree (the SSTORE-bloat workload's bloat contract), expected pin count is ~16 + 256 + 4096 ≈ 4.4 K branches at depth 65/66/67, plus the root at depth 64. Sparse subtrees produce fewer. Loading strategy: per-prefix lookups via the reader. Simplest correct implementation. A bulk seg.Getter range-scan over sorted .kv would amortize disk seeks (per the parity-cluster observation in the consolidation memo) but requires building a prefix-range API on top of recsplit; defer until per-prefix lookup is shown to bottleneck. Step 3 of the storage-trunk pin prototype.

Adds a SharedDomains constructor hook that, when PIN_CONTRACT_TRUNKS is set, fires a one-shot background goroutine to preload the storage-subtree-trunk of each listed contract into BranchCache's pinned tier. Format: comma-separated list of 64-hex-char contract hashes (each is keccak256(addr)). Mechanism: - BranchCache.TryClaimPreload (atomic CAS) ensures the goroutine fires exactly once per cache lifetime, even though many SDs may be constructed (per-tx instances etc.). - Goroutine wraps sd.GetLatest as a CommitmentReader and calls commitment.PreloadContractTrunk for each contract hash, depth 64-70. - Logs progress per contract on completion. Closure-over-(sd, tx) is the prototype shape — works for the bench (both live for the whole process). Production deployment needs to revisit the lifetime — sd's tx may not outlive the goroutine. Step 4 of the storage-trunk pin prototype. Bench measurement is the next step (commit 5).

Previous async-goroutine shape (d204c1b) shared the SD's MDBX tx with the calling thread. Concurrent cursor use under the same tx tripped Go's cgo-pointer-pinning runtime check: panic: runtime error: cgo argument has Go pointer to unpinned Go pointer surfacing in an unrelated PruneBlocks goroutine during boot. Make the preload synchronous in the SD constructor for now: same TryClaimPreload guard (fires once per cache lifetime), but no goroutine. Boot pays the per-contract preload time as a one-off. Background-with-own-tx is the proper shape and remains a follow-up; owning the SD's tx exclusively for the preload duration is the safe shape until that lands.

… cap The previous bench (run-pin-trunk-instrumented-cold-cgroup-191347) hung at SD construction with no [trunk-preload] log lines for 5+ minutes. Erigon never reached "engine RPC ready" so all blocks came in as SYNCING. Two changes to localise + bound: 1. **Localisation**: add INFO logs at triggerTrunkPreload entry, per-contract starting/done with took, and a 500-prefix progress log inside PreloadContractTrunk. Whatever it does (or hangs on) is now observable. 2. **Bound**: cap PreloadContractTrunk at 10000 branches (vs ~4.4K expected for a saturated 4-level subtree at maxDepth=67). Drops maxDepth from 70 → 67 in the trigger (depth 64-67 = 16+256+4096 max branches) so we don't recurse into the per-slot tail where pinning has no value. Preload fails-fast on pathological subtrees rather than blocking SD construction indefinitely.

The previous shape (d204c1b) ran triggerTrunkPreload BEFORE sd.SeekCommitment in NewSharedDomains. Bench result: when the preload fired (PIN_CONTRACT_TRUNKS set), every subsequent block came back SYNCING — engine kept attempting backward-download which fails on this peerless setup, no block ever validated, no cache-fp ever fired. Without the preload firing, the same binary works fine (verify-bench PASS at 3.26s). Hypothesis (untested but matches the symptom): preload's sd.GetLatest reads ran before SeekCommitment had resolved the SD's view of the chain head. Pinned values were therefore inconsistent with the committed state, and the trie compute on the first block got wrong root → SYNCING → backward-download → no peers → death spiral with no Flush ever updating the (stale) pinned entries. Fix is mechanical: move the preload call to after SeekCommitment. The TryClaimPreload guard still ensures fire-once-per-cache lifetime. If subsequent bench shows pin_count > 0 + pin_hit > 0 + blocks validating normally, the hypothesis is confirmed; if SYNCING repeats, the bug is something else and we need to revert and debug differently.

…ParaTrieDB Previous prototype iterations both broke block validation: 1. Async sharing the SD's MDBX tx (d204c1b) → cgo "unpinned Go pointer" panic from concurrent cursor use. 2. Synchronous from NewSharedDomains (5a81976 / 4c9ead456d) → blocked the engine HTTP handler for ~3-4s during the preload window, causing the bench's first NewPayload to be dropped. Confirmed: the bench's height=24358001 is ABSENT from the erigon log; the next received block (24358002) then fails backward-download (no peers) → SYNCING forever. Restructure: - Move trigger from NewSharedDomains to EnableParaTrieDB. The latter is called from the staged-sync exec-stage init, NOT from request handlers, AND has access to a kv.TemporalRoDB. - triggerTrunkPreload now takes the DB (not a tx) and spawns a goroutine that opens its OWN tx via db.BeginTemporalRo. No shared cursors with the main pipeline; no blocking the engine. - Reader uses tx.GetLatest directly (not sd.GetLatest) — the SD layering would re-introduce shared-state risk and isn't needed (pinned bytes don't depend on sd.mem state). Same TryClaimPreload guard ensures the preload fires once per BranchCache lifetime regardless of how many SDs construct. If this works the bench should: - Show [trunk-preload] log lines firing once - Pin ~4369 branches - TEST block cache-fp shows pin_hit > 0 and files_comm < 1K - All blocks validate normally (no SYNCING failure)

Make the trunk-pin maxDepth configurable via env (default 67) so we can sweep depths to find the memory/perf sweet spot without rebuilding. Bump the per-contract maxBranches cap from 10K to 200K so deeper saturated subtrees don't get truncated mid-walk.

The previous code disabled the Warmuper for the parallel commitment path out of concern that it would interact with the calculator's SetUpdates call. In practice the Warmuper's reads are independent of the calculator's update buffer — they pre-fetch branch data while EVM execution runs, and the calculator's SetUpdates only affects ComputeCommitment's input set, not the warmup paths. Re-enabling produces a measured 8× throughput improvement on the perf-devnet-3 SSTORE-bloated benchmark (block 24358306, the canonical fixture for #20920), restoring the win first observed in Run H/I of the trie-perf investigation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds atomic counters (branch hit/miss/evict + bytes-served, account hit/miss, storage hit/miss) to WarmupCache, plus a Stats() string formatter and ResetStats() for per-Process accumulation reset. Counters are updated on every Get/Evict path (existing Put paths were already counted via cache size). Useful for: - Confirming warmup effectiveness in production logs - Per-block diagnostics when investigating commitment perf - Future per-pool dashboards once a coordinator/observability layer lands (tracked separately) No behavior change beyond the counter updates themselves. Stats() format is one line, suitable for embedding in the existing LogCommitments output. ResetStats() zeros counters without touching cached data — useful for per-Process windowed measurement. Clear() also resets counters along with the data, since data and counters were accumulated together. Test: TestWarmupCache_Stats covers hit/miss/evict accounting across branch/account/storage paths and verifies Stats() format + ResetStats() preserves data. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pure refactor — behavior preserved. Lifts the unfold-loop from HexPatriciaHashed.followAndUpdate into its own method, parameterized over (hashedKey, plainKey) and intended as the per-key traversal primitive that future orchestrators consume. Today only followAndUpdate calls it, replacing the inline loop with a one-line call. The extracted method preserves the existing metric attribution (StartUnfolding) and trace-print behavior verbatim. Why now: this is the second step in the representation-reduction sequence (see agentspecs/trie-data-pipeline-complexity-tax.md). Future PRs will introduce orchestrators that drive unfold-only walks of touched-key paths to fill cell state without going through the full fold/update cycle: - Cache populator (decoded-payload BranchCache) needs to walk a touched-key path and capture the cells encountered, without triggering fold or modifying the trie's update buffer. - Stage E parallel pre-unfold orchestrator drives unfoldKeyPath across multiple HexPatriciaHashed instances concurrently to pre-warm trie state before commit. Both consume the same primitive. Centralising it now means each future orchestrator is a thin wrapper rather than a duplicate of the unfold-loop logic. Tests: full commitment test suite passes without modification (all 8+ test files in execution/commitment/), confirming the refactor preserves followAndUpdate's behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the carry-as-is correctness invariants from the PR #19954 investigation as scaffolding on the existing WarmupCache: - branchEntry.dirty atomic.Bool — signals "stale until cleared" - PutBranchIfClean(prefix, data) bool — skips write if entry dirty - MarkBranchDirty(prefix) — mark for later refusal of stale puts These are scaffolding additions; no callsites use them yet. Existing PutBranch unconditionally overwrites and clears any prior dirty flag (creates a fresh entry), preserving today's semantic exactly for existing callers. Why now: the prototype investigation (see agentspecs/commitment-cache-prototype-dev-context.md) found that inline-invalidate-on-write is incompatible with deferred encoding — update-in-place breaks correctness because there's a window between fold (computes hash, holds new state) and encoder (writes encoded bytes) where readers see stale cached bytes. The reth-research (agentspecs/reth-1ggas-research.md §4) calls the dirty-flag pattern out as the design that resolves this without forcing synchronous encoding: the encoder marks the entry dirty BEFORE its own write completes, so any racing read knows to bypass the cache for that key. Today's WarmupCache lifecycle (per-Process, warmup completes before fold begins) does NOT exhibit this race — these invariants are infrastructure for the future cross-block persistence work where warmup-style writers can outlive their parent Process. Tests: - TestWarmupCache_DirtyFlag: PutBranchIfClean refuses dirty entry, unconditional PutBranch clears dirty. - TestWarmupCache_DirtyFlag_MarkAbsentKey: marking absent key is no-op (no panic, no entry created). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds an additive read method that returns cached branches in already- decoded form, lazy-decoding on first decoded-read per entry and caching the parsed cells for subsequent reads. No existing callsite changes; existing GetBranch / GetAndEvictBranch / PutBranch callers continue to work with encoded bytes unchanged. Why now: this is step 5 of the representation-reduction sequence (see agentspecs/trie-data-pipeline-complexity-tax.md). The trie's read path currently does GetBranch (encoded) + DecodeBranchInto on every cache hit — paying decode CPU on every read. Switching that callsite to GetBranchDecoded (in a separate later commit) eliminates the redundant decode. The ENCODED form remains the source of truth — the encoder needs it for the merge-with-prev step, and it's what gets written to disk via PutBranch. The decoded form is derived lazily and cached alongside the entry. When PutBranch overwrites an entry, the new entry starts fresh and the next decoded read re-derives from the new bytes. API design notes: - Returns (bitmap, *[16]cell, ok). Caller derives touchMap/afterMap from bitmap based on its own deleted-vs-present-after context — the cache stores cells independent of that context so the same entry serves both readers. - The returned *[16]cell aliases entry-owned storage. Read-only consumption is safe across concurrent calls (decode runs at most once per entry via sync.Once); MUST NOT be modified in place. - Decode error → ok=false (don't count as hit OR miss; caller falls through to canonical re-read). Tests: - TestWarmupCache_GetBranchDecoded: round-trip equality with direct DecodeBranchInto, plus same-pointer reuse on repeat reads. - TestWarmupCache_GetBranchDecoded_Miss: behaves like GetBranch on absent keys. - TestWarmupCache_GetBranchDecoded_TruncatedData: graceful failure on corrupt entry (no panic). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ng payloads Closes a timing hole that surfaced once we wired the aggregator-scope BranchCache: between an FCU completing and the next newPayload, MDBX hasn't been committed yet (RunLoop's CommitCycle only fires under memory pressure), but currentContext.mem holds the latest writes from MergeExtendingFork. The fresh doms created in ValidateChain has no parent and a fresh roTx, so its ctx.Branch reads stale-MDBX while the aggregator-scope BranchCache (populated by the prior FV's CollectUpdate writes) holds the fresh state. That's the cross-newPayload divergence pattern observed in the bench (8-61 divergences and wrong-trie-root errors at block 3-14 across runs). Set doms.SetParent(currentContext) when the new payload extends the current canonical head (header.ParentHash == ReadHeadBlockHash). For fork payloads that don't extend head, leave parent unset: unwindToCommonCanonical below reverts doms's view to the common ancestor, and exposing currentContext.mem (post-divergence canonical writes) via the parent chain would shadow the unwound base and break fork validation. Verified by TestReorgsWithInsertChain — the "head-only" predicate is what the current single-canonical-chain SD topology supports. A proper per-branch SD lineage (each fork's validation chains to the last validated SD on its own branch, not always currentContext) is the follow-up needed for concurrent multi-fork validation. The current design supports a single canonical chain only; that's enough to close the divergence we have today, with the lineage extension tracked separately.

Foundation for the "Snapshot vs MDBX read-cost equivalence" investigation (memory: snapshot-vs-mdbx-performance-equivalence.md). This file produces the headline ratio that quantifies the gap the investigation aims to close: warm-cache reads from snapshot .kv files should cost the same as warm-cache reads from MDBX (same disk, same page cache). H0 measures how far apart they are today. Five sub-benches: - MDBX_path full chain, key in MDBX - File_path full chain, key in file - Forced_file_path file-only debug path, file-resident keys - Forced_db_path DB-only debug path, MDBX-resident keys - Bloom_miss_path file-only debug path, MDBX-resident keys (file misses in xorfilter for every probe) Two operating modes; only synthetic is wired in this commit: - Synthetic (testDbAndAggregatorBench fixture): writes 64 full 16-tx steps, BuildFiles + repeated PruneSmallBatches drains all but the tip step into files. Phase 2 keys at txNums past the built-step boundary stay in MDBX. Partition by *actual* residency after setup so bench inputs match where keys really live. - Real-datadir (--snapdatadir flag): TODO. Opens an existing chaindata+snapshots datadir read-only and picks keys via cursor iteration / .kv decompressor walk. Required for production- relevant numbers since synthetic has tiny files and small values. Initial synthetic results on AMD EPYC 4244P (Accounts domain): MDBX_path 173 ns/op File_path 226 ns/op (1.31x MDBX) Forced_file_path 30 ns/op Forced_db_path 158 ns/op Bloom_miss_path 30 ns/op Synthetic dataset is too small to surface the production gap that pprof shows (xorfilter at 35% CPU on real bloat workload). H1-H4 benches and the real-datadir mode are the next steps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the --snapdatadir flag path to the H0 bench. Opens an existing chaindata + snapshots datadir read-only via the same recipe as cmd/integration (mdbx Accede + state.New + temporal.New) and picks keys by cursor-walking the per-domain values table. Pragmatic adjustments: - On heavily-pruned production datadirs (perf-devnet-3-run was 100% pruned), every MDBX values-table row is step-shadowed by a file, so getLatestFromDb returns ok=false. The MDBX-side sub-benches skip in this case; File_path numbers stand on their own and the synthetic MDBX_path baseline serves as the cross- mode comparator. - skipIfEmpty short-circuits per sub-bench rather than failing the whole run, so we can still get the file-path numbers. First production numbers (AMD EPYC 4244P, AccountsDomain, 2012 file-resident keys from perf-devnet-3-run, fully pruned): File_path 211 ns/op (synthetic was 226; essentially same) Forced_file_path 30 ns/op (synthetic was 30; identical) Surprising finding: real .kv file reads cost the same as synthetic. This means production bloat-workload bottleneck is NOT in getLatestFromFiles — it must be in HistorySeek (.ef history files walked by HistoryStateReader.GetAsOf). The GetAsOf shortcut work flagged in getasof-regression-suspect.md is the right lead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two related additions to the snapshot-vs-MDBX perf-equivalence investigation (memory: snapshot-vs-mdbx-performance-equivalence.md): 1. H_GetAsOf bench (db/state/snapshot_vs_mdbx_bench_test.go). New runHGetAsOf with four sub-benches for HistorySeek-via-GetAsOf on file-resident keys: GetLatest_baseline, GetAsOf_recent (asOf near endTxNum), GetAsOf_mid (asOf at endTxNum/2), GetAsOf_floor (asOf=1). Tests the path the calculator's HistoryStateReader.Read uses, which is distinct from getLatestFromFiles measured in H0. Real-datadir results on perf-devnet-3 (AccountsDomain, endTxNum=2.9B): GetLatest_baseline 202 ns/op 0 allocs GetAsOf_recent 570 ns/op 0 allocs <- 2.8x baseline, no result GetAsOf_mid 235 ns/op 5 allocs GetAsOf_floor 196 ns/op 4 allocs GetAsOf_recent (the calculator's pattern after PR #21010) scans the .ef looking for a record at-or-after endTxNum-1, finds none (most keys haven't changed in the last txNum), falls through to GetLatest. The 370ns/op overhead vs GetLatest is wasted scan. Confirms the GetAsOf shortcut described in getasof-regression-suspect.md as a real lever, though small in absolute terms (~2ms/block on the bloat workload). 2. Surface "took" + "keys" on the existing [commitment][cache-fp] Info log line (commitmentdb). Was already computed in the debug-level "[commitment] processed" log, but the bench runs with --log.dir.disable so debug logs aren't captured. This made it possible to attribute the 4.3s gap inside newPayload(TEST block) directly: the calculator's ComputeCommitment takes 4220ms for the 5910-key bloat block — 91% of the entire block wall time. Per-key cost is ~700us, consistent across blocks of all sizes. The actual perf lever for the bloat workload is making per-branch ComputeCommitment cheaper, not file/state reads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds three groups of fields to the existing [commitment][cache-fp] log line so the calculator's per-block behaviour is observable at Info level (the bench runs with --log.dir.disable, so debug logs aren't captured): - took, keys: ComputeCommitment wall + key-count for the block. Pre-existing internally, now surfaced. - load, skipped, reset: process-cumulative counts of computeCellHash decisions: * load = had no memoized stateHash, fetched value from DB * skipped = had memoized stateHash, reused without fetch * reset = had stateHash but had to invalidate it Surfaced via new commitment.SkipLoadResetCounters(). - files_acc / files_sto / files_code / files_comm: per-domain file-read counts pulled from sd.Metrics().Domains[domain]. Decomposes the aggregate `files=N` from the [domain reads] log line into its actual sources (e.g. on the SSTORE-bloated block the 32k file reads break down as 5.9k Storage value loads + 26.6k Commitment branch reads + a handful of others). All counters are cumulative; per-block deltas are obtained by subtracting consecutive cache-fp lines. Pure observability — no behaviour change. Used as the measurement framework for the snapshot-vs-MDBX perf-equivalence investigation and the follow-on commits that target specific levers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ResetExec wipes the commitment table but the aggregator's in-memory branchCache was holding entries that referenced the just-deleted trie nodes. A from-0 re-exec then served those stale entries when computing block 0's commitment, producing a wrong trie root and dropping every genesis-allocated balance that no subsequent block touched. Serial exec happened to skirt the issue; parallel exec hit it directly (test_account_access/EXTCODESIZE-contract behaviour on mainnet block 46147 against 0xA1E4380A3B1f749673E270229993eE55F35663b4, and the TestFromZero_GenesisAllocPreservedAfterResetReExec parallel subtest). Drop the cache as part of ResetExec so it repopulates from the freshly- wiped commitment table. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

mh0lt · 2026-05-23T19:49:40Z

Handoff: investigation pointers for the 22 CI failures

Posting these here so anyone picking up the investigation has a self-contained starting point.

What's failing

22 of 68 checks on PR HEAD f151829dca. All 22 are eest-spec-tests / eest-spec-enginextests-* variants (benchmark 1m/5m/10m/30m/60m/100m/150m × parallel/sequential, plus stable parallel + sequential, statetests stable + devnet, plus ci-gate aggregator). Each produces the same error, repeated in a retry loop:

[EROR] [5/5 Execution] Wrong trie root of block 1
        got      1fba572808f179cdeba16a954d35c390af594b4ccf0fbdc537713468218aea31
        expected ef5a9a5b0a0c67e8c3b2e82e89a8cb91e41770955a4f47792c322d2b1e7f141d

Pull a fresh job log:

gh api repos/erigontech/erigon/actions/jobs/77542070364/logs \
  | grep -E "Wrong trie root|invalid block|panic|FAIL" | head -20

Failing CI run: https://github.com/erigontech/erigon/actions/runs/26340677958

The 3 hive failures on the same run are a different shape — go-ethereum build/dep noise, not the trie-root regression. Ignore them for this investigation.

Diagnosis done already

Main is green for the same tests:

gh api repos/erigontech/erigon/commits/ff47bb5bce/check-runs --paginate \
  --jq '[.check_runs[] | select(.name | startswith("eest-spec-tests"))] |
        {failure: map(select(.conclusion=="failure")) | length,
         success: map(select(.conclusion=="success")) | length,
         total: length}'
# → {"failure":0,"success":31,"total":31}

So the cause is in the 74 commits the PR is ahead of main, not on main itself.

The newest commit on the PR (f151829dca, the BUG #21138 fix in ResetExec) is exonerated. That fix only changes execution/stagedsync/rawdbreset/reset_stages.go::ResetExec. Callers of that function:

grep -rn "rawdbreset\.ResetExec" --include="*.go"
# cmd/integration/commands/stages.go:643          (integration tool)
# db/test/domains_restart_test.go:220, 383        (its unit test)
# execution/execmodule/execmoduletester/from0_genesis_internal_test.go:132  (verifies #21138)

The EEST runner is cmd/evm/enginexrunner.go → engineapitester.EngineXTestRunner (engine-API path). It does not call ResetExec. Don't revert f151829dca — it fixes a separate, verified bug (#21138).

The suspect family — BranchCache lifecycle

The error pattern ("wrong trie root of block 1 in engine-API runner") points at state-installation-into-SharedDomains at the genesis→block-1 boundary. Top suspects from the 73 pre-fix commits, most recent first:

sha	subject
`b2a9d649aa`	db/state/execctx, execution/execmodule: clear BranchCache on SetHead unwind
`ae9b3f3ebb`	execution/commitment: keep the trie checkpoint key out of BranchCache
`3065d89b9a`	execution: parent fork-validation SD into the canonical generation
`9ef3b9c6a0`	state/execctx: invalidate stale BranchCache entries on SD.Unwind
`733dad5385`	execmodule: chain ValidateChain SD to currentContext for head-extending payloads

These commits originated on mh/all-stack; they were never CI'd against post-#21027 main before this push. Main's commit 12113dd93b cmd/evm: add enginexrunner alongside staterunner and blockrunner (#21027) introduced the EEST runner that this PR's CI now exercises. So this is a new combination: cache stack × enginexrunner.

Bisection plan

Step 1 — repro locally (optional but faster than CI for narrow)

git fetch origin mh/perf-caches-pr
git worktree add /tmp/wt-pr21380 mh/perf-caches-pr
cd /tmp/wt-pr21380
make erigon  # we've verified this builds
# Drive a small enginextest fixture through ./build/bin/erigon enginextest <fixture>
# Expected output: 'wrong trie root of block 1' error.
# If local repro is slow, skip to Step 2 (CI dispatch via probe branches).

Step 2 — revert the BranchCache lifecycle family as a unit

cd /tmp/wt-pr21380
git checkout -b mh/issue21380-probe-revert-branchcache-family
# Newest first (in case of dependencies between them):
git revert --no-edit b2a9d649aa
git revert --no-edit ae9b3f3ebb
git revert --no-edit 3065d89b9a
git revert --no-edit 9ef3b9c6a0
git revert --no-edit 733dad5385
# If conflicts get hairy, use `git rebase -i` with `drop` lines instead.
make erigon && make lint
git push -u origin mh/issue21380-probe-revert-branchcache-family
gh pr create --base main --head mh/issue21380-probe-revert-branchcache-family \
  --title "[PROBE-ONLY] revert BranchCache lifecycle family — investigates #21380 EEST regression" \
  --body "Probe for the suspect family in #21380. Do not merge."

If the EEST jobs pass on the probe → cause is in those 5 commits; go to Step 3.
If they still fail → widen the revert. Next-most-suspect: the trunk-preload pipeline (9b7c02aece, 4410704e7a, be2616e6c2, b082c93a4d, 3e0154f245, 4c9968a437, 9db0d5955b, eb0c0c5803).

Step 3 — narrow within the suspect family

Re-add each suspect commit one at a time on top of the all-reverted branch. The commit whose re-addition flips CI back to red is the culprit (or part of a pair — re-add pairs if single re-adds all pass).

git checkout mh/issue21380-probe-revert-branchcache-family
git checkout -b mh/issue21380-probe-restore-733dad5385
git cherry-pick 733dad5385  # oldest first — least likely to depend on others
git push -u origin mh/issue21380-probe-restore-733dad5385
gh pr create ...

Step 4 — fix properly

The bug is almost certainly cache state not matching DB state at the genesis→block-1 boundary on the engine-API path. The fix template — verified working for the analogous #21138 (which sat at the same boundary on the integration-tool path) — was: invalidate the relevant cache at the lifecycle event that produces the mismatch. Do not disable the cache — per the framing in this PR's body ("on and completely trustworthy, or off to measure performance differences, not off because it unexpectedly breaks things"), turning caching off is not an acceptable fix.

The EEST runner constructs short-lived nodes through engineapitester.EngineXTestRunner; check whether the failure is on the second-and-subsequent test in a multi-test fixture (state from test N leaking into test N+1's genesis) — that would point at a missing cache-clear between tests, exactly parallel to ResetExec.

Relevant files:

cmd/evm/enginexrunner.go — test driver entry
execution/engineapi/engineapitester/engine_api_tester.go — test harness
db/state/aggregator.go — search for branchCache; cache lives at a.d[CommitmentDomain].branchCache
execution/commitment/branch_cache.go — Clear(), Invalidate(), GetWithOrigin() methods documented at the top
execution/stagedsync/rawdbreset/reset_stages.go — see f151829dca for the analogous Parallel-exec: route the per-tx writeset through one faithful path; remove normalizeWriteSet/calcState; carry serial-finalize signals on ExecutionResult #21138 fix shape

Step 5 — push the fix

git checkout mh/perf-caches-pr
git merge <your-fix-branch>   # or cherry-pick the single fix commit
make lint && make erigon integration
git push origin mh/perf-caches-pr

CI re-runs on this PR automatically. Confirm EEST is green before declaring done.

Reference SHAs

PR HEAD: f151829dca
Pre-merge PR HEAD: b2a9d649aa
Main HEAD this was rebased onto: ff47bb5bce
Failing CI run: https://github.com/erigontech/erigon/actions/runs/26340677958
Sample failing job (benchmark-1m-parallel): https://github.com/erigontech/erigon/actions/runs/26340677958/job/77542070364

When done

Post the culprit commit + fix as a comment back on this PR. The merge-gating [!IMPORTANT] note in the PR body still applies: both parallel and serial CI must be green before merge.

mh0lt · 2026-05-23T19:56:50Z

Picking this up. Step 2 of the bisection plan done — probe-revert PR at #21381 reverts the 5 BranchCache lifecycle suspects as a unit, on top of current #21380 HEAD (`f151829dca`).

All 5 reverts applied cleanly, `make erigon` / `make integration` / `make lint` clean.

Waiting on its CI to land. Will report back here with:

If EEST is green → narrow within the 5 (cherry-pick back one-at-a-time on `mh/issue21380-probe-restore-` branches per Step 3).
If EEST is still red → widen to the trunk-preload pipeline family per the handoff.

I'll edit the PR with the culprit + fix once narrowed.

The default came from `runtime.NumCPU()` which puts a worker on every visible core. The apply loop, FCU, GC, and other background goroutines end up fighting workers for runqueue slots, and the contention shows up as wall-time variance on every busy core. Drop the default by one so those non-worker goroutines have somewhere to land without preempting a worker. Floored at 1 for single-core environments. Override via EXEC3_WORKERS=N. Also removes the stale 'only half of CPU' comment on ethconfig.Defaults.Sync.ExecWorkerCount, which has been wrong since the default became NumCPU(). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

This reverts commit 9898f3c.

mh0lt · 2026-05-23T23:17:08Z

Probe-1 result (#21381): the 5 BranchCache-lifecycle reverts did not clear EEST. Same deterministic `1fba572808...` vs `ef5a9a5b...` failure. So those 5 are exonerated, and the missing invalidation is not on the unwind side.

Widening to the trunk-preload 8 hits stacked-series conflicts (later commits delete files like `preload.go` that earlier ones modify; sequential reverts can't be applied cleanly).

CI log analysis on the failing job confirms the handoff's tester-reuse hypothesis directly:

`Collected 1076 tests across 3 (fork, preAllocHash) groups; running with 1 workers` → 1041/1076 failures.

→ One worker, three (fork, preAllocHash) tuples, ~360 tests per tuple all sharing a single cached tester. The first test of each tuple sets up cache state; every subsequent test inherits it. The "state from test N leaks into test N+1's genesis" pattern from the handoff.

Probe-2 PR #21383 validates this directly: insert `extr.Evict(test.Fork, test.PreAllocHash)` at the top of `EngineXTestRunner.Run` so every test runs against a fresh node. Not a fix (it defeats the cache), but a 1-line diagnostic.

If EEST goes green on [PROBE-ONLY] EngineXTestRunner: force-evict tester per Run — investigates #21380 EEST regression (hypothesis: cache reuse across tests) #21383 → cache-isolation confirmed. Real fix on State Cache Consolidation (PR #1 of the perf stack) #21380: in-place cache-reset between tests so caching survives within a single test's payload sequence but doesn't poison the next.
If EEST still fails on [PROBE-ONLY] EngineXTestRunner: force-evict tester per Run — investigates #21380 EEST regression (hypothesis: cache reuse across tests) #21383 → shared state beyond the cache is involved; widen.

Will report when #21383 CI lands.

…te tests State tests are now production-aligned: each subtest creates a SharedDomains scope, runs pre-state loading + tx execution + commitment against it, and discards the SD (Close without Flush) when the subtest finishes. Closing without Flush prevents per-subtest writes from entering the long-lived aggregator branch cache, fixing the cross-subtest pollution where test N's pre-state was leaking into test N+1 via the cache and producing wrong-trie-root failures even when each subtest had a fresh DB tx (the tx rollback rolls back disk only — it does not invalidate the cache). * MakePreState now splits into MakePreState (kept for tracetest callers that need flushed pre-state) and MakePreStateInto (no Flush; caller owns the SD). * StateTest.Run / RunNoVerify take a caller-supplied SD. * runStateTests (execution/tests) and runStateTest (cmd/evm) create one SD per subtest with a deferred Close so the cache stays clean.

…sd.mem The SD-discard refactor moved pre-state load into sd.mem and dropped the final Flush, but the tx execution reader was still constructed from the bare tx (NewLatestStateReader(tx)) — which only sees MDBX. Result: pre-state balance writes lived in sd.mem and the tx read MDBX-empty, surfacing as "insufficient funds for gas * price + value" on every state test that funded an account. NewLatestStateReader(sd.AsGetter(tx)) routes reads through the SD's sd.mem -> branchCache -> aggTx layering, so pre-state writes are visible. Same fix applied to MakePreStateInto for consistency with its writer.

…ween cached-tester runs The EngineXTestRunner caches one tester per (fork, preAllocHash) tuple and reuses it across every EEST case that shares those keys — so the slow genesis + node + datadir bootstrap is paid once per group. The aggregator's in-memory branchCache is scoped to the tester's lifetime, which means test N's pre-state populates cache entries that test N+1 still sees when it asks for its own genesis commitment — surfacing as a deterministic 'wrong trie root of block 1' on the second-and-later tests in each group. ResetBranchCache() on EngineApiTester drops the cache via the same HasAgg / aggTx.BranchCache().Clear() path the rawdbreset.ResetExec fix uses (#21138). EngineXTestRunner.Run calls it before each execute so the group's cached tester starts the next case with an empty cache, matching the from-scratch state every test would see if it built its own tester. Cheaper than evicting the tester (which would rebuild the genesis + node on every test) and keeps the cache enabled for the multi-payload sequence within a single test — where the cache provides the perf win.

taratorio · 2026-05-25T07:36:39Z

+	// aggregator-scope branchCache so commitment-trie entries from the prior
+	// test don't leak into this test's genesis→block-1 boundary. Cheaper than
+	// evicting the whole tester (which would rebuild the genesis + node).
+	tester.ResetBranchCache()


this looks like AI slop masking re-org related bugs

the EngineApiTester/EngineXTestRunner should not need to reset any caches because they operate entirely on the NewPayload/FCU engineapi surface.

when an engineapitester is re-used for a different set of tests with the same forkid+genesis the node does a reorg back to genesis when it receives NewPayload+FCUs from new tests and so the existing code path should handle these correctly without this

This PR ships the parallel-exec correctness fixes from `mh/parallel-exec-fixes` onto the perf stack, packaged as a focused PR on top of [#21386 (StateCache LRU)](#21386) which itself stacks on [#21380 (State Cache Consolidation)](#21380). > [!IMPORTANT] > **Stacks on #21386 → #21380.** Base is `mh/perf-statecache-lru-pr`, NOT `main`. Merge order: #21380 → #21386 → this PR. > [!IMPORTANT] > **Do not merge until CI is green on both parallel and serial.** Same gating rule as #21380 / #21386. ## Scope — 13 commits from `mh/parallel-exec-fixes` Brought in via a merge commit so the bisection trail is preserved. | sha | what it fixes | |---|---| | `25053e38e9` | parallel SD-of-pre-existing-contract — the 197-line foundational fix | | `2e2bf3ccc0` | clean exit when single-block batch already covered maxBlockNum | | `6e451f5ed2` | don't emit StoragePath=0 writes from IBS.Selfdestruct | | `616a4fa0a8` | clear calc Deleted on a non-SD account write even when zero | | `d99f2f704d` | gate known parallel-exec failures behind EXEC3_PARALLEL (#21136) | | `34e83e82b7` | install per-block changeset accumulator before any of the block's writes | | `b340d7e592` | drop stale sd.mem 'Trim old version entries' comment | | `629cc23566` | O(1) CollectorWrites fee-balance update, drop dead VersionedWrites.SetBalance | | `a0ecfc7e12` | first-match-wins in CollectorWrites BalancePath index | | `445f97e446` | emit EIP-7708 Burn log under parallel-exec when coinbase self-destructs | | `5e1f5fa901` | mirror ReadAccountData SD-revival check into versionedRead | | `a5dc83f509` | drop two stale EXEC3_PARALLEL t.Skips | | `8af901104f` | drop TestReceiptHashFromRPC unit-suite RPC integration test | ## Merge conflicts resolved 3 files, 8 regions — all resolved by keeping HEAD's typed-readset / per-path revival shape and confirming HEAD already absorbs each fix's intent. See the merge commit message (`cfc4ec1418`) for the per-region rationale. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Mark Holt <erigon@dev-bm-e3-ethmainnet-n4.erigon.io>

# Conflicts: # execution/execmodule/exec_module.go # execution/stagedsync/exec3.go

…al and ValidatePayload (fix 6 hive re-org failures) ValidateChain's fork-payload path runs unwindToCommonCanonical → ValidatePayload. SD.Unwind (called transitively by unwindToCommonCanonical) does selective per-key BranchCache invalidation only for entries listed in changeset[kv.CommitmentDomain]. Commitment-trie internal branches derived during the canonical flush sit in the aggregator-scope BranchCache without appearing in any changeset entry — so the selective sweep misses them and they survive the unwind as stale canonical-lineage entries that poison the fork-payload trie reads → wrong trie root → EngineNewPayload returns INVALID on fork payloads that should be VALID. The PR author's own doctrine on HexPatriciaHashed.Reset (execution/commitment/hex_patricia_hashed.go:2970-2975) explicitly states: "The BranchCache is intentionally NOT cleared here ... Callers that need to invalidate the cache (unwind, fork validation) MUST call ClearBranchCache explicitly." SetHead unwind already follows this doctrine (set_head.go:159 calls sd.ClearBranchCache()). The ValidateChain fork-payload path didn't — this commit closes that gap. Symptom this fixes: all 6 ethereum/engine hive failures on the merge-from-main run (beeyx741q) returned INVALID where VALID expected on re-org scenarios: - Invalid Missing Ancestor ReOrg, StateRoot, EmptyTxs={T,F}, Invalid P{9,10} - Re-Org Back into Canonical Chain, Depth=10, Execute Side Payload on Re-Org - Withdrawals Fork on Block N - M Block Re-Org NewPayload - Withdrawals Fork on Canonical Block N / Side Block M - 10 Block Re-Org All six are EngineNewPayload* on re-org / fork-payload scenarios — exactly the cache-pollution shape this fix addresses. Symmetric across serial and parallel jobs, confirming it's not a parallel-exec regression. Reviewer context: this is the production-side fix taratorio asked about in PR comments — the test-runner ResetBranchCache call is not "AI slop" but defense for the same gap; this commit closes the production-side gap so the test-runner reset becomes belt-and-braces rather than load-bearing.

mh0lt · 2026-05-26T12:43:44Z

Push: production-side BranchCache invalidation fix for the 6 hive re-org failures on the prior CI run (beeyx741q).

Adds an explicit doms.ClearBranchCache() between unwindToCommonCanonical and forkValidator.ValidatePayload in ValidateChain's fork-payload path. The PR author's own doctrine on HexPatriciaHashed.Reset (execution/commitment/hex_patricia_hashed.go:2970-2975) explicitly requires "Callers that need to invalidate the cache (unwind, fork validation) MUST call ClearBranchCache explicitly." SetHead already follows this doctrine; the ValidateChain fork-payload path didn't until now.

This is the production-side counterpart to the engineapitester's ResetBranchCache that @taratorio flagged — that test-runner call is defense for the same gap; with this commit the production path closes the gap so the test-runner reset becomes belt-and-braces rather than load-bearing.

Commit: 6b582fd

Mark Holt and others added 30 commits May 21, 2026 21:35

Mark Holt and others added 2 commits May 23, 2026 18:19

Merge remote-tracking branch 'origin/main' into mh/perf-caches-pr

9f5086b

mh0lt requested review from AskAlexSharov, awskii, sudeepdino008, taratorio and yperbasis as code owners May 23, 2026 18:47

mh0lt mentioned this pull request May 23, 2026

Performance regression on the chain tip #21008

Open

mh0lt mentioned this pull request May 23, 2026

[PROBE-ONLY] revert BranchCache lifecycle family — investigates #21380 EEST regression #21381

Closed

Mark Holt and others added 2 commits May 23, 2026 20:01

Revert "common/dbg, node/ethconfig: default ExecWorkerCount = NumCPU-1"

6454b7b

This reverts commit 9898f3c.

mh0lt mentioned this pull request May 23, 2026

[PROBE-ONLY] EngineXTestRunner: force-evict tester per Run — investigates #21380 EEST regression (hypothesis: cache reuse across tests) #21383

Closed

This was referenced May 24, 2026

StateCache LRU + Mode rework (PR #2 of the perf stack) #21386

Open

Parallel-exec correctness fixes (PR #3 of the perf stack) #21387

Merged

mh0lt added 3 commits May 25, 2026 07:26

taratorio reviewed May 25, 2026

View reviewed changes

taratorio requested changes May 25, 2026

View reviewed changes

Merge remote-tracking branch 'origin/main' into mh/perf-caches-pr

3b11e0e

# Conflicts: # execution/execmodule/exec_module.go # execution/stagedsync/exec3.go

This was referenced May 26, 2026

FCU semaphore decouple + foreground-priority bg-commit worker (PR #4 of the perf stack) #21414

Open

BAL-driven parallel commitment (PR #5 of the perf stack) #21416

Open

mh0lt changed the title ~~State Cache Consolidation~~ State Cache Consolidation (PR #1 of the perf stack) May 26, 2026

mh0lt added this to the 3.5.0 milestone May 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

State Cache Consolidation (PR #1 of the perf stack)#21380

State Cache Consolidation (PR #1 of the perf stack)#21380
mh0lt wants to merge 82 commits into
mainfrom
mh/perf-caches-pr

mh0lt commented May 23, 2026 •

edited

Loading

Uh oh!

mh0lt commented May 23, 2026

Uh oh!

mh0lt commented May 23, 2026

Uh oh!

mh0lt commented May 23, 2026

Uh oh!

taratorio May 25, 2026 •

edited

Loading

Uh oh!

mh0lt commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mh0lt commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. BranchCache — single aggregator-scope commitment cache

2. Trunk preload pipeline

3. Cache lifecycle fixes — the "trustworthy or off" guarantee

BUG #21138 — parallel-exec from-0 wrong trie root

Root cause

Fix

Verification

State

Uh oh!

mh0lt commented May 23, 2026

Handoff: investigation pointers for the 22 CI failures

What's failing

Diagnosis done already

The suspect family — BranchCache lifecycle

Bisection plan

Step 1 — repro locally (optional but faster than CI for narrow)

Step 2 — revert the BranchCache lifecycle family as a unit

Step 3 — narrow within the suspect family

Step 4 — fix properly

Step 5 — push the fix

Reference SHAs

When done

Uh oh!

mh0lt commented May 23, 2026

Uh oh!

mh0lt commented May 23, 2026

Uh oh!

taratorio May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mh0lt commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mh0lt commented May 23, 2026 •

edited

Loading

1. `BranchCache` — single aggregator-scope commitment cache

taratorio May 25, 2026 •

edited

Loading