StateCache LRU + Mode rework (PR #2 of the perf stack) by mh0lt · Pull Request #21386 · erigontech/erigon

mh0lt · 2026-05-24T11:20:46Z

This PR ships the execution/cache LRU/Mode rework + the StateCache population commits as a follow-on to PR #21380 (State Cache Consolidation). The LRU/Mode rework was always meant to ship separately so the policy change can be reviewed independently of #21380's BranchCache work.

Important

Stacks on #21380. Base is mh/perf-caches-pr, NOT main. Merge order: #21380 → this PR.

Important

Do not merge until CI is green on both parallel and serial.

Scope — 11 commits cherry-picked from `mh/all-stack`

| sha (rebased) | source | subject |
|---|---|---|
| `cb4443bf51` | `fba4ce8999` | `execution/cache, db/state/execctx`: SD-transparent ethHash bypass for CodeDomain |
| `d75ec41fcd` | `7d0998d0db` | `execution/cache, db/state, execution/state`: codeSizeCache for EXTCODESIZE / EXTCODEHASH |
| `77cf879d9a` | `cbe9044e52` | `execution/exec, execution/execmodule`: BlockReadAheader populates cache.StateCache |
| `67297a5dfe` | `f2d4c3df74` | `execution/state, execution/cache`: stateObject.code populate + addrToHash LRU |
| `cca736e34d` | `7c3e054063` | `execution/cache, db/state/execctx`: addr → codeHash LRU above SD |
| `2a21a81608` | `c8f10544c0` | `execution/exec`: cachePopulatingGetter caches negative results |
| `2eea7d2c61` | `d01a345062` | `execution/cache`: surface fill-and-freeze cliff via inserts/dropped counters |
| `576c5ade3e` | `8052c84831` | `execution/cache`: replace GenericCache map with sharded LRU + Mode |
| `8e239f3518` | `6b785d4360` | `execution/cache`: STATE_CACHE_MODE env override at NewStateCache time |
| `ad9f74c897` | `c55128565a` | `execution/cache`: correct the LFU rationale in Mode docstring |
| `266e2979bd` | `f80655f6d2` | `execution/cache`: reduce default cache caps to 100 MB each (bench knob) |

One commit deferred

The 12th commit on the original handoff list — 66bcc44702 (BAL-driven BlockStateCache prewarm) — has been dropped from this PR because it depends on the execution/balcache package, which is introduced by PR-A (eth/71 BAL wire protocol) off main. It will be reintroduced as a small follow-up PR once both this PR and PR-A have merged.

🤖 Generated with Claude Code

…CodeDomain

Adds a third map (`ethHashToCode`) to CodeCache, keyed by the 32-byte Ethereum codeHash (keccak256). New methods `GetByEthHash` and `PutWithEthHash` expose direct L2b access without going through the addr→maphash→code two-level path. The byte storage duplicates L2 in the worst case (2x code-bytes memory at the cap); accepted for the per-key fast path on many-addrs-one-code workloads.

`SharedDomains.GetLatest(CodeDomain, ...)` consults L2b transparently: when the addr-keyed cache misses, resolve the codeHash from the AccountsDomain (typically warm because the EVM just loaded the account), then probe `stateCache.GetCodeByHash` before falling through to the file accessor stack. On miss, fill both L1 and L2b via PutCodeWithHash. The fast path is unchanged.

Workload shape this targets: many addresses sharing one codeHash (proxies, factory-deployed clones, ERC-20 holders, OpenZeppelin templates). Today's addr-keyed cache misses on every fresh address even when the bytecode is already cached. With this change a single L2b entry serves N addresses after the first population.

Microbench results:

- BenchmarkCodeCache_GetByEthHash_Hit: 17.01 ns/op
- BenchmarkCodeCache_GetByEthHash_Miss: 15.45 ns/op
- BenchmarkCodeCache_Get_AddrLevel_Hit: 32.44 ns/op (existing)
- BenchmarkCodeCache_GetByEthHash_ManyAddrs: 17.02 ns/op

L2b hit is ~2x faster than the existing two-level addr path (one map probe vs two), and enables hits on workloads where L1 would miss.

Cross-client research at agentspecs/cross-client-state-access-2026-05-14.md notes geth's separate codeSizeCache as the further (geth-proven) win for EXTCODESIZE/EXTCODEHASH and addrToHash LRU as a one-line behaviour fix; both queued as follow-up surgical commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
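The lookup order in the ethHash bypass commit can be sketched roughly as below. This is a minimal illustration, not the Erigon code: `codeCache`, `newCodeCache` and `getCode` are hypothetical names, the addr-keyed two-level path is collapsed into one map, and both maps share one slice rather than duplicating bytes.

```go
package main

import "sync"

// codeCache is a hypothetical, simplified stand-in for the CodeCache in the
// commit message. addrToCode collapses the addr→maphash→code path into one map.
type codeCache struct {
	mu            sync.RWMutex
	addrToCode    map[[20]byte][]byte // addr-keyed layer (simplified)
	ethHashToCode map[[32]byte][]byte // "L2b": codeHash -> bytes
}

func newCodeCache() *codeCache {
	return &codeCache{
		addrToCode:    make(map[[20]byte][]byte),
		ethHashToCode: make(map[[32]byte][]byte),
	}
}

// GetByEthHash probes the codeHash-keyed layer with a single map lookup.
func (c *codeCache) GetByEthHash(h [32]byte) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	code, ok := c.ethHashToCode[h]
	return code, ok
}

// PutWithEthHash fills both the addr-keyed layer and the codeHash-keyed layer.
func (c *codeCache) PutWithEthHash(addr [20]byte, h [32]byte, code []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.addrToCode[addr] = code
	c.ethHashToCode[h] = code
}

// getCode follows the order from the commit message: addr-keyed probe, then
// the codeHash probe, then the slow loader (standing in for the file accessor
// stack), which populates both layers on the way out.
func (c *codeCache) getCode(addr [20]byte, h [32]byte, load func() []byte) []byte {
	c.mu.RLock()
	code, ok := c.addrToCode[addr]
	c.mu.RUnlock()
	if ok {
		return code
	}
	if code, ok := c.GetByEthHash(h); ok {
		return code
	}
	code = load()
	c.PutWithEthHash(addr, h, code)
	return code
}
```

With two addresses that share one codeHash, the loader runs only for the first address; the second is served from the codeHash-keyed map.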

…SIZE / EXTCODEHASH

Adds a third caching layer to CodeCache (alongside L1 addr→maphash and L2b ethHash→bytes): codeSizeByEthHash maps the 32-byte Ethereum codeHash to its byte length. Tiny per-entry footprint (32B key + 8B value vs 5-10 KB for full bytes) so the same memory budget gives ~1000x the hit surface. Capped at 1M entries (geth core/state/database_code.go uses the same size).

EXTCODESIZE / EXTCODEHASH callers (historically the slowest opcodes on the lab dashboard's bench) answer from a single map probe without paying the file accessor stack cost of the full bytes. Geth-proven; cross-client writeup at agentspecs/cross-client-state-access-2026-05-14.md notes this as the largest single available win for the synthetic bench.

Wiring:

- CodeCache.GetCodeSizeByEthHash / PutCodeSizeByEthHash: direct accessors.
- PutWithEthHash now populates the size layer alongside L2b, so every bytes-load creates a future fast-path entry "for free".
- StateCache wrappers GetCodeSizeByHash / PutCodeSizeByHash.
- SharedDomains.GetCodeSize(tx, addr): the SD-transparent fast path. Resolve codeHash via the AccountsDomain cache chain, probe the size cache, then L2b, then file-read+populate. Returns (0, false, nil) for EOAs and no-code accounts without paying any file read.
- temporalGetter.GetCodeSize so callers reach it via the existing getter.
- ReaderV3.ReadAccountCodeSize type-asserts on a codeSizeGetter interface and routes through the fast path when the underlying getter supports it; falls back to GetLatest+len otherwise. No kv.TemporalGetter interface change.

Limitation: capacity is no-op-when-full, not LRU. A separate surgical commit will swap to real LRU eviction; mirrors the addrToHash fix queued from the same cross-client writeup.

Tests: 3 new (PopulatedAlongsideBytes, DirectPutAndGet, EmptyHashOrNegativeIsNoOp). All existing CodeCache tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
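A rough sketch of the size layer and its probe order follows, under stated assumptions: `codeSizeCache`, `newCodeSizeCache` and `codeSize` are hypothetical names, the cache is not safe for concurrent use, and a plain map stands in for L2b. The no-op-when-full cap and the empty-hash / negative-size no-op mirror the limitation and test names in the commit message.

```go
package main

// codeSizeCache is a hypothetical stand-in for the codeSizeByEthHash layer.
// It is not safe for concurrent use; a real cache would need synchronisation.
type codeSizeCache struct {
	maxEntries int
	sizes      map[[32]byte]int
}

func newCodeSizeCache(maxEntries int) *codeSizeCache {
	return &codeSizeCache{maxEntries: maxEntries, sizes: make(map[[32]byte]int)}
}

// Put ignores the empty hash and negative sizes, and drops new keys once the
// entry cap is reached (no-op-when-full, as the commit message describes).
func (c *codeSizeCache) Put(h [32]byte, size int) {
	if h == ([32]byte{}) || size < 0 {
		return
	}
	if _, exists := c.sizes[h]; !exists && len(c.sizes) >= c.maxEntries {
		return
	}
	c.sizes[h] = size
}

// Get returns the cached byte length for a codeHash, if present.
func (c *codeSizeCache) Get(h [32]byte) (int, bool) {
	size, ok := c.sizes[h]
	return size, ok
}

// codeSize probes the size cache, then the bytes cache, then the slow loader.
// The empty hash (EOA or no-code account) returns (0, false) without a load.
func codeSize(h [32]byte, sizes *codeSizeCache, bytesByHash map[[32]byte][]byte, load func() []byte) (int, bool) {
	if h == ([32]byte{}) {
		return 0, false
	}
	if n, ok := sizes.Get(h); ok {
		return n, true
	}
	code, ok := bytesByHash[h]
	if !ok {
		code = load()
		bytesByHash[h] = code
	}
	sizes.Put(h, len(code))
	return len(code), true
}
```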

…e.StateCache

The BlockReadAheader has always prefetched BAL-listed (and access-list) addresses' account/code/storage via a fresh ReaderV3 on a separate RoTx. Its prefetches warmed OS page cache + RoTx cursors, disconnected from the process-global cache.StateCache that SharedDomains.GetLatest probes on the EVM hot path. The two layers were two separate caches; nothing the prefetcher loaded ever reached the EVM's lookup path.

Reth's structural advantage on EXTCODESIZE-loop benches is that its prewarm writes to the same hashmap the EVM reads from (crates/engine/execution-cache/src/cached_state.rs:663). When EVM enters, every BAL-listed addr's first read is a 20 ns cache probe: no file accessor stack, no decompression CPU. PR #21128 swapped this from mini-moka to a lock-free fixed-cache for a measured +10.8 % mgas/s.

This commit closes the equivalent gap on Erigon: a thin cache-populating TemporalGetter wrapper writes successful reads through to cache.StateCache as a side effect. ReaderV3 is unchanged; the wrapper sits in front. When the prefetcher already has the codeHash from a preceding account read, the next CodeDomain read routes through StateCache.PutCodeWithHash so the L2b (ethHash → bytes) + size-cache layers fill alongside the bare addr-keyed L1.

Wiring:

- BlockReadAheader.SetStateCache(*cache.StateCache) setter.
- ExecModule construction calls readAheader.SetStateCache(domainCache), so the same StateCache the FCU/canonical path wires onto SD is the one the prefetcher warms.
- cachePopulatingGetter wraps the worker's ttx; both BAL-warming and tx-warming paths gain the same treatment.

Fgprof on the EXTCODESIZE-EXISTING_CONTRACT-30M bench had shown 95 % of EVM wall-clock in seg.Getter.nextPos (Huffman decompression of code values). With this commit, every BAL-listed addr's lookup should hit the cache and skip the file accessor stack entirely, eliminating the dominant cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
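The write-through wrapper idea can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the one-method `getter` interface, the string keys, `mapGetter` and the map-backed `stateCache` are hypothetical simplifications, not the kv.TemporalGetter or cache.StateCache APIs.

```go
package main

// getter is a hypothetical one-method simplification of the temporal getter.
type getter interface {
	GetLatest(key string) ([]byte, error)
}

// mapGetter stands in for the read-ahead worker's transaction.
type mapGetter map[string][]byte

func (m mapGetter) GetLatest(key string) ([]byte, error) { return m[key], nil }

// stateCache stands in for the process-global cache the EVM path probes.
type stateCache struct {
	entries map[string][]byte
}

// cachePopulatingGetter forwards reads to inner and, as a side effect, writes
// successful non-empty results through to the shared cache.
type cachePopulatingGetter struct {
	inner getter
	cache *stateCache
}

func (g cachePopulatingGetter) GetLatest(key string) ([]byte, error) {
	v, err := g.inner.GetLatest(key)
	if err == nil && len(v) > 0 {
		g.cache.entries[key] = v // the prefetch now warms the EVM-visible cache
	}
	return v, err
}
```

Because the wrapper satisfies the same interface as the getter it wraps, the reader built on top of it needs no changes, which matches the "ReaderV3 is unchanged" point above.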

…ash LRU

Two surgical commits bundled (both touch the code-read hot path):

1. IntraBlockState.GetCodeSize now loads the full bytes via stateReader.ReadAccountCode on first touch and populates stateObject.code, so subsequent same-addr EXTCODESIZE / EXTCODEHASH / CALL within the tx are in-struct slice-len calls (~50 ns), not full reader round-trips. Mirrors geth's pattern at core/state/state_object.go ~Code(): pay one read per addr per tx, free for the rest.
2. CodeCache.addrToHash switched from a no-op-when-full maphash.Map[versionedAddressID] to an LRU lru.Cache[[20]byte, versionedAddressID] (hashicorp/golang-lru/v2, already imported elsewhere). Cap derived from the existing byte budget at ~28 bytes/entry (~580 k entries for the 16 MB default). Fresh-address workloads (on mainnet, thousands of new addrs per block) now warm up the addr layer over time instead of silently dropping new entries forever; matches geth's lru.Cache at core/state/database_code.go.

The hashToCode layer is unchanged (content-addressed bytes, immutable, byte-capped with new-entry no-op when full; the same semantic as before since code bytes by codeHash never change).

Bench on the EXTCODESIZE-EXISTING_CONTRACT-30M family: 62.34 mgas/s (was 61.50). The marginal gain is small on this bench because BAL prefetch already populates the cache layers; neither lever fires heavily. The expected wins are on non-BAL workloads where EXTCODESIZE-loop patterns repeat within a tx (#1) and fresh-address-churn mainnet blocks fill the addr layer (#2).

Updated TestCodeCache_AddrCapacityLimit to assert LRU eviction (was asserting no-op-when-full); the prior behaviour was the bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
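Item 1 above is a load-once memoization on the state object. A minimal sketch, assuming a hypothetical `stateObject` with only the two fields shown and a `readCode` callback standing in for stateReader.ReadAccountCode:

```go
package main

// stateObject is a hypothetical, stripped-down version of the per-account
// object; only the fields needed for the memoization are shown.
type stateObject struct {
	code       []byte
	codeLoaded bool
}

// Code loads the bytes on first touch and keeps them on the object, so later
// calls for the same address skip the reader round-trip.
func (o *stateObject) Code(readCode func() []byte) []byte {
	if !o.codeLoaded {
		o.code = readCode()
		o.codeLoaded = true
	}
	return o.code
}

// CodeSize answers from the in-struct slice length after the first load.
func (o *stateObject) CodeSize(readCode func() []byte) int {
	return len(o.Code(readCode))
}
```

The `codeLoaded` flag (rather than a nil check on `code`) keeps accounts with empty code from being re-read on every call.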

Nethermind-style addr → 32-byte codeHash LRU sitting above SharedDomains.codeHashForAddr. When the EVM-known codeHash for an address has already been resolved once, subsequent lookups skip the entire AccountsDomain chain (sd.mem → sd.parent.mem → sd.stateCache → tx.GetLatest) and the account-RLP decode.

Wiring:

- CodeCache adds addrToEthHash *lru.Cache[[20]byte, [32]byte] sized to the existing addrCapacityB budget; methods GetAddrCodeHash / PutAddrCodeHash / DeleteAddrCodeHash.
- StateCache wrappers route to the CodeCache instance.
- SD.codeHashForAddr probes the LRU first; on miss falls through to the existing chain and populates on the way out (including the zero-hash sentinel for missing-or-EOA accounts, so repeat lookups return immediately).
- Invalidation: SD.DomainPut for AccountsDomain drops the entry (CREATE / CREATE2-replace path); SD.DomainDel for AccountsDomain also drops the entry (SELFDESTRUCT); StateCache.RevertWithDiffset drops on unwind.

Helps non-BAL workloads where codeHashForAddr is currently the cold account-domain probe. On the EXISTING_CONTRACT bench (BAL prefetch already populates everything), this is within noise; the lever is for mainnet workloads where many addresses miss the BAL-prefetch window but the cache is warm from prior lookups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
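The probe, populate and invalidate cycle can be sketched as below. This is illustrative only: `addrHashCache` and its methods are hypothetical, a plain unbounded map replaces the LRU, and the `slow` callback stands in for the AccountsDomain chain plus account decode.

```go
package main

// addrHashCache is a hypothetical stand-in for the addr → codeHash layer.
// A plain map is used here; the commit uses a bounded LRU.
type addrHashCache struct {
	hashes map[[20]byte][32]byte
}

// codeHashForAddr probes the cache first. On a miss it calls the slow chain
// and caches whatever comes back, including the zero hash that marks a
// missing or EOA account, so repeat lookups return immediately.
func (c *addrHashCache) codeHashForAddr(addr [20]byte, slow func([20]byte) [32]byte) [32]byte {
	if h, ok := c.hashes[addr]; ok {
		return h
	}
	h := slow(addr)
	c.hashes[addr] = h
	return h
}

// invalidate drops the entry; in the commit this happens on account writes,
// account deletes and unwinds.
func (c *addrHashCache) invalidate(addr [20]byte) {
	delete(c.hashes, addr)
}
```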

The cache-populating wrapper on the read-ahead worker's TemporalTx previously gated cache writes on `len(v) > 0`. That dropped negative results (missing accounts, empty storage slots, no-code probes) on the floor. Repeated probes of the same missing address re-paid the file accessor stack walk every time, instead of hitting a cached negative entry.

Mirrors the revm pattern that drives reth's 1700-3400 mgas/s on account_access NON_EXISTING / EXISTING_EOA variants: revm represents a missing address as a real CacheAccount{ account: None, status: LoadedNotExisting } and reth's ExecutionCache.account_cache uses FixedCache<Address, Option<Account>> where None is a first-class cacheable value. Bottom of the reth path is: BAL prewarm calls basic_account once → returns None → cache hit forever for that addr.

The cycle-2 sweep on account_access[EXTCODESIZE/NON_EXISTING/30M] showed 3.65 → 494 mgas/s without this fix; with the fix the same bench reports 508 mgas/s (within run-to-run noise but trending right). Most of the win was already captured by the readAhead-populates-cache.StateCache wiring (commit cbe9044) and the balcache port (d41e2e8); those raised the cache hit rate on populated entries enough that the EVM rarely fell through to the file accessor on this bench. The fix is mechanically correct regardless and should matter more on workloads with mixed populated / negative probes across blocks.

See agentspecs/reth-missing-eoa-fastpath-2026-05-15.md for the detailed mechanism analysis and the three concrete copy-able patterns from reth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
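The effect of removing the `len(v) > 0` gate can be shown with a small sketch. The `readCache` type and its `cacheNegatives` switch are hypothetical; the switch exists only so the old and new behaviour can be compared side by side.

```go
package main

// readCache is a hypothetical cache in front of a slow reader. A key that is
// present in entries with an empty value is a cached negative result.
type readCache struct {
	cacheNegatives bool // false reproduces the old len(v) > 0 gate
	entries        map[string][]byte
}

// get returns the cached value when the key is present, even if the value is
// empty. Otherwise it reads through and caches the result per the switch.
func (c *readCache) get(key string, read func(string) []byte) []byte {
	if v, ok := c.entries[key]; ok {
		return v
	}
	v := read(key)
	if len(v) > 0 || c.cacheNegatives {
		c.entries[key] = v
	}
	return v
}
```

With the gate in place, two probes of a missing key hit the slow reader twice; with negatives cached, the second probe is answered from the map.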

…unters

GenericCache.Put has no eviction policy. When the byte budget is reached, new keys are silently dropped until Clear/ClearWithHash/ValidateAndPrepare-mismatch resets the cache. On a long-running node this manifests as a monotonic miss-rate climb that's hard to attribute without instrumentation.

Add two counters next to hits/misses:

- inserts: new keys accepted
- dropped: new keys rejected at the budget check (the existing branch at the new-key cap; not a behaviour change)

PrintStatsAndReset logs both. Sets up the diagnostic baseline before the eviction-policy swap in the follow-up commits on this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
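A minimal sketch of the fill-and-freeze behaviour with the two counters follows. The `fillAndFreezeCache` type is hypothetical, sizes count value bytes only, and the counters are plain integers where a concurrent cache would likely use atomics; the size accounting on overwrite is an assumption.

```go
package main

// fillAndFreezeCache is a hypothetical byte-budgeted cache with no eviction.
type fillAndFreezeCache struct {
	capacityBytes int
	currentSize   int
	data          map[string][]byte
	inserts       uint64 // new keys accepted
	dropped       uint64 // new keys rejected at the budget check
}

// Put accepts updates to existing keys, and accepts new keys only while the
// byte budget allows. Rejected new keys are counted instead of vanishing
// silently.
func (c *fillAndFreezeCache) Put(key string, value []byte) {
	if old, ok := c.data[key]; ok {
		c.currentSize += len(value) - len(old)
		c.data[key] = value
		return
	}
	if c.currentSize+len(value) > c.capacityBytes {
		c.dropped++
		return
	}
	c.data[key] = value
	c.currentSize += len(value)
	c.inserts++
}
```

A rising `dropped` count alongside a flat `inserts` count is the signature of the frozen state the commit message describes.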

Replaces the maphash.Map[T] backing store in GenericCache with freelru.ShardedLRU[uint64, entry[T]] (same lib as db/state/cache.go; already in go.mod). Adds a Mode constructor flag:

- ModeEvictLRU (default): per-shard LRU evicts the oldest entry on insert when its slot cap is reached. OnEvict drops bytes from currentSize.
- ModeNoOp: preserves the historical fill-and-freeze behaviour (silently drop new keys at the byte cap; counted via dropped). Kept as the diagnostic baseline so the regression bench can compare A/B.

Per-shard eviction is a known trade-off of freelru.ShardedLRU: RemoveOldest is shard-local, not globally LRU. Matches the trade-off db/state/cache.go / execution/cache/code_cache.go / execution/balcache/balcache.go already accept. LFU (W-TinyLFU, the policy reth uses) is scan-resistant by design and would slot in behind the same Mode wrapper as a follow-up; the seam is documented at policy.go.

Key shape: pre-hash via common/maphash.Hash (Go's randomized stdlib hasher, already used by the previous maphash.Map) into uint64; entry stores the full key for collision check. Same pattern as db/state/cache.go.

Byte-budget translation: per-domain avg-entry constants in state_cache.go (avgAccountEntryBytes / avgStorageEntryBytes / avgCommitmentEntryBytes). Account / storage are near-fixed sizes so the translation is reliable. capacityBytes becomes a sizing hint plus telemetry (SizeBytes / PrintStatsAndReset). Code domain is unchanged; CodeCache wraps its own LRUs.

Adds metrics: inserts, evictions, dropped, all exposed in PrintStatsAndReset alongside the existing hits / misses / hit_rate. Mode is also logged.

Touches one external call site: execution/vm/contract.go's jumpDestCache now constructs with ModeEvictLRU.

Tests: TestDomainCache_PutCapacityLimit renamed to ..._NoOpMode and asserts the fill-and-freeze contract under explicit ModeNoOp. New TestDomainCache_PutEvictsWhenFull_EvictMode asserts eviction under ModeEvictLRU using a small entry-count cap (the byte→entry translation is approximate; the test uses the entry-count knob via the in-package newGenericCacheEntries constructor to make the assertion deterministic).

Pre-existing lint issues on mh/sd-code-cache (intra_block_state.go nilness, preload_parallel.go prealloc) are surfaced by lint non-determinism but are out of this commit's scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
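The difference between the two modes can be sketched with a single-shard cache built on the standard library's `container/list`. This is not the freelru-based implementation: `modeCache`, `newModeCache` and the string keys are hypothetical, there is no sharding or pre-hashing, and the cap is an entry count rather than a byte budget.

```go
package main

import "container/list"

// Mode selects what happens when a new key arrives at a full cache.
type Mode int

const (
	ModeEvictLRU Mode = iota // evict the least recently used entry
	ModeNoOp                 // drop the new key (fill-and-freeze)
)

type lruEntry struct {
	key   string
	value []byte
}

// modeCache is a hypothetical single-shard, entry-capped cache.
type modeCache struct {
	mode       Mode
	maxEntries int
	order      *list.List // front = most recently used
	items      map[string]*list.Element
	inserts    uint64
	evictions  uint64
	dropped    uint64
}

func newModeCache(mode Mode, maxEntries int) *modeCache {
	if maxEntries < 1 {
		maxEntries = 1 // keep the eviction path well-defined
	}
	return &modeCache{mode: mode, maxEntries: maxEntries,
		order: list.New(), items: make(map[string]*list.Element)}
}

// Get returns the value and marks the entry as most recently used.
func (c *modeCache) Get(key string) ([]byte, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*lruEntry).value, true
}

// Put updates existing keys in place. For a new key at capacity it either
// evicts the oldest entry (ModeEvictLRU) or drops the new key (ModeNoOp).
func (c *modeCache) Put(key string, value []byte) {
	if el, ok := c.items[key]; ok {
		el.Value.(*lruEntry).value = value
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.maxEntries {
		if c.mode == ModeNoOp {
			c.dropped++
			return
		}
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*lruEntry).key)
		c.evictions++
	}
	c.items[key] = c.order.PushFront(&lruEntry{key: key, value: value})
	c.inserts++
}
```

Running the same insert sequence through both modes gives the A/B comparison the commit message mentions: under ModeEvictLRU the newest key is always admitted, under ModeNoOp it is counted as dropped.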

Single env knob read once at NewStateCache. Default ModeEvictLRU, recognised override "noop" (for the regression-bench baseline so ModeEvictLRU and ModeNoOp can be compared on the same binary). Unrecognised values fall back to evict with a warn log. ModeNoOp engagement is logged at info level because the fill-and-freeze behaviour is a deliberate diagnostic state, not a production setting. Pattern matches db/state/cache.go's D_LRU_ENABLED / D_LRU knobs (dbg.EnvString from common/dbg). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
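The override logic can be sketched as below, using only the standard library. Names other than STATE_CACHE_MODE, ModeEvictLRU, ModeNoOp and the "noop" value are hypothetical, exact-match comparison is an assumption, and `os.Getenv` / `log.Printf` stand in for the dbg.EnvString helper and the project logger.

```go
package main

import (
	"log"
	"os"
)

// Mode mirrors the two modes named in the commit message.
type Mode int

const (
	ModeEvictLRU Mode = iota
	ModeNoOp
)

// parseMode maps the raw env value to a Mode. The second result is false for
// unrecognised values, which fall back to ModeEvictLRU.
// Assumption: an unset/empty value means default, and "noop" is matched exactly.
func parseMode(raw string) (Mode, bool) {
	switch raw {
	case "":
		return ModeEvictLRU, true
	case "noop":
		return ModeNoOp, true
	default:
		return ModeEvictLRU, false
	}
}

// modeFromEnv reads the knob once; a constructor would call it a single time.
func modeFromEnv() Mode {
	raw := os.Getenv("STATE_CACHE_MODE")
	mode, recognised := parseMode(raw)
	if !recognised {
		log.Printf("WARN unrecognised STATE_CACHE_MODE=%q, falling back to evict", raw)
	}
	if mode == ModeNoOp {
		log.Printf("INFO state cache running in noop (fill-and-freeze) mode")
	}
	return mode
}
```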

The previous comment asserted "reth uses W-TinyLFU for state caches", which is wrong on the execution hot path. Reth's cross-block state cache is `fixed-cache` (PR #21128, v1.11.0): a lock-free direct-mapped / set-associative array with collision-evict semantics. No LRU list, no LFU sketch. Their published wins (~25% newPayload p50 / +33% gas/s) came from *removing* LRU/LFU bookkeeping, not adding LFU. Where reth uses real LRU/LFU it's deliberate and not the execution cache (schnellru::LruMap for networking; moka in precompile_cache.rs explicitly configured with eviction_policy(EvictionPolicy::lru())).

The docstring now lists two follow-up policies, both real:

- ModeEvictFixedCache (reth's actual choice, more interesting structural option than LFU)
- ModeEvictLFU (W-TinyLFU; helps mainnet steady-state, not the cycle-2 bloat fixtures which are pure cold scans)

Decision criterion (per agentspecs/lfu-vs-lru-state-cache-decision-2026-05-15.md): ship ModeEvictLFU only if a 24h mainnet replay shows current sharded-LRU hit-rate < 90 % on Account/Storage. Otter is the only credible Go W-TinyLFU library; ristretto has documented correctness bugs and is disqualified for an EL hot path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
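To make "direct-mapped with collision-evict semantics" concrete, here is a minimal sketch. It illustrates the general technique only, not reth's fixed-cache and not a proposed ModeEvictFixedCache implementation: `fixedCache` is hypothetical, it is not lock-free, and FNV-1a is an arbitrary hash choice.

```go
package main

import "hash/fnv"

// slot holds one entry; each key maps to exactly one slot.
type slot struct {
	key   string
	value []byte
	used  bool
}

// fixedCache is a hypothetical direct-mapped cache: no LRU list, no frequency
// sketch. A new key simply overwrites whatever occupies its slot.
type fixedCache struct {
	slots []slot
}

func newFixedCache(n int) *fixedCache {
	if n < 1 {
		n = 1
	}
	return &fixedCache{slots: make([]slot, n)}
}

// index hashes the key to its single candidate slot.
func (c *fixedCache) index(key string) int {
	h := fnv.New64a()
	h.Write([]byte(key))
	return int(h.Sum64() % uint64(len(c.slots)))
}

// Put overwrites the slot, evicting any colliding entry.
func (c *fixedCache) Put(key string, value []byte) {
	c.slots[c.index(key)] = slot{key: key, value: value, used: true}
}

// Get compares the stored key so a colliding entry is reported as a miss.
func (c *fixedCache) Get(key string) ([]byte, bool) {
	s := c.slots[c.index(key)]
	if !s.used || s.key != key {
		return nil, false
	}
	return s.value, true
}
```

The appeal of this shape is that reads and writes touch one slot and update no shared ordering structure, which is the bookkeeping the commit message says reth removed.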

Investigation knob, NOT a permanent default. Account / Storage / Code each capped at 100 MB so the bench measures layer contributions instead of being dominated by preallocated cache memory pressure (1 GB / 1 GB / 512 MB defaults push sys past the GC/page-cache pressure band on this hardware/workload mix). Permanent defaults stay at 1 GB / 1 GB / 512 MB; this commit will be reverted or dynamically gated by relative-to-available sizing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

This PR ships the parallel-exec correctness fixes from `mh/parallel-exec-fixes` onto the perf stack, packaged as a focused PR on top of [#21386 (StateCache LRU)](#21386) which itself stacks on [#21380 (State Cache Consolidation)](#21380).

> [!IMPORTANT]
> **Stacks on #21386 → #21380.** Base is `mh/perf-statecache-lru-pr`, NOT `main`. Merge order: #21380 → #21386 → this PR.

> [!IMPORTANT]
> **Do not merge until CI is green on both parallel and serial.** Same gating rule as #21380 / #21386.

## Scope: 13 commits from `mh/parallel-exec-fixes`

Brought in via a merge commit so the bisection trail is preserved.

| sha | what it fixes |
|---|---|
| `25053e38e9` | parallel SD-of-pre-existing-contract: the 197-line foundational fix |
| `2e2bf3ccc0` | clean exit when single-block batch already covered maxBlockNum |
| `6e451f5ed2` | don't emit StoragePath=0 writes from IBS.Selfdestruct |
| `616a4fa0a8` | clear calc Deleted on a non-SD account write even when zero |
| `d99f2f704d` | gate known parallel-exec failures behind EXEC3_PARALLEL (#21136) |
| `34e83e82b7` | install per-block changeset accumulator before any of the block's writes |
| `b340d7e592` | drop stale sd.mem 'Trim old version entries' comment |
| `629cc23566` | O(1) CollectorWrites fee-balance update, drop dead VersionedWrites.SetBalance |
| `a0ecfc7e12` | first-match-wins in CollectorWrites BalancePath index |
| `445f97e446` | emit EIP-7708 Burn log under parallel-exec when coinbase self-destructs |
| `5e1f5fa901` | mirror ReadAccountData SD-revival check into versionedRead |
| `a5dc83f509` | drop two stale EXEC3_PARALLEL t.Skips |
| `8af901104f` | drop TestReceiptHashFromRPC unit-suite RPC integration test |

## Merge conflicts resolved

3 files, 8 regions, all resolved by keeping HEAD's typed-readset / per-path revival shape and confirming HEAD already absorbs each fix's intent. See the merge commit message (`cfc4ec1418`) for the per-region rationale.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Mark Holt <erigon@dev-bm-e3-ethmainnet-n4.erigon.io>


mh0lt requested review from AskAlexSharov, sudeepdino008 and yperbasis as code owners May 24, 2026 11:20

mh0lt mentioned this pull request May 24, 2026

Parallel-exec correctness fixes (PR #3 of the perf stack) #21387

Merged

Mark Holt and others added 11 commits May 25, 2026 07:28

mh0lt force-pushed the mh/perf-statecache-lru-pr branch from 266e297 to 4a512ce May 25, 2026 07:29

mh0lt and others added 3 commits May 25, 2026 15:09

Merge remote-tracking branch 'origin/mh/perf-caches-pr' into mh/perf-…

b7d89ad

…statecache-lru-pr

execution/state: drop tautological nil checks in code-loading paths (…

9b15362

…govet) stateObject and s are both verified non-nil earlier in their respective scopes; the secondary checks at lines 749 and 783 are redundant. govet nilness check fails on these.

mh0lt wants to merge 14 commits into mh/perf-caches-pr from mh/perf-statecache-lru-pr
