StateCache LRU + Mode rework (PR #2 of the perf stack) #21386
Open
mh0lt wants to merge 14 commits into
…CodeDomain

Adds a third map (`ethHashToCode`) to CodeCache, keyed by the 32-byte Ethereum codeHash (keccak256). New methods `GetByEthHash` and `PutWithEthHash` expose direct L2b access without going through the addr→maphash→code two-level path. The byte storage duplicates L2 in the worst case (2x code-bytes memory at the cap); accepted for the per-key fast path on many-addrs-one-code workloads.

`SharedDomains.GetLatest(CodeDomain, ...)` consults L2b transparently: when the addr-keyed cache misses, resolve the codeHash from the AccountsDomain (typically warm because the EVM just loaded the account), probe `stateCache.GetCodeByHash` before falling through to the file accessor stack. On miss, fill both L1 and L2b via PutCodeWithHash. The fast path is unchanged.

Workload shape this targets: many addresses sharing one codeHash (proxies, factory-deployed clones, ERC-20 holders, OpenZeppelin templates). Today's addr-keyed cache misses on every fresh address even when the bytecode is already cached. With this change a single L2b entry serves N addresses after the first population.

Microbench results:
- BenchmarkCodeCache_GetByEthHash_Hit: 17.01 ns/op
- BenchmarkCodeCache_GetByEthHash_Miss: 15.45 ns/op
- BenchmarkCodeCache_Get_AddrLevel_Hit: 32.44 ns/op (existing)
- BenchmarkCodeCache_GetByEthHash_ManyAddrs: 17.02 ns/op

L2b hit is ~2x faster than the existing two-level addr path (one map probe vs two), and enables hits on workloads where L1 would miss.

Cross-client research at agentspecs/cross-client-state-access-2026-05-14.md notes geth's separate codeSizeCache as the further (geth-proven) win for EXTCODESIZE/EXTCODEHASH and addrToHash LRU as a one-line behaviour fix; both queued as follow-up surgical commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
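The lookup order described in this commit can be sketched roughly as follows. This is a minimal illustration, not the Erigon code: `codeCache`, `getCode`, `resolveHash` and `load` are hypothetical names, plain maps stand in for the real cache tiers, and whether an L2b hit also back-fills the address tier is left out because the commit message does not say.

```go
package main

import "fmt"

type (
	Address  [20]byte
	CodeHash [32]byte
)

// codeCache is a hypothetical stand-in for the two tiers discussed above:
// an address-keyed tier and a codeHash-keyed tier (L2b).
type codeCache struct {
	byAddr    map[Address][]byte
	byEthHash map[CodeHash][]byte
}

func newCodeCache() *codeCache {
	return &codeCache{
		byAddr:    make(map[Address][]byte),
		byEthHash: make(map[CodeHash][]byte),
	}
}

// getCode probes the address tier, then the codeHash tier, and only then
// calls the slow loader. resolveHash and load are assumed callbacks standing
// in for the AccountsDomain lookup and the file accessor stack.
func (c *codeCache) getCode(
	addr Address,
	resolveHash func(Address) (CodeHash, bool),
	load func(Address) []byte,
) []byte {
	if code, ok := c.byAddr[addr]; ok {
		return code
	}
	h, ok := resolveHash(addr)
	if !ok {
		return nil // no account, so no code to return
	}
	if code, ok := c.byEthHash[h]; ok {
		return code // a different address already loaded this bytecode
	}
	code := load(addr)
	c.byAddr[addr] = code // on a full miss, fill both tiers
	c.byEthHash[h] = code
	return code
}

func main() {
	c := newCodeCache()
	loads := 0
	shared := CodeHash{0xaa}
	resolve := func(Address) (CodeHash, bool) { return shared, true }
	load := func(Address) []byte { loads++; return []byte{0x60, 0x00} }
	for i := 0; i < 3; i++ {
		c.getCode(Address{byte(i)}, resolve, load)
	}
	fmt.Println("slow loads:", loads)
}
```

In this sketch three distinct addresses sharing one codeHash trigger a single slow load, which is the many-addrs-one-code shape the commit targets.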
…SIZE / EXTCODEHASH

Adds a third caching layer to CodeCache (alongside L1 addr→maphash and L2b ethHash→bytes): codeSizeByEthHash maps the 32-byte Ethereum codeHash to its byte length. Tiny per-entry footprint (32B key + 8B value vs 5-10 KB for full bytes) so the same memory budget gives ~1000x the hit surface. Capped at 1M entries (geth core/state/database_code.go uses the same size).

EXTCODESIZE / EXTCODEHASH callers, historically the slowest opcodes on the lab dashboard's bench, answer from a single map probe without paying the file accessor stack cost of the full bytes. Geth-proven; cross-client writeup at agentspecs/cross-client-state-access-2026-05-14.md notes this as the largest single available win for the synthetic bench.

Wiring:
- CodeCache.GetCodeSizeByEthHash / PutCodeSizeByEthHash: direct accessors.
- PutWithEthHash now populates the size layer alongside L2b, so every bytes-load creates a future fast-path entry "for free".
- StateCache wrappers GetCodeSizeByHash / PutCodeSizeByHash.
- SharedDomains.GetCodeSize(tx, addr): the SD-transparent fast path. Resolve codeHash via the AccountsDomain cache chain, probe the size cache, then L2b, then file-read+populate. Returns (0, false, nil) for EOAs and no-code accounts without paying any file read.
- temporalGetter.GetCodeSize so callers reach it via the existing getter.
- ReaderV3.ReadAccountCodeSize type-asserts on a codeSizeGetter interface and routes through the fast path when the underlying getter supports it; falls back to GetLatest+len otherwise. No kv.TemporalGetter interface change.

Limitation: capacity is no-op-when-full, not LRU. A separate surgical commit will swap to real LRU eviction; mirrors the addrToHash fix queued from the same cross-client writeup.

Tests: 3 new (PopulatedAlongsideBytes, DirectPutAndGet, EmptyHashOrNegativeIsNoOp). All existing CodeCache tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e.StateCache

The BlockReadAheader has always prefetched BAL-listed (and access-list) addresses' account/code/storage via a fresh ReaderV3 on a separate RoTx. Its prefetches warmed OS page cache + RoTx cursors, disconnected from the process-global cache.StateCache that SharedDomains.GetLatest probes on the EVM hot path. The two layers were two separate caches; nothing the prefetcher loaded ever reached the EVM's lookup path.

Reth's structural advantage on EXTCODESIZE-loop benches is that its prewarm writes to the same hashmap the EVM reads from (crates/engine/execution-cache/src/cached_state.rs:663). When EVM enters, every BAL-listed addr's first read is a 20 ns cache probe: no file accessor stack, no decompression CPU. PR #21128 swapped this from mini-moka to a lock-free fixed-cache for a measured +10.8 % mgas/s.

This commit closes the equivalent gap on Erigon: a thin cache-populating TemporalGetter wrapper writes successful reads through to cache.StateCache as a side effect. ReaderV3 is unchanged; the wrapper sits in front. When the prefetcher already has the codeHash from a preceding account read, the next CodeDomain read routes through StateCache.PutCodeWithHash so the L2b (ethHash → bytes) + size-cache layers fill alongside the bare addr-keyed L1.

Wiring:
- BlockReadAheader.SetStateCache(*cache.StateCache) setter.
- ExecModule construction calls readAheader.SetStateCache(domainCache), so the same StateCache the FCU/canonical path wires onto SD is the one the prefetcher warms.
- cachePopulatingGetter wraps the worker's ttx; both BAL-warming and tx-warming paths gain the same treatment.

Fgprof on the EXTCODESIZE-EXISTING_CONTRACT-30M bench had shown 95 % of EVM wall-clock in seg.Getter.nextPos (Huffman decompression of code values). With this commit, every BAL-listed addr's lookup should hit the cache and skip the file accessor stack entirely, eliminating the dominant cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ash LRU

Two surgical commits bundled (both touch the code-read hot path):

1. IntraBlockState.GetCodeSize now loads the full bytes via stateReader.ReadAccountCode on first touch and populates stateObject.code, so subsequent same-addr EXTCODESIZE / EXTCODEHASH / CALL within the tx are in-struct slice-len calls (~50 ns), not full reader round-trips. Mirrors geth's pattern at core/state/state_object.go ~Code(): pay one read per addr per tx, free for the rest.

2. CodeCache.addrToHash switched from a no-op-when-full maphash.Map[versionedAddressID] to an LRU lru.Cache[[20]byte, versionedAddressID] (hashicorp/golang-lru/v2, already imported elsewhere). Cap derived from the existing byte budget at ~28 bytes/entry (~580 k entries for the 16 MB default). Fresh-address workloads (mainnet thousands of new addrs per block) now warm up the addr layer over time instead of silently dropping new entries forever; matches geth's lru.Cache at core/state/database_code.go.

The hashToCode layer is unchanged (content-addressed bytes, immutable, byte-capped with new-entry no-op when full; the same semantic as before since code bytes by codeHash never change).

Bench on the EXTCODESIZE-EXISTING_CONTRACT-30M family: 62.34 mgas/s (was 61.50). The marginal gain is small on this bench because BAL prefetch already populates the cache layers; neither lever fires heavily. The expected wins are on non-BAL workloads where EXTCODESIZE-loop patterns repeat within a tx (#1) and fresh-address-churn mainnet blocks fill the addr layer (#2).

Updated TestCodeCache_AddrCapacityLimit to assert LRU eviction (was asserting no-op-when-full); the prior behaviour was the bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
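To make the behavioural difference between no-op-when-full and LRU concrete, here is a small stdlib-only LRU sketch. It is a stand-in written for illustration, not the hashicorp/golang-lru/v2 implementation the commit uses; the type and method names are hypothetical.

```go
package main

import (
	"container/list"
	"fmt"
)

// pair is the payload stored in each list element.
type pair[K comparable, V any] struct {
	key K
	val V
}

// lru is a minimal least-recently-used cache: the list front is the most
// recently used entry, the back is the eviction candidate.
type lru[K comparable, V any] struct {
	limit int
	order *list.List
	items map[K]*list.Element
}

func newLRU[K comparable, V any](limit int) *lru[K, V] {
	return &lru[K, V]{limit: limit, order: list.New(), items: make(map[K]*list.Element)}
}

func (c *lru[K, V]) Get(k K) (V, bool) {
	if e, ok := c.items[k]; ok {
		c.order.MoveToFront(e) // a hit refreshes recency
		return e.Value.(pair[K, V]).val, true
	}
	var zero V
	return zero, false
}

func (c *lru[K, V]) Put(k K, v V) {
	if c.limit <= 0 {
		return
	}
	if e, ok := c.items[k]; ok {
		e.Value = pair[K, V]{key: k, val: v}
		c.order.MoveToFront(e)
		return
	}
	if c.order.Len() >= c.limit {
		// Unlike no-op-when-full, a new key displaces the oldest entry.
		oldest := c.order.Back()
		delete(c.items, oldest.Value.(pair[K, V]).key)
		c.order.Remove(oldest)
	}
	c.items[k] = c.order.PushFront(pair[K, V]{key: k, val: v})
}

func main() {
	c := newLRU[[20]byte, uint64](2)
	a, b, d := [20]byte{1}, [20]byte{2}, [20]byte{3}
	c.Put(a, 10)
	c.Put(b, 20)
	c.Get(a)     // a becomes most recently used
	c.Put(d, 30) // evicts b, the least recently used
	_, okA := c.Get(a)
	_, okB := c.Get(b)
	_, okD := c.Get(d)
	fmt.Println(okA, okB, okD)
}
```

With a fill-and-freeze map, the third `Put` would have been dropped and fresh addresses would never enter the layer; with LRU the stale entry leaves instead.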
Nethermind-style addr → 32-byte codeHash LRU sitting above SharedDomains.codeHashForAddr. When the EVM-known codeHash for an address has already been resolved once, subsequent lookups skip the entire AccountsDomain chain (sd.mem → sd.parent.mem → sd.stateCache → tx.GetLatest) and the account-RLP decode.

Wiring:
- CodeCache adds addrToEthHash *lru.Cache[[20]byte, [32]byte] sized to the existing addrCapacityB budget; methods GetAddrCodeHash / PutAddrCodeHash / DeleteAddrCodeHash.
- StateCache wrappers route to the CodeCache instance.
- SD.codeHashForAddr probes the LRU first; on miss falls through to the existing chain and populates on the way out (including the zero-hash sentinel for missing-or-EOA accounts, so repeat lookups return immediately).
- Invalidation: SD.DomainPut for AccountsDomain drops the entry (CREATE / CREATE2-replace path); SD.DomainDel for AccountsDomain also drops the entry (SELFDESTRUCT); StateCache.RevertWithDiffset drops on unwind.

Helps non-BAL workloads where codeHashForAddr is currently the cold account-domain probe. On the EXISTING_CONTRACT bench (BAL prefetch already populates everything), this is within noise; the lever is for mainnet workloads where many addresses miss the BAL-prefetch window but the cache is warm from prior lookups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
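The probe, populate and invalidate cycle above might look like the following sketch. It uses a plain map instead of an LRU to stay short, and every name (`hashCache`, `codeHashForAddr`, `invalidate`, `slow`) is hypothetical; the zero-hash sentinel for missing-or-EOA accounts follows the commit message.

```go
package main

import "fmt"

type (
	Address  [20]byte
	CodeHash [32]byte
)

// hashCache memoises addr -> codeHash. A plain map stands in for the LRU.
type hashCache struct {
	m map[Address]CodeHash
}

func newHashCache() *hashCache { return &hashCache{m: make(map[Address]CodeHash)} }

// codeHashForAddr probes the cache first. On a miss it runs the slow chain
// (an assumed callback) and stores the result, using the zero hash as the
// sentinel for missing or code-less accounts so repeat lookups are cheap.
func (c *hashCache) codeHashForAddr(addr Address, slow func(Address) (CodeHash, bool)) CodeHash {
	if h, ok := c.m[addr]; ok {
		return h
	}
	h, found := slow(addr)
	if !found {
		h = CodeHash{}
	}
	c.m[addr] = h
	return h
}

// invalidate must be called whenever the account record is written, deleted,
// or unwound; otherwise a stale hash would be served.
func (c *hashCache) invalidate(addr Address) { delete(c.m, addr) }

func main() {
	c := newHashCache()
	calls := 0
	slow := func(Address) (CodeHash, bool) { calls++; return CodeHash{}, false }
	addr := Address{0x01}
	c.codeHashForAddr(addr, slow)
	c.codeHashForAddr(addr, slow) // served from the sentinel entry
	c.invalidate(addr)            // e.g. a CREATE lands on this address
	c.codeHashForAddr(addr, slow) // must walk the slow chain again
	fmt.Println("slow-chain calls:", calls)
}
```

The design risk with any such memo is a missed invalidation point, which is why the commit lists all three write paths explicitly.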
The cache-populating wrapper on the read-ahead worker's TemporalTx
previously gated cache writes on `len(v) > 0`. That dropped negative
results — i.e. missing accounts, empty storage slots, no-code probes —
on the floor. Repeated probes of the same missing address re-paid the
file accessor stack walk every time, instead of hitting a cached
negative entry.
Mirrors the revm pattern that drives reth's 1700-3400 mgas/s on
account_access NON_EXISTING / EXISTING_EOA variants: revm represents
a missing address as a real CacheAccount{ account: None, status:
LoadedNotExisting } and reth's ExecutionCache.account_cache uses
FixedCache<Address, Option<Account>> where None is a first-class
cacheable value. Bottom of the reth path is: BAL prewarm calls
basic_account once → returns None → cache hit forever for that addr.
The cycle-2 sweep on account_access[EXTCODESIZE/NON_EXISTING/30M]
showed 3.65 → 494 mgas/s without this fix; with the fix the same
bench reports 508 mgas/s (within run-to-run noise but trending right).
Most of the win was already captured by the readAhead-populates-
cache.StateCache wiring (commit cbe9044) and the balcache port
(d41e2e8) — those raised the cache hit rate on populated entries
enough that the EVM rarely fell through to the file accessor on
this bench. The fix is mechanically correct regardless and should
matter more on workloads with mixed populated / negative probes
across blocks.
See agentspecs/reth-missing-eoa-fastpath-2026-05-15.md for the
detailed mechanism analysis and the three concrete copy-able
patterns from reth.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
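The effect of removing the `len(v) > 0` gate can be illustrated with a simplified write-through wrapper. The `getter` interface here is a deliberately reduced, hypothetical one (a single string key, no domain or step), not the real kv.TemporalGetter signature, and `stateCache` is a plain map.

```go
package main

import "fmt"

// getter is a reduced, hypothetical read interface used only for this sketch.
type getter interface {
	GetLatest(key string) ([]byte, error)
}

type stateCache map[string][]byte

// cachePopulatingGetter writes every successful read through to the cache,
// including empty results, so a missing key becomes a cached negative entry.
type cachePopulatingGetter struct {
	inner getter
	cache stateCache
}

func (g *cachePopulatingGetter) GetLatest(key string) ([]byte, error) {
	v, err := g.inner.GetLatest(key)
	if err != nil {
		return nil, err
	}
	// The earlier behaviour only cached when len(v) > 0, which dropped
	// negative results and forced every repeat probe back to the backend.
	g.cache[key] = v
	return v, nil
}

// countingBackend simulates the expensive file accessor stack.
type countingBackend struct {
	data  map[string][]byte
	reads int
}

func (b *countingBackend) GetLatest(key string) ([]byte, error) {
	b.reads++
	return b.data[key], nil
}

// cachedRead is the hot-path lookup: presence in the map, not value length,
// decides whether the backend is consulted.
func cachedRead(cache stateCache, fallback getter, key string) ([]byte, error) {
	if v, ok := cache[key]; ok {
		return v, nil
	}
	return fallback.GetLatest(key)
}

func main() {
	backend := &countingBackend{data: map[string][]byte{}}
	cache := stateCache{}
	prefetch := &cachePopulatingGetter{inner: backend, cache: cache}

	prefetch.GetLatest("missing-account") // prewarm records the negative result
	cachedRead(cache, backend, "missing-account")
	cachedRead(cache, backend, "missing-account")
	fmt.Println("backend reads:", backend.reads)
}
```

The key detail is the two-value map lookup in `cachedRead`: a present-but-empty entry is a hit, which is the Go analogue of caching `None` as a first-class value.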
…unters
GenericCache.Put has no eviction policy. When the byte budget is reached,
new keys are silently dropped until Clear/ClearWithHash/ValidateAndPrepare-
mismatch resets the cache. On a long-running node this manifests as a
monotonic miss-rate climb that's hard to attribute without instrumentation.
Add two counters next to hits/misses:
inserts - new keys accepted
dropped - new keys rejected at the budget check (the existing branch
at the new-key cap; not a behaviour change)
PrintStatsAndReset logs both. Sets up the diagnostic baseline before the
eviction-policy swap in the follow-up commits on this branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
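A minimal sketch of where the two counters sit relative to the budget check. The struct and field names are hypothetical, and plain integers are used for brevity where a concurrent cache would presumably need atomic counters.

```go
package main

import "fmt"

// budgetCache is a fill-and-freeze cache: once the byte budget is reached,
// new keys are rejected. The counters make that cliff observable.
type budgetCache struct {
	capBytes int
	curBytes int
	data     map[string][]byte
	inserts  uint64 // new keys accepted
	dropped  uint64 // new keys rejected at the budget check
}

func newBudgetCache(capBytes int) *budgetCache {
	return &budgetCache{capBytes: capBytes, data: make(map[string][]byte)}
}

func (c *budgetCache) Put(key string, val []byte) {
	if old, ok := c.data[key]; ok {
		// Updates to existing keys are neither inserts nor drops.
		c.curBytes += len(val) - len(old)
		c.data[key] = val
		return
	}
	if c.curBytes+len(val) > c.capBytes {
		c.dropped++ // same branch as before, now counted
		return
	}
	c.data[key] = val
	c.curBytes += len(val)
	c.inserts++
}

func main() {
	c := newBudgetCache(8)
	c.Put("a", make([]byte, 5))
	c.Put("b", make([]byte, 5)) // would exceed the 8-byte budget
	c.Put("c", make([]byte, 3))
	fmt.Printf("inserts=%d dropped=%d size=%d\n", c.inserts, c.dropped, c.curBytes)
}
```

A `dropped` value that keeps rising while `inserts` stays flat is the signature of the frozen cache the commit message describes.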
Replaces the maphash.Map[T] backing store in GenericCache with
freelru.ShardedLRU[uint64, entry[T]] (same lib as db/state/cache.go;
already in go.mod). Adds a Mode constructor flag:
- ModeEvictLRU (default): per-shard LRU evicts the oldest entry on
insert when its slot cap is reached. OnEvict drops bytes from
currentSize.
- ModeNoOp: preserves the historical fill-and-freeze behaviour
(silently drop new keys at the byte cap; counted via dropped).
Kept as the diagnostic baseline so the regression bench can
compare A/B.
Per-shard eviction is a known trade-off of freelru.ShardedLRU —
RemoveOldest is shard-local, not globally LRU. Matches the trade-off
db/state/cache.go / execution/cache/code_cache.go /
execution/balcache/balcache.go already accept. LFU (W-TinyLFU, the
policy reth uses) is scan-resistant by design and would slot in
behind the same Mode wrapper as a follow-up; the seam is documented
at policy.go.
Key shape: pre-hash via common/maphash.Hash (Go's randomized stdlib
hasher, already used by the previous maphash.Map) into uint64; entry
stores the full key for collision check. Same pattern as
db/state/cache.go.
Byte-budget translation: per-domain avg-entry constants in
state_cache.go (avgAccountEntryBytes / avgStorageEntryBytes /
avgCommitmentEntryBytes) — account / storage are near-fixed sizes so
the translation is reliable. capacityBytes becomes a sizing hint
plus telemetry (SizeBytes / PrintStatsAndReset). Code domain is
unchanged; CodeCache wraps its own LRUs.
Adds metrics: inserts, evictions, dropped — all exposed in
PrintStatsAndReset alongside the existing hits / misses / hit_rate.
Mode is also logged.
Touches one external call site: execution/vm/contract.go's
jumpDestCache now constructs with ModeEvictLRU.
Tests: TestDomainCache_PutCapacityLimit renamed to ..._NoOpMode and
asserts the fill-and-freeze contract under explicit ModeNoOp. New
TestDomainCache_PutEvictsWhenFull_EvictMode asserts eviction under
ModeEvictLRU using a small entry-count cap (the byte→entry
translation is approximate; the test uses the entry-count knob via
the in-package newGenericCacheEntries constructor to make the
assertion deterministic).
Pre-existing lint issues on mh/sd-code-cache (intra_block_state.go
nilness, preload_parallel.go prealloc) are surfaced by lint
non-determinism but are out of this commit's scope.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
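Two of the mechanisms in this commit, the pre-hashed key with a stored-key collision check and the byte-budget to entry-count translation, can be sketched with the standard library alone. This uses `hash/maphash` from the Go stdlib rather than Erigon's `common/maphash`, a plain map rather than `freelru.ShardedLRU`, and hypothetical names throughout; the hash function is injectable only so a collision can be forced in a test.

```go
package main

import (
	"bytes"
	"fmt"
	"hash/maphash"
)

// entry keeps the full key next to the value so that two keys hashing to the
// same uint64 are never confused with each other.
type entry struct {
	key []byte
	val []byte
}

type hashedCache struct {
	hash  func([]byte) uint64
	items map[uint64]entry
}

func newHashedCache() *hashedCache {
	seed := maphash.MakeSeed() // randomized per process
	return &hashedCache{
		hash:  func(k []byte) uint64 { return maphash.Bytes(seed, k) },
		items: make(map[uint64]entry),
	}
}

func (c *hashedCache) Put(key, val []byte) {
	// Copy the key so later mutation by the caller cannot corrupt the check.
	c.items[c.hash(key)] = entry{key: append([]byte(nil), key...), val: val}
}

func (c *hashedCache) Get(key []byte) ([]byte, bool) {
	e, ok := c.items[c.hash(key)]
	if !ok || !bytes.Equal(e.key, key) {
		return nil, false // empty slot, or a different key with the same hash
	}
	return e.val, true
}

// entryCap turns a byte budget into an entry count using an average entry
// size. The result is approximate, which is why capacityBytes is only a hint.
func entryCap(capacityBytes, avgEntryBytes int) int {
	if avgEntryBytes <= 0 {
		return 0
	}
	n := capacityBytes / avgEntryBytes
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	c := newHashedCache()
	c.Put([]byte("account-1"), []byte{0x01})
	_, hit := c.Get([]byte("account-1"))
	_, miss := c.Get([]byte("account-2"))
	fmt.Println(hit, miss, entryCap(1<<20, 100))
}
```

On a hash collision this sketch simply overwrites the slot and reports a miss for the displaced key, so a collision costs a cache miss and never returns wrong data.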
Single env knob read once at NewStateCache. Default ModeEvictLRU, recognised override "noop" (for the regression-bench baseline so ModeEvictLRU and ModeNoOp can be compared on the same binary). Unrecognised values fall back to evict with a warn log. ModeNoOp engagement is logged at info level because the fill-and-freeze behaviour is a deliberate diagnostic state, not a production setting.

Pattern matches db/state/cache.go's D_LRU_ENABLED / D_LRU knobs (dbg.EnvString from common/dbg).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
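The parsing rules above could be expressed roughly like this. The sketch reads the variable with `os.Getenv` instead of `dbg.EnvString`, prints instead of using the node's logger, and assumes that matching is case-insensitive and that the empty string and "noop" are the only recognised values; none of those details are confirmed by the commit message.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

type Mode int

const (
	ModeEvictLRU Mode = iota // default
	ModeNoOp                 // fill-and-freeze diagnostic baseline
)

func (m Mode) String() string {
	if m == ModeNoOp {
		return "noop"
	}
	return "evict-lru"
}

// parseMode maps the raw env value to a Mode. The second result reports
// whether the value was recognised; unrecognised values fall back to evict.
func parseMode(raw string) (Mode, bool) {
	switch strings.ToLower(strings.TrimSpace(raw)) {
	case "":
		return ModeEvictLRU, true
	case "noop":
		return ModeNoOp, true
	default:
		return ModeEvictLRU, false
	}
}

func main() {
	raw := os.Getenv("STATE_CACHE_MODE")
	mode, recognised := parseMode(raw)
	if !recognised {
		fmt.Fprintf(os.Stderr, "warn: unrecognised STATE_CACHE_MODE %q, using %s\n", raw, mode)
	}
	if mode == ModeNoOp {
		fmt.Println("info: state cache running in diagnostic noop mode")
	}
	fmt.Println("state cache mode:", mode)
}
```

Keeping the parse step a pure function of the string makes the fallback behaviour easy to unit-test without touching the process environment.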
The previous comment asserted "reth uses W-TinyLFU for state caches", which is wrong on the execution hot path. Reth's cross-block state cache is `fixed-cache` (PR #21128, v1.11.0): a lock-free direct-mapped / set-associative array with collision-evict semantics. No LRU list, no LFU sketch. Their published wins (~25% newPayload p50 / +33% gas/s) came from *removing* LRU/LFU bookkeeping, not adding LFU.

Where reth uses real LRU/LFU it's deliberate and not the execution cache (schnellru::LruMap for networking; moka in precompile_cache.rs explicitly configured with eviction_policy(EvictionPolicy::lru())).

The docstring now reflects two follow-up policies, both real:
- ModeEvictFixedCache (reth's actual choice, more interesting structural option than LFU)
- ModeEvictLFU (W-TinyLFU; helps mainnet steady-state, not the cycle-2 bloat fixtures which are pure cold scans)

Decision criterion (per agentspecs/lfu-vs-lru-state-cache-decision-2026-05-15.md): ship ModeEvictLFU only if a 24h mainnet replay shows current sharded-LRU hit-rate < 90 % on Account/Storage. Otter is the only credible Go W-TinyLFU library; ristretto has documented correctness bugs and is disqualified for an EL hot path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
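For readers unfamiliar with collision-evict semantics, a direct-mapped cache can be sketched in a few lines. This is a single-threaded illustration with hypothetical names; it is not lock-free, not set-associative, and not a port of reth's `fixed-cache`.

```go
package main

import "fmt"

// slot holds one entry. The full key is kept so a lookup can tell whether
// the slot currently belongs to the key being asked for.
type slot struct {
	key  uint64
	val  int
	used bool
}

// fixedCache is direct-mapped: each key maps to exactly one slot, and a new
// key simply overwrites whatever was there. There is no recency list and no
// frequency sketch to maintain.
type fixedCache struct {
	slots []slot
	mask  uint64
}

// newFixedCache allocates 2^bits slots so the index is a cheap bit mask.
func newFixedCache(bits uint) *fixedCache {
	n := uint64(1) << bits
	return &fixedCache{slots: make([]slot, n), mask: n - 1}
}

func (c *fixedCache) Put(key uint64, val int) {
	c.slots[key&c.mask] = slot{key: key, val: val, used: true}
}

func (c *fixedCache) Get(key uint64) (int, bool) {
	s := c.slots[key&c.mask]
	if s.used && s.key == key {
		return s.val, true
	}
	return 0, false
}

func main() {
	c := newFixedCache(2) // 4 slots
	c.Put(1, 100)
	c.Put(5, 500) // 5 & 3 == 1: collides with key 1 and evicts it
	_, ok1 := c.Get(1)
	v5, ok5 := c.Get(5)
	fmt.Println(ok1, v5, ok5)
}
```

The trade-off is visible in `Put`: eviction is decided by the index alone, so bookkeeping cost is zero but a hot entry can be displaced by an unlucky collision.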
Investigation knob, NOT a permanent default. Account / Storage / Code each capped at 100 MB so the bench measures layer contributions instead of being dominated by preallocated cache memory pressure (1 GB / 1 GB / 512 MB defaults push sys past the GC/page-cache pressure band on this hardware/workload mix).

Permanent defaults stay at 1 GB / 1 GB / 512 MB; this commit will be reverted or dynamically gated by relative-to-available sizing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
266e297 to 4a512ce
This PR ships the parallel-exec correctness fixes from `mh/parallel-exec-fixes` onto the perf stack, packaged as a focused PR on top of [#21386 (StateCache LRU)](#21386) which itself stacks on [#21380 (State Cache Consolidation)](#21380).

> [!IMPORTANT]
> **Stacks on #21386 → #21380.** Base is `mh/perf-statecache-lru-pr`, NOT `main`. Merge order: #21380 → #21386 → this PR.

> [!IMPORTANT]
> **Do not merge until CI is green on both parallel and serial.** Same gating rule as #21380 / #21386.

## Scope: 13 commits from `mh/parallel-exec-fixes`

Brought in via a merge commit so the bisection trail is preserved.

| sha | what it fixes |
|---|---|
| `25053e38e9` | parallel SD-of-pre-existing-contract (the 197-line foundational fix) |
| `2e2bf3ccc0` | clean exit when single-block batch already covered maxBlockNum |
| `6e451f5ed2` | don't emit StoragePath=0 writes from IBS.Selfdestruct |
| `616a4fa0a8` | clear calc Deleted on a non-SD account write even when zero |
| `d99f2f704d` | gate known parallel-exec failures behind EXEC3_PARALLEL (#21136) |
| `34e83e82b7` | install per-block changeset accumulator before any of the block's writes |
| `b340d7e592` | drop stale sd.mem 'Trim old version entries' comment |
| `629cc23566` | O(1) CollectorWrites fee-balance update, drop dead VersionedWrites.SetBalance |
| `a0ecfc7e12` | first-match-wins in CollectorWrites BalancePath index |
| `445f97e446` | emit EIP-7708 Burn log under parallel-exec when coinbase self-destructs |
| `5e1f5fa901` | mirror ReadAccountData SD-revival check into versionedRead |
| `a5dc83f509` | drop two stale EXEC3_PARALLEL t.Skips |
| `8af901104f` | drop TestReceiptHashFromRPC unit-suite RPC integration test |

## Merge conflicts resolved

3 files, 8 regions, all resolved by keeping HEAD's typed-readset / per-path revival shape and confirming HEAD already absorbs each fix's intent. See the merge commit message (`cfc4ec1418`) for the per-region rationale.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Mark Holt <erigon@dev-bm-e3-ethmainnet-n4.erigon.io>
…statecache-lru-pr
…govet) stateObject and s are both verified non-nil earlier in their respective scopes; the secondary checks at lines 749 and 783 are redundant. govet nilness check fails on these.
This PR ships the `execution/cache` LRU/Mode rework + the StateCache population commits as a follow-on to PR #21380 (State Cache Consolidation). The LRU/Mode rework was always meant to ship separately so the policy change can be reviewed independently of #21380's BranchCache work.

> [!IMPORTANT]
> Stacks on #21380. Base is `mh/perf-caches-pr`, NOT `main`. Merge order: #21380 → this PR.

> [!IMPORTANT]
> Do not merge until CI is green on both parallel and serial.

## Scope: 11 commits cherry-picked from `mh/all-stack`

- `cb4443bf51` `fba4ce8999` execution/cache, db/state/execctx: SD-transparent ethHash bypass for CodeDomain
- `d75ec41fcd` `7d0998d0db` execution/cache, db/state, execution/state: codeSizeCache for EXTCODESIZE / EXTCODEHASH
- `77cf879d9a` `cbe9044e52` execution/exec, execution/execmodule: BlockReadAheader populates cache.StateCache
- `67297a5dfe` `f2d4c3df74` execution/state, execution/cache: stateObject.code populate + addrToHash LRU
- `cca736e34d` `7c3e054063` execution/cache, db/state/execctx: addr → codeHash LRU above SD
- `2a21a81608` `c8f10544c0` execution/exec: cachePopulatingGetter caches negative results
- `2eea7d2c61` `d01a345062` execution/cache: surface fill-and-freeze cliff via inserts/dropped counters
- `576c5ade3e` `8052c84831` execution/cache: replace GenericCache map with sharded LRU + Mode
- `8e239f3518` `6b785d4360` execution/cache: STATE_CACHE_MODE env override at NewStateCache time
- `ad9f74c897` `c55128565a` execution/cache: correct the LFU rationale in Mode docstring
- `266e2979bd` `f80655f6d2` execution/cache: reduce default cache caps to 100 MB each (bench knob)

## One commit deferred

The 12th commit on the original handoff list, `66bcc44702` (BAL-driven BlockStateCache prewarm), has been dropped from this PR because it depends on the `execution/balcache` package, which is introduced by PR-A (eth/71 BAL wire protocol) off `main`. It will be reintroduced as a small follow-up PR once both this PR and PR-A have merged.

🤖 Generated with Claude Code