
StateCache LRU + Mode rework (PR #2 of the perf stack) #21386

Open
mh0lt wants to merge 14 commits into mh/perf-caches-pr from mh/perf-statecache-lru-pr

Conversation

@mh0lt
Contributor

@mh0lt mh0lt commented May 24, 2026

This PR ships the execution/cache LRU/Mode rework + the StateCache population commits as a follow-on to PR #21380 (State Cache Consolidation). The LRU/Mode rework was always meant to ship separately so the policy change can be reviewed independently of #21380's BranchCache work.

Important

Stacks on #21380. Base is mh/perf-caches-pr, NOT main. Merge order: #21380 → this PR.

Important

Do not merge until CI is green on both parallel and serial.

Scope — 11 commits cherry-picked from mh/all-stack

| sha (rebased) | source | subject |
|---|---|---|
| `cb4443bf51` | `fba4ce8999` | execution/cache, db/state/execctx: SD-transparent ethHash bypass for CodeDomain |
| `d75ec41fcd` | `7d0998d0db` | execution/cache, db/state, execution/state: codeSizeCache for EXTCODESIZE / EXTCODEHASH |
| `77cf879d9a` | `cbe9044e52` | execution/exec, execution/execmodule: BlockReadAheader populates cache.StateCache |
| `67297a5dfe` | `f2d4c3df74` | execution/state, execution/cache: stateObject.code populate + addrToHash LRU |
| `cca736e34d` | `7c3e054063` | execution/cache, db/state/execctx: addr → codeHash LRU above SD |
| `2a21a81608` | `c8f10544c0` | execution/exec: cachePopulatingGetter caches negative results |
| `2eea7d2c61` | `d01a345062` | execution/cache: surface fill-and-freeze cliff via inserts/dropped counters |
| `576c5ade3e` | `8052c84831` | execution/cache: replace GenericCache map with sharded LRU + Mode |
| `8e239f3518` | `6b785d4360` | execution/cache: STATE_CACHE_MODE env override at NewStateCache time |
| `ad9f74c897` | `c55128565a` | execution/cache: correct the LFU rationale in Mode docstring |
| `266e2979bd` | `f80655f6d2` | execution/cache: reduce default cache caps to 100 MB each (bench knob) |

One commit deferred

The 12th commit on the original handoff list — 66bcc44702 (BAL-driven BlockStateCache prewarm) — has been dropped from this PR because it depends on the execution/balcache package, which is introduced by PR-A (eth/71 BAL wire protocol) off main. It will be reintroduced as a small follow-up PR once both this PR and PR-A have merged.

🤖 Generated with Claude Code

Mark Holt and others added 11 commits May 25, 2026 07:28
…CodeDomain

Adds a third map (`ethHashToCode`) to CodeCache, keyed by the 32-byte
Ethereum codeHash (keccak256). New methods `GetByEthHash` and
`PutWithEthHash` expose direct L2b access without going through the
addr→maphash→code two-level path. The byte storage duplicates L2 in the
worst case (2x code-bytes memory at the cap); accepted for the per-key
fast path on many-addrs-one-code workloads.

`SharedDomains.GetLatest(CodeDomain, ...)` consults L2b transparently:
when the addr-keyed cache misses, resolve the codeHash from the
AccountsDomain (typically warm because the EVM just loaded the account),
probe `stateCache.GetCodeByHash` before falling through to the file
accessor stack. On miss, fill both L1 and L2b via PutCodeWithHash. The
fast path is unchanged.

Workload shape this targets: many addresses sharing one codeHash
(proxies, factory-deployed clones, ERC-20 holders, OpenZeppelin
templates). Today's addr-keyed cache misses on every fresh address even
when the bytecode is already cached. With this change a single L2b
entry serves N addresses after the first population.

Microbench results:
- BenchmarkCodeCache_GetByEthHash_Hit:       17.01 ns/op
- BenchmarkCodeCache_GetByEthHash_Miss:      15.45 ns/op
- BenchmarkCodeCache_Get_AddrLevel_Hit:      32.44 ns/op (existing)
- BenchmarkCodeCache_GetByEthHash_ManyAddrs: 17.02 ns/op

L2b hit is ~2x faster than the existing two-level addr path (one map
probe vs two), and enables hits on workloads where L1 would miss.
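
The many-addrs-one-code case can be illustrated with a minimal sketch. This is not the actual CodeCache: it collapses the addr→maphash→code path into a single `addrToHash` map, ignores capacity limits, and every type and field name other than `GetByEthHash` / `PutWithEthHash` is an assumption made for illustration.

```go
package main

import "fmt"

// codeCache is a simplified, hypothetical stand-in for CodeCache:
// an addr-keyed level plus a layer keyed directly by the 32-byte codeHash.
type codeCache struct {
	addrToHash    map[[20]byte][32]byte // addr -> codeHash
	ethHashToCode map[[32]byte][]byte   // codeHash -> bytecode (the "L2b" layer)
}

func newCodeCache() *codeCache {
	return &codeCache{
		addrToHash:    make(map[[20]byte][32]byte),
		ethHashToCode: make(map[[32]byte][]byte),
	}
}

// PutWithEthHash links addr to its codeHash and stores the bytes once per hash.
func (c *codeCache) PutWithEthHash(addr [20]byte, hash [32]byte, code []byte) {
	c.addrToHash[addr] = hash
	if _, ok := c.ethHashToCode[hash]; !ok {
		c.ethHashToCode[hash] = code
	}
}

// Get is the addr-keyed path: two map probes, and a miss for unseen addresses.
func (c *codeCache) Get(addr [20]byte) ([]byte, bool) {
	h, ok := c.addrToHash[addr]
	if !ok {
		return nil, false
	}
	code, ok := c.ethHashToCode[h]
	return code, ok
}

// GetByEthHash is the direct path: one probe, independent of the address.
func (c *codeCache) GetByEthHash(hash [32]byte) ([]byte, bool) {
	code, ok := c.ethHashToCode[hash]
	return code, ok
}

func main() {
	c := newCodeCache()
	proxyA, proxyB := [20]byte{1}, [20]byte{2}
	hash := [32]byte{0xaa} // both proxies share this codeHash

	c.PutWithEthHash(proxyA, hash, []byte{0x60, 0x00})

	_, addrHit := c.Get(proxyB)        // proxyB has never been seen
	_, hashHit := c.GetByEthHash(hash) // but its bytecode is already cached
	fmt.Println("addr-keyed hit for fresh address:", addrHit)
	fmt.Println("hash-keyed hit for fresh address:", hashHit)
}
```

In the sketch the caller is assumed to already know the codeHash for `proxyB`, which corresponds to the "resolve the codeHash from the AccountsDomain" step described above.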

Cross-client research at agentspecs/cross-client-state-access-2026-05-14.md
notes geth's separate codeSizeCache as the further (geth-proven) win
for EXTCODESIZE/EXTCODEHASH and addrToHash LRU as a one-line behaviour
fix; both queued as follow-up surgical commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…SIZE / EXTCODEHASH

Adds a third caching layer to CodeCache (alongside L1 addr→maphash and L2b
ethHash→bytes): codeSizeByEthHash maps the 32-byte Ethereum codeHash to
its byte length. Tiny per-entry footprint (32B key + 8B value vs 5-10 KB
for full bytes) so the same memory budget gives ~1000x the hit surface.
Capped at 1M entries (geth core/state/database_code.go uses the same size).

EXTCODESIZE / EXTCODEHASH callers — historically the slowest opcodes on
the lab dashboard's bench — answer from a single map probe without paying
the file accessor stack cost of the full bytes. Geth-proven; cross-client
writeup at agentspecs/cross-client-state-access-2026-05-14.md notes this
as the largest single available win for the synthetic bench.

Wiring:
- CodeCache.GetCodeSizeByEthHash / PutCodeSizeByEthHash — direct accessors.
- PutWithEthHash now populates the size layer alongside L2b, so every
  bytes-load creates a future fast-path entry "for free".
- StateCache wrappers GetCodeSizeByHash / PutCodeSizeByHash.
- SharedDomains.GetCodeSize(tx, addr) — the SD-transparent fast path:
  resolve codeHash via the AccountsDomain cache chain, probe the size
  cache, then L2b, then file-read+populate. Returns (0, false, nil) for
  EOAs and no-code accounts without paying any file read.
- temporalGetter.GetCodeSize so callers reach it via the existing getter.
- ReaderV3.ReadAccountCodeSize type-asserts on a codeSizeGetter interface
  and routes through the fast path when the underlying getter supports it;
  falls back to GetLatest+len otherwise. No kv.TemporalGetter interface
  change.
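
The "type-assert on an optional interface, fall back otherwise" wiring from the last bullet might look roughly like the sketch below. The `getter` interface, both getter types, and the simplified signatures (no `tx` argument) are assumptions for illustration; only the `codeSizeGetter` name and the `(size, ok, err)` return shape come from the description above.

```go
package main

import "fmt"

// getter is a hypothetical stand-in for the existing getter interface.
type getter interface {
	GetLatest(addr [20]byte) ([]byte, error)
}

// codeSizeGetter is the optional fast-path interface. Getters that do not
// implement it keep working through the fallback below.
type codeSizeGetter interface {
	GetCodeSize(addr [20]byte) (size int, ok bool, err error)
}

// slowGetter only knows how to return full bytes.
type slowGetter struct{ code map[[20]byte][]byte }

func (g slowGetter) GetLatest(addr [20]byte) ([]byte, error) {
	return g.code[addr], nil
}

// fastGetter additionally answers size queries from a size cache.
type fastGetter struct {
	slowGetter
	sizes map[[20]byte]int
}

func (g fastGetter) GetCodeSize(addr [20]byte) (int, bool, error) {
	n, ok := g.sizes[addr]
	return n, ok, nil // (0, false, nil) for EOAs and no-code accounts
}

// readAccountCodeSize routes through the fast path when the underlying
// getter supports it and falls back to GetLatest+len otherwise.
func readAccountCodeSize(g getter, addr [20]byte) (int, error) {
	if fast, ok := g.(codeSizeGetter); ok {
		n, _, err := fast.GetCodeSize(addr)
		return n, err
	}
	code, err := g.GetLatest(addr)
	if err != nil {
		return 0, err
	}
	return len(code), nil
}

func main() {
	addr := [20]byte{1}
	slow := slowGetter{code: map[[20]byte][]byte{addr: {1, 2, 3}}}
	fast := fastGetter{slowGetter: slow, sizes: map[[20]byte]int{addr: 3}}

	n1, _ := readAccountCodeSize(slow, addr)
	n2, _ := readAccountCodeSize(fast, addr)
	fmt.Println(n1, n2)
}
```

The design benefit is that the shared getter interface stays untouched: only implementations that can answer cheaply opt in.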

Limitation: capacity is no-op-when-full, not LRU. A separate surgical
commit will swap to real LRU eviction; mirrors the addrToHash fix queued
from the same cross-client writeup.

Tests: 3 new (PopulatedAlongsideBytes, DirectPutAndGet, EmptyHashOrNegativeIsNoOp).
All existing CodeCache tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e.StateCache

The BlockReadAheader has always prefetched BAL-listed (and access-list)
addresses' account/code/storage via a fresh ReaderV3 on a separate RoTx.
Its prefetches warmed OS page cache + RoTx cursors — disconnected from
the process-global cache.StateCache that SharedDomains.GetLatest probes
on the EVM hot path. The two layers were two separate caches; nothing
the prefetcher loaded ever reached the EVM's lookup path.

Reth's structural advantage on EXTCODESIZE-loop benches is that its
prewarm writes to the same hashmap the EVM reads from
(crates/engine/execution-cache/src/cached_state.rs:663). When EVM enters,
every BAL-listed addr's first read is a 20 ns cache probe — no file
accessor stack, no decompression CPU. PR #21128 swapped this from
mini-moka to a lock-free fixed-cache for a measured +10.8 % mgas/s.

This commit closes the equivalent gap on Erigon: a thin cache-populating
TemporalGetter wrapper writes successful reads through to cache.StateCache
as a side effect. ReaderV3 is unchanged; the wrapper sits in front. When
the prefetcher already has the codeHash from a preceding account read,
the next CodeDomain read routes through StateCache.PutCodeWithHash so
the L2b (ethHash → bytes) + size-cache layers fill alongside the bare
addr-keyed L1.

Wiring:
- BlockReadAheader.SetStateCache(*cache.StateCache) setter.
- ExecModule construction calls readAheader.SetStateCache(domainCache),
  so the same StateCache the FCU/canonical path wires onto SD is the one
  the prefetcher warms.
- cachePopulatingGetter wraps the worker's ttx; both BAL-warming and
  tx-warming paths gain the same treatment.
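
A rough sketch of the write-through wrapper idea follows. It uses string keys and a plain map instead of the real domain/key types, and `stateCache`, `fileGetter`, and the single-method `getter` interface are hypothetical simplifications; only the `cachePopulatingGetter` name is taken from the text.

```go
package main

import "fmt"

// getter is a hypothetical, single-method stand-in for the temporal getter.
type getter interface {
	GetLatest(key string) ([]byte, error)
}

// stateCache stands in for the process-global cache the EVM path probes.
type stateCache struct{ m map[string][]byte }

func (c *stateCache) Get(k string) ([]byte, bool) { v, ok := c.m[k]; return v, ok }
func (c *stateCache) Put(k string, v []byte)      { c.m[k] = v }

// fileGetter simulates the expensive file accessor stack and counts reads.
type fileGetter struct {
	data  map[string][]byte
	reads int
}

func (g *fileGetter) GetLatest(k string) ([]byte, error) {
	g.reads++
	return g.data[k], nil
}

// cachePopulatingGetter sits in front of the inner getter and writes
// successful, non-empty reads through to the shared cache as a side effect.
type cachePopulatingGetter struct {
	inner getter
	cache *stateCache
}

func (g cachePopulatingGetter) GetLatest(k string) ([]byte, error) {
	v, err := g.inner.GetLatest(k)
	if err == nil && len(v) > 0 {
		g.cache.Put(k, v)
	}
	return v, err
}

func main() {
	backend := &fileGetter{data: map[string][]byte{"acct": {0x01}}}
	shared := &stateCache{m: make(map[string][]byte)}
	prefetcher := cachePopulatingGetter{inner: backend, cache: shared}

	// The prefetcher reads ahead of execution...
	_, _ = prefetcher.GetLatest("acct")

	// ...so the execution path finds the value without touching the backend.
	_, hit := shared.Get("acct")
	fmt.Println("cache hit on hot path:", hit, "backend reads:", backend.reads)
}
```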

Fgprof on the EXTCODESIZE-EXISTING_CONTRACT-30M bench had shown 95 % of
EVM wall-clock in seg.Getter.nextPos (Huffman decompression of code
values). With this commit, every BAL-listed addr's lookup should hit
the cache and skip the file accessor stack entirely — eliminating the
dominant cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ash LRU

Two surgical commits bundled (both touch the code-read hot path):

1. IntraBlockState.GetCodeSize now loads the full bytes via
   stateReader.ReadAccountCode on first touch and populates
   stateObject.code, so subsequent same-addr EXTCODESIZE /
   EXTCODEHASH / CALL within the tx are in-struct slice-len calls
   (~50 ns), not full reader round-trips. Mirrors geth's pattern
   at core/state/state_object.go ~Code() — pay one read per addr
   per tx, free for the rest.

2. CodeCache.addrToHash switched from a no-op-when-full
   maphash.Map[versionedAddressID] to an LRU
   lru.Cache[[20]byte, versionedAddressID] (hashicorp/golang-lru/v2,
   already imported elsewhere). Cap derived from the existing byte
   budget at ~28 bytes/entry (~580 k entries for the 16 MB default).
   Fresh-address workloads (mainnet sees thousands of new addrs per
   block) now warm up the addr layer over time instead of silently
   dropping new entries forever; matches geth's lru.Cache at
   core/state/database_code.go.

   The hashToCode layer is unchanged (content-addressed bytes,
   immutable, byte-capped with new-entry no-op when full — the same
   semantic as before since code bytes by codeHash never change).

Bench on the EXTCODESIZE-EXISTING_CONTRACT-30M family: 62.34 mgas/s
(was 61.50). The marginal gain is small on this bench because BAL
prefetch already populates the cache layers; neither lever fires
heavily. The expected wins are on non-BAL workloads where
EXTCODESIZE-loop patterns repeat within a tx (#1) and
fresh-address-churn mainnet blocks fill the addr layer (#2).

Updated TestCodeCache_AddrCapacityLimit to assert LRU eviction
(was asserting no-op-when-full); the prior behaviour was the bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Nethermind-style addr → 32-byte codeHash LRU sitting above
SharedDomains.codeHashForAddr. When the EVM-known codeHash for an
address has already been resolved once, subsequent lookups skip the
entire AccountsDomain chain (sd.mem → sd.parent.mem → sd.stateCache →
tx.GetLatest) and the account-RLP decode.

Wiring:
- CodeCache adds addrToEthHash *lru.Cache[[20]byte, [32]byte] sized
  to the existing addrCapacityB budget; methods GetAddrCodeHash /
  PutAddrCodeHash / DeleteAddrCodeHash.
- StateCache wrappers route to the CodeCache instance.
- SD.codeHashForAddr probes the LRU first; on miss falls through to
  the existing chain and populates on the way out (including the
  zero-hash sentinel for missing-or-EOA accounts — repeat lookups
  return immediately).
- Invalidation: SD.DomainPut for AccountsDomain drops the entry
  (CREATE / CREATE2-replace path); SD.DomainDel for AccountsDomain
  also drops the entry (SELFDESTRUCT); StateCache.RevertWithDiffset
  drops on unwind.
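
The probe, populate, and invalidate flow in the bullets above can be sketched as follows. This is an illustration only: it uses an unbounded map where the commit uses an LRU, and `resolver`, `accounts`, `putAccount`, and `deleteAccount` are hypothetical names standing in for SharedDomains and its DomainPut / DomainDel paths.

```go
package main

import "fmt"

// zeroHash is the sentinel cached for missing-or-EOA accounts.
var zeroHash [32]byte

// resolver is a hypothetical stand-in for the codeHashForAddr lookup chain.
type resolver struct {
	cache      map[[20]byte][32]byte // addr -> codeHash (an LRU in the real code)
	accounts   map[[20]byte][32]byte // simulates the AccountsDomain chain
	chainWalks int                   // how often the slow chain was consulted
}

func newResolver() *resolver {
	return &resolver{
		cache:    make(map[[20]byte][32]byte),
		accounts: make(map[[20]byte][32]byte),
	}
}

// codeHashForAddr probes the cache first and populates it on the way out,
// including the zero-hash sentinel for unknown accounts.
func (r *resolver) codeHashForAddr(addr [20]byte) [32]byte {
	if h, ok := r.cache[addr]; ok {
		return h
	}
	r.chainWalks++
	h := r.accounts[addr] // zero value when the account is missing
	r.cache[addr] = h
	return h
}

// putAccount and deleteAccount drop the cached entry so the next lookup
// re-resolves it (the CREATE / SELFDESTRUCT cases).
func (r *resolver) putAccount(addr [20]byte, h [32]byte) {
	r.accounts[addr] = h
	delete(r.cache, addr)
}

func (r *resolver) deleteAccount(addr [20]byte) {
	delete(r.accounts, addr)
	delete(r.cache, addr)
}

func main() {
	r := newResolver()
	addr := [20]byte{7}

	r.codeHashForAddr(addr) // walks the chain, caches the sentinel
	r.codeHashForAddr(addr) // served from the cache
	fmt.Println("walks after two probes of a missing account:", r.chainWalks)

	r.putAccount(addr, [32]byte{0xbb}) // contract created: entry invalidated
	fmt.Println("resolved after create:", r.codeHashForAddr(addr) != zeroHash)
}
```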

Helps non-BAL workloads where codeHashForAddr is currently the cold
account-domain probe. On the EXISTING_CONTRACT bench (BAL prefetch
already populates everything), this is within noise; the lever is for
mainnet workloads where many addresses miss the BAL-prefetch window
but the cache is warm from prior lookups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cache-populating wrapper on the read-ahead worker's TemporalTx
previously gated cache writes on `len(v) > 0`. That dropped negative
results — i.e. missing accounts, empty storage slots, no-code probes —
on the floor. Repeated probes of the same missing address re-paid the
file accessor stack walk every time, instead of hitting a cached
negative entry.
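
The effect of the `len(v) > 0` gate can be shown with a small sketch. The `cache`, `backend`, and `lookup` names are hypothetical, and the `cacheNegatives` switch exists only so the before and after behaviour can be compared side by side.

```go
package main

import "fmt"

// backend simulates the file accessor stack and counts how often it is walked.
type backend struct {
	data  map[string][]byte
	walks int
}

func (b *backend) read(k string) []byte {
	b.walks++
	return b.data[k]
}

// cache stores values by key. A key that is present with an empty value is a
// cached negative result: "known to be missing".
type cache struct {
	m              map[string][]byte
	cacheNegatives bool
}

// lookup returns the value for k, populating the cache on a miss.
func lookup(c *cache, b *backend, k string) []byte {
	if v, ok := c.m[k]; ok {
		return v // hit, possibly a cached negative
	}
	v := b.read(k)
	if len(v) > 0 || c.cacheNegatives {
		c.m[k] = v
	}
	return v
}

func main() {
	for _, neg := range []bool{false, true} {
		b := &backend{data: map[string][]byte{}}
		c := &cache{m: make(map[string][]byte), cacheNegatives: neg}
		for i := 0; i < 3; i++ {
			lookup(c, b, "missing-addr")
		}
		fmt.Printf("cacheNegatives=%v backend walks=%d\n", neg, b.walks)
	}
}
```

The presence check uses Go's comma-ok map lookup, which is what lets an empty value act as a first-class cached answer rather than being confused with a miss.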

Mirrors the revm pattern that drives reth's 1700-3400 mgas/s on
account_access NON_EXISTING / EXISTING_EOA variants: revm represents
a missing address as a real CacheAccount{ account: None, status:
LoadedNotExisting } and reth's ExecutionCache.account_cache uses
FixedCache<Address, Option<Account>> where None is a first-class
cacheable value. The net effect on the reth path: BAL prewarm calls
basic_account once → returns None → cache hit forever for that addr.

The cycle-2 sweep on account_access[EXTCODESIZE/NON_EXISTING/30M]
showed 3.65 → 494 mgas/s without this fix; with the fix the same
bench reports 508 mgas/s (within run-to-run noise but trending right).
Most of the win was already captured by the readAhead-populates-
cache.StateCache wiring (commit cbe9044) and the balcache port
(d41e2e8) — those raised the cache hit rate on populated entries
enough that the EVM rarely fell through to the file accessor on
this bench. The fix is mechanically correct regardless and should
matter more on workloads with mixed populated / negative probes
across blocks.

See agentspecs/reth-missing-eoa-fastpath-2026-05-15.md for the
detailed mechanism analysis and the three concrete copy-able
patterns from reth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…unters

GenericCache.Put has no eviction policy. When the byte budget is reached,
new keys are silently dropped until Clear/ClearWithHash/ValidateAndPrepare-
mismatch resets the cache. On a long-running node this manifests as a
monotonic miss-rate climb that's hard to attribute without instrumentation.

Add two counters next to hits/misses:
  inserts - new keys accepted
  dropped - new keys rejected at the budget check (the existing branch
            at the new-key cap; not a behaviour change)

PrintStatsAndReset logs both. Sets up the diagnostic baseline before the
eviction-policy swap in the follow-up commits on this branch.
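
For illustration, the fill-and-freeze behaviour and the two new counters might be sketched like this. The type and its size accounting (key length plus value length) are assumptions, not the GenericCache implementation, and the sketch is single-threaded.

```go
package main

import "fmt"

// fillAndFreezeCache is a hypothetical cache with a byte budget and no
// eviction policy: once full, new keys are rejected.
type fillAndFreezeCache struct {
	m             map[string][]byte
	capacityBytes int
	currentSize   int
	inserts       uint64 // new keys accepted
	dropped       uint64 // new keys rejected at the budget check
}

func newFillAndFreezeCache(capacityBytes int) *fillAndFreezeCache {
	return &fillAndFreezeCache{m: make(map[string][]byte), capacityBytes: capacityBytes}
}

// Put accepts a new key only if it fits in the remaining budget.
func (c *fillAndFreezeCache) Put(k string, v []byte) {
	if old, ok := c.m[k]; ok {
		// Updating an existing key is always allowed in this sketch.
		c.currentSize += len(v) - len(old)
		c.m[k] = v
		return
	}
	size := len(k) + len(v)
	if c.currentSize+size > c.capacityBytes {
		c.dropped++ // previously invisible; now counted
		return
	}
	c.m[k] = v
	c.currentSize += size
	c.inserts++
}

func main() {
	c := newFillAndFreezeCache(10)
	c.Put("a", []byte("1234")) // 5 bytes
	c.Put("b", []byte("1234")) // 5 bytes, budget now full
	c.Put("c", []byte("1234")) // rejected
	fmt.Printf("inserts=%d dropped=%d size=%d\n", c.inserts, c.dropped, c.currentSize)
}
```

A steadily rising `dropped` count alongside a flat `inserts` count is the signature of the cliff described above.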

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the maphash.Map[T] backing store in GenericCache with
freelru.ShardedLRU[uint64, entry[T]] (same lib as db/state/cache.go;
already in go.mod). Adds a Mode constructor flag:

  - ModeEvictLRU (default): per-shard LRU evicts the oldest entry on
    insert when its slot cap is reached. OnEvict drops bytes from
    currentSize.
  - ModeNoOp: preserves the historical fill-and-freeze behaviour
    (silently drop new keys at the byte cap; counted via dropped).
    Kept as the diagnostic baseline so the regression bench can
    compare A/B.
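
The difference between the two modes can be sketched with a single-shard LRU built on the standard library. This is not freelru and has no sharding or byte accounting; the `cache` type and its fields are hypothetical, and only the `ModeEvictLRU` / `ModeNoOp` names come from the commit.

```go
package main

import (
	"container/list"
	"fmt"
)

type Mode int

const (
	ModeEvictLRU Mode = iota // evict the least recently used entry when full
	ModeNoOp                 // fill-and-freeze: drop new keys when full
)

type item struct {
	key   string
	value int
}

// cache is a single-shard sketch with an entry-count cap (assumed >= 1).
type cache struct {
	mode                        Mode
	capEntries                  int
	order                       *list.List // front = most recently used
	items                       map[string]*list.Element
	inserts, evictions, dropped int
}

func newCache(mode Mode, capEntries int) *cache {
	return &cache{mode: mode, capEntries: capEntries,
		order: list.New(), items: make(map[string]*list.Element)}
}

func (c *cache) Get(k string) (int, bool) {
	el, ok := c.items[k]
	if !ok {
		return 0, false
	}
	c.order.MoveToFront(el)
	return el.Value.(item).value, true
}

func (c *cache) Put(k string, v int) {
	if el, ok := c.items[k]; ok {
		el.Value = item{k, v}
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.capEntries {
		if c.mode == ModeNoOp {
			c.dropped++
			return
		}
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(item).key)
		c.evictions++
	}
	c.items[k] = c.order.PushFront(item{k, v})
	c.inserts++
}

func main() {
	for _, m := range []Mode{ModeEvictLRU, ModeNoOp} {
		c := newCache(m, 2)
		c.Put("a", 1)
		c.Put("b", 2)
		c.Put("c", 3) // cache is full at this point
		_, hasA := c.Get("a")
		_, hasC := c.Get("c")
		fmt.Printf("mode=%d a=%v c=%v evictions=%d dropped=%d\n",
			m, hasA, hasC, c.evictions, c.dropped)
	}
}
```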

Per-shard eviction is a known trade-off of freelru.ShardedLRU —
RemoveOldest is shard-local, not globally LRU. Matches the trade-off
db/state/cache.go / execution/cache/code_cache.go /
execution/balcache/balcache.go already accept. LFU (W-TinyLFU, the
policy reth uses) is scan-resistant by design and would slot in
behind the same Mode wrapper as a follow-up; the seam is documented
at policy.go.

Key shape: pre-hash via common/maphash.Hash (Go's randomized stdlib
hasher, already used by the previous maphash.Map) into uint64; entry
stores the full key for collision check. Same pattern as
db/state/cache.go.
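
A minimal sketch of the "pre-hash to uint64, keep the full key for a collision check" shape follows. It uses the standard library's `hash/maphash` as a stand-in for `common/maphash`, a plain map instead of the sharded LRU, and an injectable hash function so that collisions can be forced in a test; all names are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"hash/maphash"
)

// entry stores the full key next to the value so that two keys hashing to
// the same uint64 are never confused.
type entry struct {
	key   []byte
	value int
}

type hashedCache struct {
	hash func([]byte) uint64 // injectable so a test can force collisions
	m    map[uint64]entry
}

func newHashedCache() *hashedCache {
	seed := maphash.MakeSeed() // randomized per process
	return &hashedCache{
		hash: func(b []byte) uint64 { return maphash.Bytes(seed, b) },
		m:    make(map[uint64]entry),
	}
}

// Put overwrites whatever currently occupies the hashed slot.
func (c *hashedCache) Put(key []byte, v int) {
	c.m[c.hash(key)] = entry{key: append([]byte(nil), key...), value: v}
}

// Get reports a hit only if the stored full key matches the requested key.
func (c *hashedCache) Get(key []byte) (int, bool) {
	e, ok := c.m[c.hash(key)]
	if !ok || !bytes.Equal(e.key, key) {
		return 0, false
	}
	return e.value, true
}

func main() {
	c := newHashedCache()
	c.Put([]byte("account-1"), 42)
	v, ok := c.Get([]byte("account-1"))
	fmt.Println(v, ok)
}
```

Without the stored key, a hash collision would silently return another key's value, which is not acceptable for state reads; with it, a collision degrades to a miss.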

Byte-budget translation: per-domain avg-entry constants in
state_cache.go (avgAccountEntryBytes / avgStorageEntryBytes /
avgCommitmentEntryBytes) — account / storage are near-fixed sizes so
the translation is reliable. capacityBytes becomes a sizing hint
plus telemetry (SizeBytes / PrintStatsAndReset). Code domain is
unchanged; CodeCache wraps its own LRUs.
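
The byte-budget translation amounts to a division by an assumed average entry size. The sketch below is an illustration: the function name, the floor of one entry, and the 128-byte average used in `main` are all hypothetical and are not the constants in state_cache.go.

```go
package main

import "fmt"

// entriesForBudget converts a byte budget into an entry-count cap using an
// assumed average entry size. The floor of one entry is an assumption of
// this sketch, chosen so a tiny budget still yields a usable cache.
func entriesForBudget(capacityBytes, avgEntryBytes uint64) uint64 {
	if avgEntryBytes == 0 {
		return 0
	}
	n := capacityBytes / avgEntryBytes
	if n == 0 {
		n = 1
	}
	return n
}

func main() {
	const budget = 100 << 20      // 100 MiB
	const hypotheticalAvg = 128   // illustrative value only
	fmt.Println(entriesForBudget(budget, hypotheticalAvg))
}
```

Because the cap is derived from an average, actual memory use drifts from the budget when entry sizes vary, which is why the byte figure is described as a sizing hint rather than a hard limit.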

Adds metrics: inserts, evictions, dropped — all exposed in
PrintStatsAndReset alongside the existing hits / misses / hit_rate.
Mode is also logged.

Touches one external call site: execution/vm/contract.go's
jumpDestCache now constructs with ModeEvictLRU.

Tests: TestDomainCache_PutCapacityLimit renamed to ..._NoOpMode and
asserts the fill-and-freeze contract under explicit ModeNoOp. New
TestDomainCache_PutEvictsWhenFull_EvictMode asserts eviction under
ModeEvictLRU using a small entry-count cap (the byte→entry
translation is approximate; the test uses the entry-count knob via
the in-package newGenericCacheEntries constructor to make the
assertion deterministic).

Pre-existing lint issues on mh/sd-code-cache (intra_block_state.go
nilness, preload_parallel.go prealloc) are surfaced by lint
non-determinism but are out of this commit's scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single env knob read once at NewStateCache. Default ModeEvictLRU,
recognised override "noop" (for the regression-bench baseline so
ModeEvictLRU and ModeNoOp can be compared on the same binary).
Unrecognised values fall back to evict with a warn log.

ModeNoOp engagement is logged at info level because the
fill-and-freeze behaviour is a deliberate diagnostic state, not a
production setting.

Pattern matches db/state/cache.go's D_LRU_ENABLED / D_LRU knobs
(dbg.EnvString from common/dbg).
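
The override logic could be sketched as below. This uses `os.Getenv` as a stand-in for `dbg.EnvString`, and the `parseMode` helper, its case-insensitive matching, and the exact log wording are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"strings"
)

type Mode int

const (
	ModeEvictLRU Mode = iota // default
	ModeNoOp                 // diagnostic fill-and-freeze baseline
)

// parseMode maps the raw env value to a Mode. The second return value is
// false when the value was set but not recognised, so the caller can warn.
func parseMode(raw string) (mode Mode, recognised bool) {
	switch strings.ToLower(strings.TrimSpace(raw)) {
	case "":
		return ModeEvictLRU, true // unset: use the default
	case "noop":
		return ModeNoOp, true
	default:
		return ModeEvictLRU, false // fall back to evict
	}
}

func main() {
	// Read once at construction time, mirroring the "read once" behaviour.
	mode, ok := parseMode(os.Getenv("STATE_CACHE_MODE"))
	if !ok {
		log.Println("WARN: unrecognised STATE_CACHE_MODE, falling back to evict")
	}
	if mode == ModeNoOp {
		log.Println("INFO: state cache running in ModeNoOp (diagnostic baseline)")
	}
	fmt.Println("mode:", mode)
}
```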

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous comment asserted "reth uses W-TinyLFU for state caches" —
that is wrong on the execution hot path. Reth's cross-block state cache
is `fixed-cache` (PR #21128, v1.11.0): a lock-free direct-mapped /
set-associative array with collision-evict semantics. No LRU list, no
LFU sketch. Their published wins (~25% newPayload p50 / +33% gas/s) came
from *removing* LRU/LFU bookkeeping, not adding LFU.
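
To make "direct-mapped with collision-evict semantics" concrete, here is a minimal single-threaded sketch. It is not reth's `fixed-cache` (which is described above as lock-free and possibly set-associative); the types, the power-of-two sizing, and the use of a pre-hashed `uint64` key are assumptions for illustration.

```go
package main

import "fmt"

// slot holds one entry. There is no recency list and no frequency sketch.
type slot struct {
	key   uint64
	value int
	used  bool
}

// fixedCache is a direct-mapped array: each key maps to exactly one slot,
// and inserting a colliding key simply overwrites the previous occupant.
type fixedCache struct {
	slots []slot
	mask  uint64
}

// newFixedCache requires a power-of-two slot count so that indexing is a
// single bitwise AND.
func newFixedCache(n int) *fixedCache {
	if n <= 0 || n&(n-1) != 0 {
		panic("slot count must be a positive power of two")
	}
	return &fixedCache{slots: make([]slot, n), mask: uint64(n - 1)}
}

func (c *fixedCache) Put(key uint64, v int) {
	c.slots[key&c.mask] = slot{key: key, value: v, used: true}
}

func (c *fixedCache) Get(key uint64) (int, bool) {
	s := c.slots[key&c.mask]
	if !s.used || s.key != key {
		return 0, false
	}
	return s.value, true
}

func main() {
	c := newFixedCache(4)
	c.Put(1, 100)
	c.Put(5, 500) // 5&3 == 1: collides with key 1 and evicts it
	_, has1 := c.Get(1)
	v5, has5 := c.Get(5)
	fmt.Println(has1, v5, has5)
}
```

Put and Get each touch one slot and maintain no shared bookkeeping, which is the property the paragraph above credits for the gains from removing LRU/LFU tracking.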

Where reth uses real LRU/LFU it's deliberate and not the execution cache
(schnellru::LruMap for networking; moka in precompile_cache.rs explicitly
configured with eviction_policy(EvictionPolicy::lru())).

The docstring now lists two follow-up policies, both real options:
- ModeEvictFixedCache (reth's actual choice, more interesting structural
  option than LFU)
- ModeEvictLFU (W-TinyLFU; helps mainnet steady-state, not the cycle-2
  bloat fixtures which are pure cold scans)

Decision criterion (per agentspecs/lfu-vs-lru-state-cache-decision-2026-05-15.md):
ship ModeEvictLFU only if a 24h mainnet replay shows current sharded-LRU
hit-rate < 90 % on Account/Storage. Otter is the only credible Go
W-TinyLFU library; ristretto has documented correctness bugs and is
disqualified for an EL hot path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Investigation knob, NOT a permanent default. Account / Storage / Code
each capped at 100 MB so the bench measures layer contributions instead
of being dominated by preallocated cache memory pressure (1 GB / 1 GB /
512 MB defaults push sys past the GC/page-cache pressure band on this
hardware/workload mix).

Permanent defaults stay at 1 GB / 1 GB / 512 MB; this commit will be
reverted or dynamically gated by relative-to-available sizing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mh0lt mh0lt force-pushed the mh/perf-statecache-lru-pr branch from 266e297 to 4a512ce May 25, 2026 07:29
mh0lt and others added 3 commits May 25, 2026 15:09
This PR ships the parallel-exec correctness fixes from
`mh/parallel-exec-fixes` onto the perf stack, packaged as a focused PR
on top of [#21386 (StateCache
LRU)](#21386) which itself
stacks on [#21380 (State Cache
Consolidation)](#21380).

> [!IMPORTANT]
> **Stacks on #21386 → #21380.** Base is `mh/perf-statecache-lru-pr`,
> NOT `main`. Merge order: #21380 → #21386 → this PR.

> [!IMPORTANT]
> **Do not merge until CI is green on both parallel and serial.** Same
> gating rule as #21380 / #21386.

## Scope — 13 commits from `mh/parallel-exec-fixes`

Brought in via a merge commit so the bisection trail is preserved.

| sha | what it fixes |
|---|---|
| `25053e38e9` | parallel SD-of-pre-existing-contract — the 197-line foundational fix |
| `2e2bf3ccc0` | clean exit when single-block batch already covered maxBlockNum |
| `6e451f5ed2` | don't emit StoragePath=0 writes from IBS.Selfdestruct |
| `616a4fa0a8` | clear calc Deleted on a non-SD account write even when zero |
| `d99f2f704d` | gate known parallel-exec failures behind EXEC3_PARALLEL (#21136) |
| `34e83e82b7` | install per-block changeset accumulator before any of the block's writes |
| `b340d7e592` | drop stale sd.mem 'Trim old version entries' comment |
| `629cc23566` | O(1) CollectorWrites fee-balance update, drop dead VersionedWrites.SetBalance |
| `a0ecfc7e12` | first-match-wins in CollectorWrites BalancePath index |
| `445f97e446` | emit EIP-7708 Burn log under parallel-exec when coinbase self-destructs |
| `5e1f5fa901` | mirror ReadAccountData SD-revival check into versionedRead |
| `a5dc83f509` | drop two stale EXEC3_PARALLEL t.Skips |
| `8af901104f` | drop TestReceiptHashFromRPC unit-suite RPC integration test |
## Merge conflicts resolved

3 files, 8 regions — all resolved by keeping HEAD's typed-readset /
per-path revival shape and confirming HEAD already absorbs each fix's
intent. See the merge commit message (`cfc4ec1418`) for the per-region
rationale.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Mark Holt <erigon@dev-bm-e3-ethmainnet-n4.erigon.io>
…govet)

stateObject and s are both verified non-nil earlier in their respective
scopes; the secondary checks at lines 749 and 783 are redundant. govet
nilness check fails on these.