Skip to content

db/state: cache findShortenedKey per commitmentValTransformDomain merge#21303

Open
yperbasis wants to merge 3 commits into
mainfrom
yperbasis/merge-shortkey-cache
Open

db/state: cache findShortenedKey per commitmentValTransformDomain merge#21303
yperbasis wants to merge 3 commits into
mainfrom
yperbasis/merge-shortkey-cache

Conversation

@yperbasis
Copy link
Copy Markdown
Member

@yperbasis yperbasis commented May 20, 2026

Summary

Memoise findShortenedKey lookups within a single commitmentValTransformDomain closure. The merged storage and account files are read-only for the lifetime of the closure, so (fullKey → offset) is invariant — and on mainnet, hot contracts (USDC, WETH, Uniswap routers, etc.) recur across many commitment branches in the same merge range, so the same full keys are hashed into the same merged B-tree many times.

Why

CPU profiling on mainnet tip identified DomainRoTx.commitmentValTransformDomain → BranchData.ReplacePlainKeys → findShortenedKey → BtIndex.Get as ~52 % of merge CPU during [snapshots] aggregated bursts. That's the per-branch plain-key-reference re-pointing pass — for every branch in the merge range, every account/storage key referenced inside it has to be re-located in the newly-merged file's B-tree.

Verified locally on mainnet (live tip, single-step aggregation)

Same datadir, same NVMe, back-to-back runs:

metric unpatched (baseline) patched Δ
[snapshots] aggregated took= (single step) 10.6 s / 11.3 s 9.3 s −15-18 %
concurrent FCU commit= peak during burst 569 ms 314 ms −45 %
DomainRoTx.mergeFiles cum CPU during burst 82 % 28.8 % −65 pp
commitmentValTransformDomain cum CPU 62 % 21.6 % −65 %

Inside the replacer closure post-patch, runtime.mapaccess2_faststr (cache hits) is 19.94 % vs findShortenedKey (cache misses) 76.79 % — most lookups now hit the per-merge map. findShortenedKey → BtIndex.Get no longer appears in the top-25 cumulative.

Wall-time only drops ~15 % even though commitmentValTransformDomain falls ~65 % because the merge runs multiple domains in parallel via errgroup.Go — commitment is one domain among several. The per-merge CPU pressure reduction is what shows up in the FCU side: the burst's amplification of concurrent FCU commits goes from ~+370 ms over baseline (569 vs ~200) to ~+115 ms (314 vs ~200).

Implementation notes

  • Two per-closure map[string]uint64 instances (storage and account). No cross-merge state.
  • String conversion string(auxBuf) copies bytes into the map key — safe against the reused keyBuf slice.
  • Lost-key path left uncached (Crit-logged error condition, rare).
  • No size bound: cache cardinality is bounded naturally by the number of unique plain keys in the merge range. For an 8-step or 16-step cascading merge the cache could grow to a few hundred MB worst-case; if that becomes a concern in production, an LRU is a one-line follow-up.

Test plan

  • go test -short ./db/state/... passes
  • make lint clean (twice — non-deterministic)
  • make erigon builds
  • Live mainnet at tip — verified aggregation took= dropped and concurrent FCU peak commit dropped (numbers above)

🤖 Generated with Claude Code

…main merge

Profiling tip-mode `[snapshots] aggregated step=X took=Ys` bursts on mainnet
identified `DomainRoTx.commitmentValTransformDomain → BranchData.ReplacePlainKeys`
→ findShortenedKey → BtIndex.Get as ~52 % of merge CPU. The merged storage
and account files are searched once per plain-key reference per commitment
branch, and on mainnet hot contracts (USDC, WETH, Uniswap routers, etc.)
recur across many branches in the same merge range — same full key hashed
many times into the same merged file's B-tree.

Add two per-closure maps that memoise the storage- and account-key lookups
within one commitmentValTransformDomain invocation. The merged file is
read-only for the lifetime of the closure, so (fullKey → offset) is
invariant; cache reset isn't needed and bounding the size would only matter
for very high-cardinality merge ranges (unbounded growth is bounded
naturally by the unique-key count in the input). Lost-key path is left
uncached since it's a Crit-logged error condition.

Aims to shrink aggregation wall time (10-15s observed on this hardware) and
therefore the FCU latency elevation that overlaps it.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes commitment-domain merges by memoizing findShortenedKey results within a single commitmentValTransformDomain transformer, reducing repeated B-tree lookups when the same account/storage plain keys recur across many commitment branches in the same merge range.

Changes:

  • Add per-merge in-memory caches for (fullKey → offset) lookups for both storage and account keys.
  • Use the caches inside the branch value replacer to avoid repeated findShortenedKey → BtIndex.Get calls.
Comments suppressed due to low confidence (1)

db/state/domain_committed.go:410

  • In the account replacer, the Crit log prints hex.EncodeToString(shortened), but shortened is the shared output buffer and may not correspond to the key being processed (it can contain a previous replacement). For accurate diagnostics, log the input key (and keep auxBuf as the full key) instead.
					dt.d.logger.Crit("valTransform: replacement for full account key was not found",
						"step", fmt.Sprintf("%d-%d", keyFromTxNum/dt.d.stepSize, keyEndTxNum/dt.d.stepSize),
						"shortened", hex.EncodeToString(shortened), "toReplace", hex.EncodeToString(auxBuf))
					return nil, fmt.Errorf("replacement not found for account  %x", auxBuf)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread db/state/domain_committed.go Outdated
Comment thread db/state/domain_committed.go
…nsformDomain Crit

The "shortened" field in the "replacement not found" Crit logs was printing
the function-scope `shortened` buffer, which is overwritten on every successful
EncodeReferenceKey call and therefore holds the previous successful replacement
(or the zero-initialised buffer) at the moment the failure log fires — unrelated
to the key being diagnosed. Log the input `key` (the shortened reference from
the branch) instead. Pre-existing issue, surfaced by Copilot review on #21303.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis marked this pull request as ready for review May 20, 2026 12:34
@yperbasis yperbasis requested a review from awskii May 20, 2026 12:34
@yperbasis yperbasis requested review from mh0lt, taratorio and wmitsuda May 20, 2026 13:03
Copy link
Copy Markdown
Collaborator

@Giulio2002 Giulio2002 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — bounded maintenance change (63 lines)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants