besu: opt-in Project Leyden AOT-cache image for benchmarking #387

Open: qu0b wants to merge 1 commit into master from qu0b/besu-aot-cache

@qu0b qu0b commented May 26, 2026

What

Adds besu/aot/ which, when BESU_BUILD_AOT=true, generates a Project Leyden AOT cache from the freshly-built besu image and bakes it into a derivative ethpandaops/besu:<tag>-aot image. The JVM then starts pre-warmed, so Besu isn't penalised for JIT/C2 warmup in benchmarkoor runs.

  • besu/aot/generate-aot.sh — runs a container from the just-built image with BESU_OPTS=-XX:AOTCacheOutput=… against a finite training workload (default besu blocks import), then builds + pushes the -aot image.
  • besu/aot/Dockerfile: FROMs that exact image, COPYs the cache to /opt/besu/aot/besu.aot, defaults BESU_OPTS=-XX:AOTCache=….
  • besu/build.sh — calls it only when BESU_BUILD_AOT=true. Default builds are unchanged (opt-in).
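
The opt-in gate and the training run described above can be sketched as follows. This is a dry run that only prints the command, not the contents of generate-aot.sh. The helper names, the bind mount, the example work directory, and the trailing `blocks import` arguments are assumptions for illustration; the `/aot/besu.aot` output path is the one quoted from the README.

```shell
#!/bin/sh
# Sketch only: prints the training command instead of running it,
# so it can be tried on a machine without Docker.

# Hypothetical helper: succeeds only when the AOT step is explicitly enabled.
aot_enabled() {
  [ "${BESU_BUILD_AOT:-false}" = "true" ]
}

# Hypothetical helper: prints the training-run command.
#   $1 = image to train against, $2 = host work dir, rest = training arguments.
# Assumption: the work dir is bind-mounted at /aot so the recorded cache
# survives after the container exits.
print_training_cmd() (
  image="$1"; workdir="$2"; shift 2
  printf 'docker run --rm -v %s:/aot -e BESU_OPTS=-XX:AOTCacheOutput=/aot/besu.aot %s %s\n' \
    "$workdir" "$image" "$*"
)

if aot_enabled; then
  print_training_cmd "ethpandaops/besu:bal-devnet-7" "/tmp/aot-work" blocks import
else
  echo "BESU_BUILD_AOT is not true; skipping the -aot image"
fi
```

Running it with `BESU_BUILD_AOT=true` in the environment prints the command; anything else takes the skip branch, which mirrors the default-off behaviour.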

Why this resolves the version chicken-and-egg

The Besu team's concern was that baking the cache into a new image creates a new "version" that invalidates it. It doesn't: the Leyden cache is validated against the besu classpath/jar, not docker layers. The derivative FROMs the precise image the cache was recorded against and only adds a data file, so the jar is byte-identical and the cache stays valid.
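
One way to sanity-check the "byte-identical jar" claim is to fingerprint the jar tree on both sides and compare. The sketch below is a stand-in check, not what the JVM itself does when it validates the cache. `jar_fingerprint` is a hypothetical helper, `sha256sum` from GNU coreutils is assumed, and the two directories stand for jar trees copied out of the training image and the derivative image (for example with `docker cp`), which is left out here.

```shell
#!/bin/sh
# Hypothetical helper: prints one digest summarising every *.jar under a directory.
# Hashes each jar, sorts the "hash  path" lines for a stable order, then hashes the list.
jar_fingerprint() (
  cd "$1" || exit 1
  find . -name '*.jar' -type f -exec sha256sum {} + | LC_ALL=C sort | sha256sum | cut -d' ' -f1
)

# Demo with throwaway directories standing in for the two images' jar trees.
trained=$(mktemp -d); derived=$(mktemp -d)
printf 'same bytes' > "$trained/besu.jar"
printf 'same bytes' > "$derived/besu.jar"

if [ "$(jar_fingerprint "$trained")" = "$(jar_fingerprint "$derived")" ]; then
  echo "jars identical"
else
  echo "jars differ"
fi

rm -rf "$trained" "$derived"
```

A rebuilt jar with different bytes changes the fingerprint, which is the situation the cache would not survive.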

Measured by the Besu team on bal-devnet-7: first block 32.9 → 159.5 Mgas/s, warm block 154.7 → 233.8 Mgas/s. Benchmarking only — not a mainnet recommendation.
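
For reference, the ratios implied by those figures can be worked out with a small helper (`speedup` is a hypothetical name; the inputs are the Mgas/s numbers quoted above):

```shell
#!/bin/sh
# speedup BEFORE AFTER: prints AFTER / BEFORE to two decimals.
speedup() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.2fx\n", after / before }'
}

speedup 32.9 159.5    # first block: prints 4.85x
speedup 154.7 233.8   # warm block:  prints 1.51x
```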

Validation

Tested end-to-end on amd64 (Temurin 25.0.3):

  • -XX:AOTCacheOutput records a cache from the published ethpandaops/besu:bal-devnet-7.
  • Derivative image builds; COPY --chown=besu:besu OK.
  • On startup: Opened AOT cache /opt/besu/aot/besu.aot, then Using AOT-linked classes: true.
  • generate-aot.sh run end-to-end with BESU_AOT_PUSH=false (build-only) → rc 0.

This validates the plumbing; warmup quality depends on a representative training corpus.
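
The startup check can be automated along these lines. This is a minimal sketch: `aot_cache_loaded` is a hypothetical helper, and the sample log is assembled from the two messages quoted above rather than captured from a real run.

```shell
#!/bin/sh
# Hypothetical helper: reads a log on stdin and succeeds only if both
# startup markers are present.
aot_cache_loaded() (
  log=$(cat)
  printf '%s\n' "$log" | grep -F -q 'Opened AOT cache' &&
    printf '%s\n' "$log" | grep -F -q 'Using AOT-linked classes: true'
)

# Sample input built from the messages in the PR description (not real output).
sample_log='Opened AOT cache /opt/besu/aot/besu.aot
Using AOT-linked classes: true'

if printf '%s\n' "$sample_log" | aot_cache_loaded; then
  echo "AOT cache in use"
else
  echo "AOT cache NOT in use"
fi
```

Against a real container the input would come from something like `docker logs <container> 2>&1 | aot_cache_loaded`.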

Known follow-ups (intentionally out of scope)

  • CI doesn't enable it yet — the besu workflow / deploy action don't set BESU_BUILD_AOT or pass arbitrary env, so the -aot image is only produced by a manual build or a later CI change. Opt-in/default-off keeps CI safe today.
  • Training corpus is a real choice — supply via BESU_AOT_BLOCKS + BESU_AOT_GENESIS, or BESU_AOT_TRAIN_CMD. Script fails fast without it.
  • Multi-arch — the manifest job doesn't stitch an -aot manifest; treat <tag>-aot as amd64 (benchmarks run amd64).
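
The fail-fast rule for the training corpus might look roughly like this. The variable names are the ones listed above; `check_training_corpus`, the mode labels it prints, and the precedence given to `BESU_AOT_TRAIN_CMD` are assumptions for illustration, not the script's actual code.

```shell
#!/bin/sh
# Hypothetical helper: prints which training mode applies, or fails with a
# message on stderr when no corpus has been supplied.
check_training_corpus() {
  if [ -n "${BESU_AOT_TRAIN_CMD:-}" ]; then
    echo "custom-command"          # full override (assumed to win)
    return 0
  fi
  if [ -n "${BESU_AOT_BLOCKS:-}" ] && [ -n "${BESU_AOT_GENESIS:-}" ]; then
    echo "blocks-import"           # default workload needs both files
    return 0
  fi
  echo "error: set BESU_AOT_BLOCKS and BESU_AOT_GENESIS, or BESU_AOT_TRAIN_CMD" >&2
  return 1
}

# A real script would exit here; the demo just reports the outcome.
mode=$(check_training_corpus) || mode="missing"
echo "training mode: $mode"
```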

Consumed in benchmarkoor-tests via a besu-bal-*-aot instance (separate PR).

Adds besu/aot/ which, when BESU_BUILD_AOT=true, generates a Leyden AOT
cache from the freshly-built besu image and bakes it into a derivative
ethpandaops/besu:<tag>-aot image. The JVM then starts pre-warmed, so Besu
isn't penalised for JIT/C2 warmup in benchmarkoor runs.

The derivative FROMs the exact image the cache was recorded against and
only COPYs in a data file, so the jar is byte-identical and the cache
stays valid -- this resolves the version chicken-and-egg the Besu team
flagged. Default builds are unchanged (opt-in).

Training corpus is supplied via BESU_AOT_BLOCKS/BESU_AOT_GENESIS (or a
full BESU_AOT_TRAIN_CMD override); the script fails fast without it.

qu0b-reviewer Bot commented May 26, 2026

🤖 qu0b-reviewer

The code is clean. The trap correctly accumulates cleanup (both the temp dir and the transient besu.aot work copy). All exit paths are accounted for. The implementation is sound.

Summary

PR #387 adds opt-in Project Leyden AOT-cache support for Besu, targeting benchmarking scenarios where JVM warmup penalizes Besu vs. native clients. It introduces:

  • besu/aot/Dockerfile — thin derivative that bakes in the AOT cache file and sets BESU_OPTS to load it by default.
  • besu/aot/generate-aot.sh — orchestrates the training run (JVM records the cache on finite workload exit), then builds and pushes the -aot tag.
  • besu/build.sh — appends a conditional call to generate-aot.sh when BESU_BUILD_AOT=true.
  • besu/aot/README.md — documents the rationale, chicken-and-egg resolution, caveats, and local usage.

The implementation is correct: the Dockerfile FROMs the precise image used for training, so the jar is byte-identical and the cache stays valid (explained clearly in the docblock). The trap pattern correctly accumulates cleanup across two stages. source_git_commit_hash flows correctly from CI → build.sh → generate-aot.sh for pinned-tag creation. The README's warnings about arch/JDK specificity, multi-arch manifest limitations, and the benchmarking-only intent are all appropriate disclosures.
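
To make the "thin derivative" concrete, here is a sketch that renders such a Dockerfile for a given base image. It is not the file from this PR: `render_aot_dockerfile` is a hypothetical helper, and the `-XX:AOTCache` value is assumed to be the COPY destination, since the PR description elides it.

```shell
#!/bin/sh
# Hypothetical helper: prints a derivative Dockerfile for the image passed as $1.
# Assumption: besu.aot sits in the build context next to the Dockerfile.
render_aot_dockerfile() {
  base_image="$1"
  cat <<EOF
FROM ${base_image}
COPY --chown=besu:besu besu.aot /opt/besu/aot/besu.aot
ENV BESU_OPTS="-XX:AOTCache=/opt/besu/aot/besu.aot"
EOF
}

render_aot_dockerfile "ethpandaops/besu:bal-devnet-7"
```

Because the only additions are one data file and one environment variable, every layer that contains the jars comes unchanged from the base image.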

One pre-integration concern: the manifest job silently skips -aot tags (generated outside the standard deploy action). This isn't a bug in this PR — the README already calls it out — but whoever enables BESU_BUILD_AOT=true in CI needs to be aware that the manifest job will only stitch the non-AOT multi-arch manifest.

Suggestions

  • besu/aot/generate-aot.sh: the second trap uses ; without a space before rm -f, so it still runs as a single compound command, and the first trap also uses ;, so the quoting is consistent. No issue here; it follows the existing script style.
  • CI enablement — when BESU_BUILD_AOT=true is flipped on in production CI, the manifest job will silently ignore -aot tags. Consider either documenting this limitation in the workflow YAML comment, or adding a note to the manifest job that it does not cover AOT images.
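
The trap accumulation referred to in this review can be illustrated with a small sketch. It is not the code from generate-aot.sh: `two_stage_cleanup` and its variable names are hypothetical. The point is that a second `trap ... EXIT` replaces the first, so it has to restate the earlier cleanup as well.

```shell
#!/bin/sh
# Hypothetical two-stage cleanup, run in a subshell so the EXIT trap fires
# when the function body finishes. Prints the temp dir it used.
two_stage_cleanup() (
  tmpdir=$(mktemp -d)
  trap 'rm -rf "$tmpdir"' EXIT                      # stage 1: temp dir only

  workcopy="$1"
  : > "$workcopy"                                   # transient work copy appears
  trap 'rm -rf "$tmpdir"; rm -f "$workcopy"' EXIT   # stage 2: both, re-registered

  echo "$tmpdir"
)

base=$(mktemp -d)
used=$(two_stage_cleanup "$base/besu.aot")
[ ! -e "$used" ] && [ ! -e "$base/besu.aot" ] && echo "both cleaned up"
rm -rf "$base"
```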

Reviewed @ a523a533
"Every system eventually becomes legacy."

Comment thread besu/aot/README.md
1. `besu/build.sh` builds and pushes the normal image as today.
2. When `BESU_BUILD_AOT=true`, it then calls `besu/aot/generate-aot.sh`, which:
- runs a container from the **just-built image** with
`BESU_OPTS=-XX:AOTCacheOutput=/aot/besu.aot` against a finite training

What is a finite training workload?

Comment thread besu/aot/README.md

The Besu team's concern was that baking the cache into a new image creates a new
"version" that invalidates the cache. It does not: the Leyden cache is validated
against the **besu classpath/jar**, not against docker layers. Because the

Building Besu to generate a Docker image produces new jars for some of the project's dependencies, e.g. evmTool: besu-evmTool:26.5-develop-1612ec5. So the version of the jar that Besu referenced during training will differ from the one used during execution.
