
docs: add benchmarking blog posts and performance reference page#254

Draft
SamBarker wants to merge 6 commits into kroxylicious:main from SamBarker:blog/benchmarking-the-proxy

Conversation

@SamBarker
Member

Summary

  • Adds two blog posts about benchmarking Kroxylicious proxy overhead:
    • [May 1] "Does my proxy look big in this cluster?" — operator-focused: methodology, passthrough and encryption results, sizing guidance
    • [May 8] "Benchmarking a Kafka proxy: the engineering story" — engineer-focused: OMB harness, flamegraphs (interactive iframes), bugs found in own tooling, cluster incident
  • Adds a /performance/ reference page summarising key numbers and linking to both posts
  • Adds interactive async-profiler flamegraphs as self-contained HTML assets
  • Updates overview.markdown with headline performance figures and a link to the reference page

Status

Draft — the posts are first drafts. Known open items:

  • Per-connection scaling section in Post 1 needs the TODO placeholder replaced once 4-core sweep data is available and the scaling picture is better understood
  • Post 2 has a stub section for 4-core validation results (pending sweep completion)
  • Post 2 tone has not yet received the same voice treatment as Post 1

Test plan

  • Run ./run.sh and verify site renders at http://127.0.0.1:4000/
  • Check both blog posts render correctly including flamegraph iframes
  • Check /performance/ page renders with correct tables
  • Check cross-links between posts and to /performance/ work

🤖 Generated with Claude Code

SamBarker added 3 commits May 1, 2026 16:24
Covers methodology, test environment, passthrough proxy results,
encryption latency and throughput ceiling, the per-connection scaling
insight, and sizing guidance. Includes a TODO placeholder for the
connection sweep results before publication.

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Covers why we chose OMB over Kafka's own tools, the benchmark harness
we built (Helm chart, orchestration scripts, JBang result processors),
workload design rationale, CPU flamegraphs with embedded interactive
iframes, the per-connection ceiling discovery, bugs found in our own
tooling, and the cluster recovery incident.

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Adds /performance/ as a dedicated quick-reference page with headline
benchmark numbers, comparison tables, and sizing guidance, linked from
both blog posts. Updates the existing Performance section in overview.markdown
with the key headline numbers and a link to the full reference page.

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
| Kroxylicious proxy | 1.4% |
| GC | 0.1% |

The proxy is overwhelmingly I/O-bound. 59% of CPU is in `send`/`recv` syscalls — the inherent cost of maintaining two TCP connections (client→proxy, proxy→Kafka) with data flowing through the JVM. The proxy itself accounts for 1.4%. It really is a TCP relay with protocol awareness.
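The "TCP relay" shape described here can be sketched in a few lines of Java. This is a deliberately naive, illustrative sketch of why every relayed byte costs a recv/send pair, not the actual Kroxylicious implementation:

```java
import java.io.InputStream;
import java.io.OutputStream;

public class RelaySketch {
    // Pump bytes from one stream to the other; on real sockets each loop
    // iteration is roughly one recv plus one send syscall, which is where
    // the flamegraph's dominant CPU share comes from.
    static long pump(InputStream in, OutputStream out) throws Exception {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) { // recv
            out.write(buf, 0, n);          // send
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        // In-memory stand-ins for the client and broker sockets.
        var in = new java.io.ByteArrayInputStream("payload".getBytes());
        var out = new java.io.ByteArrayOutputStream();
        System.out.println(pump(in, out)); // 7
    }
}
```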
Member


I wonder how much that's down to the decode predicate thing -- basically we know the filter chain, and what each filter in it wants to intercept, and I think we avoid doing the request/response decoding when we know nothing is interested. That was code that was in there from the beginning, but I don't actually know how relevant it is -- maybe some of the internal filters mean we're decoding requests and response always, in which case 1.4% is impressive. Or maybe we're acting more like a L4 proxy most of the time, in which case 1.4% is not quite as impressive.

Member Author


Great question — this is actually a stronger story than the original prose suggested. The default infrastructure filters (BrokerAddressFilter, TopicNameCacheFilter, ApiVersionsIntersect) are doing genuine L7 work: metadata, FindCoordinator, and API version exchanges are fully decoded for address rewriting and version negotiation. But the high-volume produce/consume traffic hits the decode predicate and passes through without full deserialisation. So the proxy is selectively L7 — real protocol awareness where it needs it, L4-like passthrough on the hot path. The 1.4% is the cost of that design, and it validates it. Updating the prose to make this explicit.
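The selective-L7 behaviour described in this reply can be sketched as follows. The interface and names here are hypothetical, chosen only to illustrate the decode-predicate idea, not the actual Kroxylicious API:

```java
import java.util.List;
import java.util.Set;

public class DecodePredicateSketch {
    // Hypothetical filter interface: each filter declares which Kafka
    // ApiKeys it wants to intercept.
    interface Filter {
        Set<Short> interceptedApiKeys();
    }

    // Decode a frame only if some filter in the chain cares about its ApiKey;
    // otherwise forward the raw bytes (the L4-like hot path).
    static boolean shouldDecode(List<Filter> chain, short apiKey) {
        return chain.stream().anyMatch(f -> f.interceptedApiKeys().contains(apiKey));
    }

    public static void main(String[] args) {
        Filter addressRewriter = () -> Set.of((short) 3); // METADATA only
        List<Filter> chain = List.of(addressRewriter);
        System.out.println(shouldDecode(chain, (short) 3)); // true: METADATA is decoded
        System.out.println(shouldDecode(chain, (short) 0)); // false: PRODUCE passes through
    }
}
```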


The direct crypto cost is 13.3% (11.3% AES-GCM + 2.0% Kroxylicious filter logic). But encryption adds indirect costs too:

- **Buffer management (+5.8%)**: encrypted records need to be read into buffers, encrypted, and written to new buffers — more allocation, more copying
Member


Did we ever figure out how to reuse the buffers more? I think that was a TODO at one point.

Member Author


Correct — the TODO was never addressed. A BufferPool class existed at one point but was deleted as unused in early 2024. Cipher instances are still created fresh per operation. These remain genuine open optimisation opportunities.
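As a sketch of the kind of reuse the TODO pointed at: a `Cipher` instance can be kept per thread and re-initialised with a fresh IV per operation, instead of being created fresh each time. This is an illustrative use of the standard `javax.crypto` API under that assumption, not Kroxylicious code:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;

public class CipherReuseSketch {
    // Cipher.getInstance is comparatively expensive; keep one per thread
    // and re-init it for each operation with a fresh IV.
    private static final ThreadLocal<Cipher> GCM = ThreadLocal.withInitial(() -> {
        try {
            return Cipher.getInstance("AES/GCM/NoPadding");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    });

    static byte[] encrypt(SecretKey key, byte[] iv, byte[] plaintext) throws Exception {
        Cipher cipher = GCM.get(); // reused across calls on this thread
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        return cipher.doFinal(plaintext);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);
        SecretKey key = kg.generateKey();
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        byte[] ct = encrypt(key, iv, "hello".getBytes());
        System.out.println(ct.length); // 5-byte plaintext + 16-byte GCM tag = 21
    }
}
```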


Fix: `kubectl uncordon worker0 worker1 worker2`. Once uncordoned, pods scheduled, operators recovered, and the upgrade completed.

Not a Kroxylicious bug, but it cost several hours of cluster recovery time during an active benchmark campaign. Worth knowing about if you're running OCP on Fyre.
Member


Given Fyre is an IBM internal thing, this is not terribly useful to all readers. Can we generalise it to being about OpenShift more generally?

SamBarker added 3 commits May 15, 2026 16:28
…aming

- Shift publication dates to May 21 and May 28
- Replace speculative per-connection ceiling explanation with empirical
  finding: encryption throughput ceiling scales linearly with CPU budget
  (validated at 1000m, 2000m, 4000m)
- Add sizing formula: CPU (mc) = 20 × produce_MB_per_s, with worked example
- Add RF=3 masking caveat: initial 1-topic sweeps conflated Kafka replication
  ceiling with proxy CPU ceiling; coefficient derived from RF=1 multi-topic
  workloads
- Post 2: add full investigation narrative — workload isolation approach,
  coefficient derivation, 4-core confirmation, and 2-core prediction/validation
- Drop stale "future work" items that are now complete

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
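The sizing formula in the commit message above amounts to a one-line calculation. A quick sketch with a hypothetical helper name:

```java
public class SizingSketch {
    // Rule of thumb from the benchmark findings: the encryption throughput
    // ceiling scales linearly with CPU budget, so
    //   CPU (millicores) = 20 × produce throughput (MB/s).
    static long millicoresFor(double produceMBPerS) {
        return Math.round(20 * produceMBPerS);
    }

    public static void main(String[] args) {
        // Worked example: 50 MB/s of produce traffic needs ~1000m (one core).
        System.out.println(millicoresFor(50)); // 1000
    }
}
```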
The proxy is selectively L7: default infrastructure filters do genuine
Kafka protocol work (address rewriting, API version negotiation, metadata
caching) while high-volume produce/consume traffic bypasses full
deserialisation via the decode predicate. The 1.4% proxy CPU share
validates this design, not just reflects it.

Also drop the Fyre cluster upgrade section — OCP-internal incident
with no relevance to readers.

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
- Warm up test environment intro: realistic deployment framing
- Add conversational lead-in to sizing guidance in both documents
- Improve caveats opener in Post 1
- Add caveats section to performance page (RF=3 masking, message size,
  horizontal scaling)

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
