diff --git a/_posts/2026-05-21-benchmarking-the-proxy.md b/_posts/2026-05-21-benchmarking-the-proxy.md new file mode 100644 index 00000000..c9602299 --- /dev/null +++ b/_posts/2026-05-21-benchmarking-the-proxy.md @@ -0,0 +1,156 @@ +--- +layout: post +title: "Does my proxy look big in this cluster?" +date: 2026-05-21 00:00:00 +0000 +author: "Sam Barker" +author_url: "https://github.com/SamBarker" +categories: benchmarking performance +--- + +All good benchmarking stories start with a hunch. Mine was that Kroxylicious is cheap to run — I'd stake my career on it, in fact — but it turns out that "trust me, I wrote it" is not a widely accepted unit of measurement. People want proof. Sensibly. + +There's a practical question underneath the hunch too. The most common thing operators ask us is some variation of: "How many cores does the proxy need?" Which is really just "is this thing going to slow down my Kafka?" in a polite engineering hat. We'd been giving the classic answer: "it depends on your workload and traffic patterns, so you'll need to test in your environment." Which is true. And also deeply unsatisfying for everyone involved, including us. + +So we stopped saying "it depends", and got off the fence: we built something you can run **yourselves** on your own infrastructure with your own workload, and measured it. Here are some representative numbers from ours. + +## What we measured + +We ran three scenarios against the same Apache Kafka® cluster on the same hardware: + +- **Baseline** — producers and consumers talking directly to Kafka, no proxy in the path +- **Passthrough proxy** — traffic routed through Kroxylicious with no filter chain configured +- **Record encryption** — traffic through Kroxylicious with AES-256-GCM record encryption enabled, using HashiCorp Vault as the KMS + +We used [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) rather than Kafka's own `kafka-producer-perf-test`. OMB is an industry-standard tool that coordinates producers and consumers together, measures end-to-end latency (not just publish latency), and produces structured JSON that makes comparison straightforward. More on why we built a whole harness around it in the [companion engineering post]({% post_url 2026-05-28-benchmarking-the-proxy-under-the-hood %}). + +## Test environment + +No, we didn't run this on a laptop — it's a realistic deployment: a 6-node OpenShift cluster on Fyre, IBM's internal cloud platform — a controlled environment. Kroxylicious ran as a single proxy pod with a 1000m CPU limit. + +| Component | Details | +|-----------|---------| +| CPU | AMD EPYC-Rome, 2 GHz | +| Cluster | 6-node OpenShift, RHCOS 9.6 | +| Kafka | 3-broker Strimzi cluster, replication factor 3 | +| Kroxylicious | 0.20.0, single proxy pod, 1000m CPU limit | +| KMS | HashiCorp Vault (in-cluster) | + +The primary workload used 1 topic, 1 partition, 1 KB messages. We chose single-partition deliberately: it concentrates all traffic on one broker, so you hit ceilings quickly and any proxy overhead is easy to isolate. We also ran 10-topic and 100-topic workloads to make sure the results hold when load is spread more realistically across brokers. + +One important caveat: this Kafka cluster is deliberately untuned. We're not trying to squeeze every message-per-second out of Kafka — we're using it as a fixed baseline to measure what the proxy adds on top. Kafka experts will find obvious headroom to improve on our baseline numbers; that's fine and expected. 
The deltas are what matter here, not the absolutes. + +--- + +## The passthrough proxy: negligible overhead + +Good news first. The proxy itself — with no filter chain, just routing traffic — adds almost nothing. + +**10 topics, 1 KB messages (5,000 msg/sec per topic):** + +| Metric | Baseline | Proxy | Delta | +|--------|----------|-------|-------| +| Publish latency avg | 2.62 ms | 2.79 ms | +0.17 ms (+7%) | +| Publish latency p99 | 14.09 ms | 15.17 ms | +1.08 ms (+8%) | +| E2E latency avg | 94.87 ms | 95.34 ms | +0.47 ms (+0.5%) | +| E2E latency p99 | 185.00 ms | 186.00 ms | +1.00 ms (+0.5%) | +| Publish rate | 5,002 msg/s | 5,002 msg/s | 0 | + +**100 topics, 1 KB messages (500 msg/sec per topic):** + +| Metric | Baseline | Proxy | Delta | +|--------|----------|-------|-------| +| Publish latency avg | 2.66 ms | 2.82 ms | +0.16 ms (+6%) | +| Publish latency p99 | 5.54 ms | 6.07 ms | +0.53 ms (+10%) | +| E2E latency avg | 253.16 ms | 253.76 ms | +0.60 ms (+0.2%) | +| E2E latency p99 | 499.00 ms | 499.00 ms | 0 | +| Publish rate | 500 msg/s | 500 msg/s | 0 | + +**The headline: ~0.2 ms additional average publish latency. Throughput is unaffected.** + +What did I take away from this entirely unsurprising result? Not much, honestly — without filters the proxy is little more than a couple of hops through the TCP stack, but we now have data rather than a hunch. +The end-to-end (E2E) p99 figure is dominated by the Kafka consumer fetch timeouts, as it should be. That said, it is reassuring to have a sub-ms impact on the p99. + +--- + +## Record encryption: now we're doing real work + +Ok, so let's make the proxy smarter — make it do something people actually care about! [Record encryption](https://kroxylicious.io/documentation/0.20.0/html/record-encryption-guide) uses AES-256-GCM to encrypt each record passing through the proxy. AES-256-GCM is going to ask the CPU to work relatively hard on its own, but it's also going to push the proxy to understand each record it receives, unpack it, copy it, encrypt it, and re-pack it before sending it on to the broker. With all that work going on we expect some impact to latency and throughput. To answer our original question we need to identify two things: the latency when everything is going smoothly, and the reduction in throughput all this work causes. Monitoring latency once we go past the throughput inflection point isn't very helpful — it's dominated by the throughput limits and their erratic impacts on the latency of individual requests (a big hello to batching and buffering effects). + +### Latency at sub-saturation rates + +A quick note on percentiles for anyone not steeped in performance benchmarking: p99 latency is the value that 99% of requests complete within — meaning 1 in 100 requests takes longer. Averages flatter; the p99 is what your slowest clients actually experience, and it's usually the number that matters. + +So we know encryption is doing a lot of work, but to find out the real impact we need to compare it to a plain Kafka cluster (and yes, people do run Kroxylicious without filters — TLS termination, stable client endpoints, virtual clusters — but that's a different post). The table below tells us that above a certain inflection point the numbers get really, really noisy — especially in the p99 range. 
+ +**1 topic, 1 KB messages — baseline vs encryption:** + +| Rate | Metric | Baseline | Encryption | Delta | +|------|--------|----------|------------|-------| +| 34,000 msg/s | Publish avg | 8.00 ms | 8.19 ms | +0.19 ms (+2%) | +| 34,000 msg/s | Publish p99 | 48.65 ms | 64.01 ms | +15.35 ms (+32%) | +| 36,000 msg/s | Publish avg | 9.38 ms | 10.46 ms | +1.08 ms (+12%) | +| 36,000 msg/s | Publish p99 | 63.92 ms | 88.98 ms | +25.06 ms (+39%) | +| 37,200 msg/s | Publish avg | 9.12 ms | 12.19 ms | +3.07 ms (+34%) | +| 37,200 msg/s | Publish p99 | 74.88 ms | 113.15 ms | +38.27 ms (+51%) | + +So we know that somewhere above 34k we're hitting a limit. Time to hunt out exactly where — enter the rate-sweep. + +### Throughput ceiling + +A rate-sweep is exactly what it sounds like: pick a starting rate, let OMB run long enough to get a stable measurement, then step up by a fixed percentage and repeat until the system can't keep up. We defined "can't keep up" as the sustained throughput dropping by more than 5% below the target rate — at that point, something has saturated. + +We started at 34k (right where the latency table started getting interesting) and stepped up in 5% increments. The results: + +- **Baseline**: sustained up to ~50,000–52,000 msg/sec (the ceiling we observed on our test cluster) +- **Encryption**: sustained up to **~37,200 msg/sec**, then started intermittently saturating +- **Cost: approximately 26% fewer messages per second per partition** + +The transition wasn't a clean cliff edge — between 37,600 and 42,000 msg/sec the proxy alternated between sustaining and saturating. That pattern is characteristic of running right at a limit: it's not that it suddenly falls over, it's that small fluctuations (GC pauses, scheduling jitter) are enough to tip it either way. Above ~39,000 msg/sec, p99 latency regularly spiked above 1,700 ms. Stay below 37k and you're fine. Creep above it and you'll notice. The numbers are not absolute — they are just what we measured on our cluster; your mileage **will vary**. + +### The ceiling scales with CPU budget + +The fact the proxy is low latency didn't surprise me, but this did — and it matters when we think about scaling. We maxed out a single connection, but that didn't mean we'd maxed out the proxy. + +Once we had the single-producer encryption ceiling at ~37k msg/sec, the obvious question was: is that the limit for the whole proxy pod, or just for one connection? We ran the same test with 4 producers. With 4 connections the proxy sustained well past the single-producer ceiling — proxy CPU had headroom to spare, and Kafka's partition became the bottleneck first. + +Going further: we swept the same workload at 1000m, 2000m, and 4000m CPU. The throughput ceiling scaled linearly with the CPU budget — 1000m at ~40k msg/sec, 2000m at ~80k, 4000m at ~160k. The proxy isn't hitting a fixed architectural wall; it's hitting a CPU budget wall, and that wall moves when you give it more CPU. + +**The practical implication**: the throughput ceiling is not a fixed number — it's a function of the CPU you allocate. Set `requests` equal to `limits` in your pod spec; this makes the CPU budget deterministic and the ceiling predictable. The companion engineering post has the full story of how we found this, including the workload design choices needed to isolate proxy CPU from Kafka's own limits. + +--- + +## Sizing guidance + +Numbers without guidance aren't very useful, so here's how to translate these results into pod specs. 
+ +**Passthrough proxy**: size your Kafka cluster as you normally would. The proxy won't be the bottleneck — but if you want to verify that on your own hardware, the rate sweep is exactly the tool for it. Run the baseline and passthrough scenarios back-to-back and you'll have your own numbers. + +**With record encryption:** + +1. **Throughput budget**: encryption imposes a CPU-driven throughput ceiling. As a planning formula: + + > **`proxy CPU (millicores) = 20 × produce throughput (MB/s)`** + + Add ×1.3 headroom for GC pauses and burst. This assumes matched consumer load (1:1 produce:consume) and was measured on AMD EPYC-Rome 2 GHz with AES-NI — calibrate on your own hardware using the rate sweep. + + Worked example: 100k msg/s at 1 KB = 100 MB/s produce → 100 × 20 = 2000m, plus headroom → ~2600m (~2.6 cores). + +2. **Latency budget**: well below saturation, expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99. The overhead scales with how hard you're pushing — give yourself headroom and you'll barely notice it. + +3. **Scaling**: set `requests` equal to `limits` in your pod spec — this makes the CPU budget deterministic, which makes the throughput ceiling predictable. To increase throughput, raise the CPU limit. For redundancy, add proxy pods. + +4. **KMS overhead**: DEK caching means Vault isn't on the hot path for every record. Our tests triggered only 5–19 DEK generation calls per benchmark run. The KMS is not the thing to worry about. + +--- + +## Caveats and next steps + +These are real results from real hardware, but they don't tell a story for your workload. A few things worth knowing before you put these numbers in a slide deck: + +- **Message size**: all results use 1 KB messages. The coefficient is message-size-dependent — encryption overhead as a percentage is likely lower for larger messages. +- **Replication factor**: the 1-topic rate sweep ran at RF=3. At that replication factor, Kafka's ISR replication traffic creates a per-partition ceiling that sits close to where proxy CPU also saturates — the two limits are entangled in those results. The sizing coefficient was derived from RF=1 multi-topic workloads specifically to isolate proxy CPU. The [companion engineering post]({% post_url 2026-05-28-benchmarking-the-proxy-under-the-hood %}) has that detail. +- **Horizontal scaling**: linear scaling has been validated across CPU allocations on a single pod; multi-pod horizontal scaling hasn't been measured but is expected to follow the same coefficient. + +For the engineering story — why we built a custom harness on top of OMB, what the CPU flamegraphs actually show, and the bugs we found in our own tooling along the way — that's in the [companion post]({% post_url 2026-05-28-benchmarking-the-proxy-under-the-hood %}). + +The full benchmark suite, quickstart guide, and sizing reference are in `kroxylicious-openmessaging-benchmarks/` in the [main Kroxylicious repository](https://github.com/kroxylicious/kroxylicious). diff --git a/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md new file mode 100644 index 00000000..40f2b9ce --- /dev/null +++ b/_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md @@ -0,0 +1,249 @@ +--- +layout: post +title: "How hard can it be??? Maxing out a Kroxylicious instance" +date: 2026-05-28 00:00:00 +0000 +author: "Sam Barker" +author_url: "https://github.com/SamBarker" +categories: benchmarking performance engineering +--- + +How hard can it be? 
We started with a laptop, a codebase, and a lot of confidence it was fast. We ended up with a benchmark harness, a six-node cluster, and a much more nuanced answer.

Harder than expected. More interesting too.

We've already given everyone [the numbers]({% post_url 2026-05-21-benchmarking-the-proxy %}) in a bland but slide-worthy way. This one is the engineering story: how we built the harness, what the flamegraphs actually show, the workload design choices that changed the answers, and the bugs we found in our own tooling.

## Why not Kafka's own tools?

Kafka ships with `kafka-producer-perf-test` and `kafka-consumer-perf-test`. We'd used them before. The problems:

- **Too noisy**: individual runs produced widely varying results depending on JVM warm-up, scheduling jitter, and GC behaviour. Results were hard to trust and harder to compare across scenarios.
- **Producer-only view**: `kafka-producer-perf-test` gives you publish latency, but nothing about the consumer side. You can't see end-to-end latency — which is something operators actually care about.
- **Awkward to sweep**: running parametric rate sweeps requires scripting around these tools, and comparing results across scenarios requires manual work.
- **Coordinated omission**: under load, `kafka-producer-perf-test` only measures the requests it actually sends! So when things start loading up and applying back pressure, the send rate drops and the latency keeps looking nice and healthy. Only it's not healthy in reality: things are queuing up in your producer.

And critically, it's never heard of Kroxylicious... You have though, you're here!

[OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark) is a better fit. It's an industry-standard tool used by Confluent, the Pulsar team, and others for their published performance comparisons, so who am I to argue? OMB coordinates producers and consumers across separate worker pods, runs a configurable warmup phase before taking measurements, takes its latency tracking seriously by accounting for coordinated omission, and outputs structured JSON that's straightforward to process programmatically. What's not to like?

Using OMB also means our methodology is directly comparable to other published Kafka benchmarks. The numbers themselves aren't comparable, of course: it's not the same hardware, network conditions, or phase of the moon.

## What we built on top of OMB

So we just fire up OMB and get some numbers, right? Errr no. OMB just does the measurement part. I work really hard at being lazy: I hate clicking things with a mouse, and I knew these tests needed to be repeatable. So we scripted deployment (of all the things), teardown (for isolation), diagnostic collection (WHAT BROKE NOW??) and, last but not least, result processing (what does this wall of JSON mean?).

So now all of that lives in [`kroxylicious-openmessaging-benchmarks`](https://github.com/kroxylicious/kroxylicious/tree/main/kroxylicious-openmessaging-benchmarks) in the main tree (monorepo FTW).

### Helm chart

A Helm chart (`helm/kroxylicious-benchmark/`) deploys the full benchmark stack into Kubernetes (a minimal install sketch follows the list):

- OMB coordinator and worker pods
- A Strimzi Kafka cluster (when you're deploying Kafka on K8s, what else are you going to use? Answers to /dev/null)
- The Kroxylicious operator
- The Kroxylicious proxy
- HashiCorp Vault (for the KMS in the encryption scenario). Importantly, if you have your own KMS (and you will run this yourself for your workload, right?!), you can plug that in instead.
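The orchestration scripts described below wrap all of this, but conceptually a single scenario deployment boils down to installing the chart with a scenario override. A minimal sketch, assuming a local checkout of the repository; the release name and namespace are illustrative, and `QUICKSTART.md` in the benchmarks directory documents the supported workflow:

```bash
# Illustrative only -- scripts/run-benchmark.sh does this (and more) for you.
cd kroxylicious-openmessaging-benchmarks

# Install the benchmark stack, layering a scenario override on the base chart.
# The release name and namespace are arbitrary choices for this sketch.
helm install kroxy-bench ./helm/kroxylicious-benchmark \
  --namespace kroxy-bench --create-namespace \
  -f ./helm/kroxylicious-benchmark/scenarios/encryption-values.yaml
```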
Scenario-specific configuration lives in `helm/kroxylicious-benchmark/scenarios/` as YAML overrides:

| Scenario file | What it deploys |
|---------------|-----------------|
| `baseline-values.yaml` | Direct Kafka, no proxy |
| `proxy-no-filters-values.yaml` | Proxy with no user filters |
| `encryption-values.yaml` | Proxy with AES-256-GCM encryption and Vault |
| `rate-sweep-values.yaml` | Extended run profiles for sweep experiments |

Separating scenarios into override files means the base chart stays stable while each scenario adds only what it needs. Switching between scenarios doesn't require touching the chart itself.

### Orchestration scripts

**`scripts/run-benchmark.sh`** orchestrates a single benchmark run:

1. Deploys the Helm chart for the requested scenario
2. Waits for the OMB Job to complete
3. Collects results: OMB JSON, a JFR recording, an async-profiler flamegraph, and a Prometheus metrics snapshot
4. Tears down

The `--skip-deploy` flag lets you re-run a probe against an already-deployed cluster — essential for rate sweeps where you want to deploy once and probe many times.

**`scripts/rate-sweep.sh`** wraps `run-benchmark.sh` to drive parametric sweeps. It takes `--min-rate`, `--max-rate`, `--step-percent`, and one or more `--scenario` flags. The first probe deploys; subsequent probes use `--skip-deploy`.

### Result processing

Three JBang-runnable Java programs handle result analysis:

- **`RunMetadata.java`**: generates `run-metadata.json` alongside each result. Captures git commit, timestamp, cluster node specs (architecture, CPU, RAM), and — on OpenShift — NIC speed read from the host via the MachineConfigDaemon pod.
- **`ResultComparator.java`**: reads two scenario result directories and produces a markdown comparison table.
- **`ResultSummariser.java`**: reads a rate-sweep result directory and prints a saturation table: target rate, achieved rate, p99, and whether the probe saturated.

Getting NIC speed from a Kubernetes node turned out to be non-trivial — you need host filesystem access to read `/sys/class/net/<interface>/speed`. On OpenShift, the MachineConfigDaemon pods mount the host at `/rootfs`, so we `kubectl exec` into the MCD pod and `chroot /rootfs` to read the speed file without creating any new privileged resources.

## Workload design

The primary workload used **1 topic, 1 partition, 1 KB messages**. This is deliberate. Concentrating all traffic on a single partition pushes things to their limits at lower absolute rates, which makes the proxy overhead easier to isolate: when the system saturates, it's the proxy, not a spread-out broker fleet.

Multi-topic workloads (10 topics, 100 topics) were used to verify that the overhead characteristics hold when load is distributed. At 5,000 msg/sec per topic across 10 topics, every topic-partition pair is well below any saturation point — so what you're measuring is steady-state overhead, not ceiling behaviour.

For throughput ceiling testing we used rate sweeps: start at 34,000 msg/sec, step up by 5% until the achieved rate drops below 95% of target. The knee of that curve is the saturation point.

## The flamegraph: where the CPU actually goes

We captured CPU profiles using async-profiler attached to the proxy JVM via `jcmd JVMTI.agent_load` during the steady-state measurement phase at 36,000 msg/sec. These are self-time percentages — where the CPU is actually spending cycles, not inclusive call-tree time.
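If you want to reproduce that capture, the attach step looks roughly like the sketch below. It is illustrative rather than a copy of our scripts: the PID lookup, the library path, and the options string are assumptions you would adjust for your own environment.

```bash
# Illustrative sketch only -- paths, PID lookup, and options are assumptions,
# not the exact invocation our benchmark scripts use.
PROXY_PID=$(jcmd | grep -i kroxylicious | awk '{print $1}')

# Load the async-profiler agent into the running JVM and start CPU sampling;
# the flame graph is written to the given file when profiling stops.
jcmd "$PROXY_PID" JVMTI.agent_load /opt/async-profiler/lib/libasyncProfiler.so \
  start,event=cpu,file=/tmp/proxy-cpu-flamegraph.html

# ...wait out the steady-state measurement window, then stop and flush:
jcmd "$PROXY_PID" JVMTI.agent_load /opt/async-profiler/lib/libasyncProfiler.so stop
```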
+ +The flamegraphs below are fully interactive: hover over a frame to see its name and percentage, click to zoom in, Ctrl+F to search. Scroll within the frame to explore the full stack depth. + +### No-filter proxy + +
+ +
CPU flamegraph — passthrough proxy (no filters), 36,000 msg/sec, 1 topic, 1 KB messages. Open full screen ↗
+
+ +| Category | CPU share | +|----------|-----------| +| Syscalls (send/recv) | 59.2% | +| Native/VM | 16.7% | +| Netty I/O | 10.5% | +| Memory operations | 4.7% | +| JDK libraries | 2.9% | +| Kroxylicious proxy | 1.4% | +| GC | 0.1% | + +The proxy is overwhelmingly I/O-bound. 59% of CPU is in `send`/`recv` syscalls — the inherent cost of maintaining two TCP connections (client→proxy, proxy→Kafka) with data flowing through the JVM. The proxy itself accounts for 1.4% — and understanding *why* that number is so small is the interesting part. + +Kroxylicious decodes Kafka RPCs selectively: each filter declares which API keys it cares about, and the proxy only deserialises messages that at least one filter needs. Even in the no-filter scenario, the default infrastructure filters are doing genuine L7 work — broker address rewriting, API version negotiation, topic name caching — which means metadata, FindCoordinator, and API version exchanges are fully decoded. But the high-volume produce and consume traffic? The decode predicate skips full deserialisation for those entirely, passing them through at close to L4 speed. + +The 1.4% is the cost of a proxy that is *selectively* L7: doing real Kafka protocol work where it matters, and treating the hot path like a TCP relay where it doesn't. That's not a side-effect — it's what the decode predicate design is for, and this flamegraph validates it. + +### Encryption proxy (same 36,000 msg/sec rate) + +
+ +
CPU flamegraph — encryption proxy (AES-256-GCM), 36,000 msg/sec, 1 topic, 1 KB messages. Open full screen ↗
+
+ +| Category | No-filters | Encryption | Delta | +|----------|-----------|------------|-------| +| Syscalls (send/recv) | 59.2% | 23.5% | −35.7%* | +| Native/VM | 16.7% | 18.9% | +2.2% | +| JCA/AES-GCM crypto | 0.0% | 11.3% | **+11.3%** | +| Memory operations | 4.7% | 10.4% | **+5.8%** | +| JDK libraries | 2.9% | 9.3% | **+6.4%** | +| GC / JVM housekeeping | 0.1% | 5.0% | **+4.9%** | +| Netty I/O | 10.5% | 5.1% | −5.4%* | +| Kafka protocol re-encoding | 0.4% | 3.5% | **+3.1%** | +| Kroxylicious encryption filter | 0.0% | 2.0% | **+2.0%** | + +*\* Send/recv and Netty I/O appear to shrink as a percentage share because encryption adds CPU work that grows the total pie. The absolute I/O cost is similar in both scenarios.* + +The direct crypto cost is 13.3% (11.3% AES-GCM + 2.0% Kroxylicious filter logic). But encryption adds indirect costs too: + +- **Buffer management (+5.8%)**: encrypted records need to be read into buffers, encrypted, and written to new buffers — more allocation, more copying +- **GC pressure (+4.9%)**: more short-lived objects from encryption buffers and crypto operations +- **JDK security infrastructure (+6.4%)**: security provider lookups, key spec handling, parameter generation +- **Kafka protocol re-encoding (+3.1%)**: encrypted records are different sizes and must be re-serialised into Kafka protocol format + +Total additional CPU: ~33%. This aligns closely with the ~26% throughput reduction. + +If you wanted to optimise this, the highest-impact areas would be: reducing buffer copies (encrypt in-place or use composite buffers), pooling encryption buffers to reduce GC pressure, and caching `Cipher` instances to reduce per-record JDK security overhead. + +## Following the ceiling + +### A problem with the workload + +The single-producer rate sweep hit a ceiling at ~37k msg/sec. Before drawing conclusions, we had to ask whether that was actually a proxy CPU ceiling — or something else. + +Our initial sweeps ran with replication factor 3, the standard production default. At RF=3, every message the Kafka leader receives goes out to 2 follower replicas. With 1 KB messages and 37k msg/sec, that's ~37 MB/s inbound to the leader and ~111 MB/s total replication traffic outbound — and the Fyre cluster nodes had 10 GbE NICs, so the ceiling wasn't the NIC. But RF=3 does create a real per-partition I/O ceiling on the Kafka leader, and it sits right around where we were measuring. + +The fix: RF=1, 10-topic workload. Dropping to RF=1 removes replication overhead; spreading across 10 partitions distributes load so no single partition hits its ceiling. We validated the fix with the passthrough proxy scenario: at 160k msg/sec total (16k per topic), proxy-no-filters matched baseline — Kafka was not the bottleneck. The sweep scaled to 640k msg/sec before hitting some uninvestigated ceiling well above where encryption constrains anything. + +### Is the encryption ceiling per-pod or per-connection? + +With a clean workload that isolates proxy CPU, we re-examined the ~37k figure. Running the same workload with 4 producers: proxy CPU had headroom to spare, and Kafka's partition became the bottleneck first. So the single-producer ceiling is not the pod ceiling. + +### The coefficient + +With the workload isolation in place, we swept encryption across CPU allocations. 
The throughput ceiling scaled linearly: + +| CPU limit | Encryption ceiling | +|-----------|-------------------| +| 1000m | ~40k msg/sec | +| 2000m | ~80k msg/sec | +| 4000m | ~160k msg/sec | + +From the 4-core sweep: safe at 160k msg/sec (p99: 447 ms), catastrophic at 320k msg/sec (p99: 537,000 ms). The saturation point is predictably between those two steps. + +Deriving the coefficient: at 4000m and 160k msg/sec with 1 KB messages — + +``` +160k msg/s × 1 KB = 160 MB/s produce throughput +With matched consumer load: 160 MB/s encrypt + 160 MB/s decrypt +→ 4000 mc / 320 MB/s bidirectional ≈ 12–13 mc per MB/s bidirectional +→ equivalently: 4000 mc / 160 MB/s produce ≈ 25 mc per MB/s produce +``` + +We measured the coefficient at mid-utilisation (80k msg/sec, 2000m) at ~10 mc/MB/s bidirectional — lower, because of fixed per-connection overhead that's amortised at higher load. The operator-facing formula uses 20 mc/MB/s of produce throughput (= 10 bidirectional × 2 for produce+consume), which sits between mid-utilisation and saturation and provides inherent conservatism. + +One thing we observed: the proxy had 4 Netty event loop threads regardless of CPU limit. The throughput scaling isn't explained by thread count changing — it doesn't. What changes is the CPU time budget available to those threads. The detailed relationship between CPU limit, thread scheduling, and throughput ceiling is more subtle than a simple thread-count model; what we can say empirically is that throughput scales linearly with the CPU limit, and the formula holds. + +### The prediction + +Rather than just reporting the 4-core result, we used the 1-core ceiling to make a falsifiable prediction: if the ceiling scales linearly, a 2-core pod should saturate at ~80k msg/sec. + +The 2-core sweep: + +| Rate | p99 | Verdict | +|------|-----|---------| +| 40k msg/sec | 626 ms | Comfortable | +| 80k msg/sec | 1,660 ms | Elevated — right at predicted ceiling | +| 160k msg/sec | 175,277 ms | Catastrophic | + +The prediction held. The ceiling is real, linear, and predictable — which is exactly what you want from a sizing model. + +Setting `requests` equal to `limits` makes this predictability practical: a pod that can burst above its CPU limit introduces headroom uncertainty that breaks the model. With `requests == limits`, the CPU budget is fixed, the ceiling is fixed, and your capacity planning can rely on the coefficient. + +Worth noting: with RF=3 in production, every message the Kafka leader receives goes out to 2 follower replicas. At 50k msg/sec with 1 KB messages that's ~1.2 Gbps outbound from the leader alone — confirming why the Fyre cluster nodes need 10 GbE NICs, and why the replication ceiling matters for the benchmarking workload design. + +## Bugs we found in our own tooling + +During the 4-producer rate sweep, we noticed that JFR recordings and flamegraphs from probes 2 onwards all looked identical to probe 1. They were stale copies. Three bugs. + +**Bug 1 — wrong JFR settings**: When restarting JFR for a subsequent probe in `--skip-deploy` mode, the script was using `settings=default` instead of `settings=profile`. The default profile omits I/O events including `jdk.NetworkUtilization` — the event we were using to read network throughput from JFR. Fixed to always use `settings=profile`. + +**Bug 2 — async-profiler not restarted**: The restart block restarted JFR but never restarted async-profiler. All probes after the first had a flamegraph from probe 1 only. 
+ +**Bug 3 — wrong guard variable**: The async-profiler restart was guarded by checking `AGENT_LIB` (the path to the native library). `AGENT_LIB` is always set when the library exists on the image — even when profiling was intentionally skipped on clusters where the `Unconfined` seccomp profile couldn't be applied. The correct guard is `ASYNC_PROFILER_FLAGS`, which is only set when the seccomp patch was successfully applied. + +Spotting these required noticing that two different probe flamegraphs were pixel-for-pixel identical, then working back through the restart logic. The lesson: when reusing a deployed cluster across multiple probes, validate that diagnostic collection is actually running fresh for each one. + +## Run it yourself + +Everything is in `kroxylicious-openmessaging-benchmarks/` in the [main Kroxylicious repository](https://github.com/kroxylicious/kroxylicious). See `QUICKSTART.md` for step-by-step instructions. You'll need a Kubernetes or OpenShift cluster, the Kroxylicious operator installed, and Helm 3. Minikube works for local runs — the quickstart covers recommended CPU and memory settings. + +```bash +# Run a baseline vs encryption comparison +./scripts/run-benchmark.sh --scenario baseline +./scripts/run-benchmark.sh --scenario encryption + +# Compare results +jbang src/main/java/io/kroxylicious/benchmarks/results/ResultComparator.java \ + results/baseline results/encryption +``` + +## What's still open + +The coefficient is validated at 1, 2, and 4 cores for 1 KB messages. Known gaps: + +- **Message size variation**: larger messages should show lower overhead as a percentage; smaller messages may show higher. 1 KB is a reasonable middle ground but not the whole picture. +- **Horizontal scaling**: multiple proxy pods haven't been measured; linear scaling is expected but not confirmed. +- **Multi-pass sweeps**: each rate point was measured once. Running each probe three times and taking the median would give tighter bounds in the saturation transition zone. + +The operator-facing sizing reference and all the key tables are in `SIZING-GUIDE.md` in the benchmarks directory. diff --git a/assets/blog/flamegraphs/benchmarking-the-proxy/encryption-cpu-profile-36k.html b/assets/blog/flamegraphs/benchmarking-the-proxy/encryption-cpu-profile-36k.html new file mode 100644 index 00000000..89215a71 --- /dev/null +++ b/assets/blog/flamegraphs/benchmarking-the-proxy/encryption-cpu-profile-36k.html @@ -0,0 +1,28030 @@ + + + + + + + +

[interactive async-profiler CPU flamegraph asset "encryption/1topic-1kb_2026-04-20T11:38:12Z"; generated viewer HTML/JS omitted]
+ diff --git a/assets/blog/flamegraphs/benchmarking-the-proxy/proxy-no-filters-cpu-profile.html b/assets/blog/flamegraphs/benchmarking-the-proxy/proxy-no-filters-cpu-profile.html new file mode 100644 index 00000000..c921470d --- /dev/null +++ b/assets/blog/flamegraphs/benchmarking-the-proxy/proxy-no-filters-cpu-profile.html @@ -0,0 +1,15382 @@ + + + + + + + +

[interactive async-profiler CPU flamegraph asset "proxy-no-filters/1topic-1kb_2026-04-15T21:44:15Z"; generated viewer HTML/JS omitted]
+ diff --git a/overview.markdown b/overview.markdown index 8af9ae22..b6b42abb 100644 --- a/overview.markdown +++ b/overview.markdown @@ -66,5 +66,10 @@ Kroxylicious is careful to decode only the Kafka RPCs that the filters actually interested in a particular RPC, its bytes will pass straight through Kroxylicious. This approach helps keep Kroxylicious fast. -The actual performance overhead of using Kroxylicious depends on the particular use-case. +The actual performance overhead of using Kroxylicious depends on the particular use-case. As a guide: + +- **Passthrough proxy (no filters)**: ~0.2 ms additional average publish latency, no throughput impact +- **Record encryption (AES-256-GCM)**: ~26% throughput reduction per partition; 15–40 ms additional p99 latency at sub-saturation rates + +See the [performance reference page]({{ '/performance/' | absolute_url }}) for full benchmark results, methodology, and sizing guidance. diff --git a/performance.markdown b/performance.markdown new file mode 100644 index 00000000..058f650c --- /dev/null +++ b/performance.markdown @@ -0,0 +1,105 @@ +--- +layout: overview +title: Performance +permalink: /performance/ +toc: true +--- + +This page summarises the measured performance overhead of Kroxylicious. Numbers come from [benchmarks run on real hardware](/blog/2026/05/21/benchmarking-the-proxy/) using [OpenMessaging Benchmark (OMB)](https://github.com/openmessaging/benchmark), an industry-standard Kafka performance tool. No, we didn't run this on a laptop — it's a realistic deployment: a 6-node OpenShift cluster on Fyre, IBM's internal cloud platform — a controlled environment. + +## Test environment + +| Component | Details | +|-----------|---------| +| CPU | AMD EPYC-Rome, 2 GHz | +| Cluster | 6-node OpenShift, RHCOS 9.6 | +| Kafka | 3-broker Strimzi cluster, replication factor 3 | +| Kroxylicious | 0.20.0, single proxy pod, 1000m CPU limit | +| KMS | HashiCorp Vault (in-cluster) | + +All primary results used 1 KB messages on a single partition. Multi-topic workloads (10 and 100 topics) confirmed that overhead characteristics hold when load is distributed. + +--- + +## Passthrough proxy (no filters) + +The proxy layer itself adds negligible overhead. At sub-saturation rates the additional latency is sub-millisecond on average, with no measurable throughput impact. + +**10 topics, 1 KB messages (5,000 msg/sec per topic):** + +| Metric | Baseline | Proxy | Delta | +|--------|----------|-------|-------| +| Publish latency avg | 2.62 ms | 2.79 ms | +0.17 ms (+7%) | +| Publish latency p99 | 14.09 ms | 15.17 ms | +1.08 ms (+8%) | +| E2E latency avg | 94.87 ms | 95.34 ms | +0.47 ms (+0.5%) | +| Publish rate | 5,002 msg/s | 5,002 msg/s | no change | + +**100 topics, 1 KB messages (500 msg/sec per topic):** + +| Metric | Baseline | Proxy | Delta | +|--------|----------|-------|-------| +| Publish latency avg | 2.66 ms | 2.82 ms | +0.16 ms (+6%) | +| Publish latency p99 | 5.54 ms | 6.07 ms | +0.53 ms (+10%) | +| Publish rate | 500 msg/s | 500 msg/s | no change | + +--- + +## Record encryption (AES-256-GCM) + +Encryption adds measurable but predictable overhead. The cost scales with producer rate — well below saturation the overhead is small; approaching the saturation point, latency rises sharply. 
+ +### Latency at sub-saturation rates + +**1 topic, 1 KB messages — baseline vs encryption:** + +| Rate | Metric | Baseline | Encryption | Delta | +|------|--------|----------|------------|-------| +| 34,000 msg/s | Publish avg | 8.00 ms | 8.19 ms | +0.19 ms (+2%) | +| 34,000 msg/s | Publish p99 | 48.65 ms | 64.01 ms | +15.35 ms (+32%) | +| 36,000 msg/s | Publish avg | 9.38 ms | 10.46 ms | +1.08 ms (+12%) | +| 36,000 msg/s | Publish p99 | 63.92 ms | 88.98 ms | +25.06 ms (+39%) | +| 37,200 msg/s | Publish avg | 9.12 ms | 12.19 ms | +3.07 ms (+34%) | +| 37,200 msg/s | Publish p99 | 74.88 ms | 113.15 ms | +38.27 ms (+51%) | + +### Throughput ceiling + +| Scenario | Throughput ceiling (1 topic, 1 KB, 1 partition) | +|----------|------------------------------------------------| +| Baseline (direct Kafka) | ~50,000–52,000 msg/sec | +| Encryption (proxy + AES-256-GCM) | ~37,200 msg/sec | +| **Cost** | **~26% fewer messages per second per partition** | + +--- + +## Sizing guidance + +Numbers without guidance aren't very useful, so here's how to translate these results into pod specs. + +**Passthrough proxy**: size your Kafka cluster as you normally would. The proxy will not be the bottleneck. + +**With record encryption:** + +- **Throughput**: use `proxy CPU (millicores) = 20 × produce throughput (MB/s)` as a planning formula, then add ×1.3 headroom. Assumes matched consumer load and AMD EPYC-Rome 2 GHz with AES-NI — calibrate on your hardware. Validated at 1000m, 2000m, and 4000m. Example: 100k msg/s at 1 KB = 100 MB/s produce → 2000m + headroom → ~2600m. +- **Latency**: expect 0.2–3 ms additional average publish latency and 15–40 ms additional p99, scaling with how close to saturation you operate +- **Scaling**: set `requests` equal to `limits` in your pod spec to make the CPU budget — and therefore the throughput ceiling — deterministic. Increase the CPU limit to raise throughput; add proxy pods for redundancy. +- **KMS**: DEK caching means the KMS is not on the hot path. In testing, each benchmark run triggered only 5–19 DEK generation calls — the KMS is not a bottleneck + +--- + +## Caveats + +These numbers come from a single proxy pod, 1 KB messages, and single-pass measurements. A few things that matter when applying them to your workload: + +- **Message size**: the sizing coefficient is message-size-dependent — encryption overhead as a percentage is likely lower for larger messages +- **Replication factor**: the 1-topic latency and ceiling results ran at RF=3; at that replication factor Kafka's ISR replication creates a per-partition ceiling that sits close to where proxy CPU saturates. The sizing coefficient was derived from RF=1 multi-topic workloads to isolate proxy CPU +- **Horizontal scaling**: linear scaling has been validated across CPU allocations on a single pod; multi-pod scaling hasn't been measured but is expected to follow the same coefficient + +The [engineering post](/blog/2026/05/28/benchmarking-the-proxy-under-the-hood/) has the full methodology detail. + +--- + +## Further reading + +- [Operator guide: results, methodology, and sizing recommendations](/blog/2026/05/21/benchmarking-the-proxy/) — the full benchmark story for operators +- [How hard can it be??? 
Maxing out a Kroxylicious instance](/blog/2026/05/28/benchmarking-the-proxy-under-the-hood/) — how we measured it, where the CPU goes, and what surprised us +- [Benchmark quickstart](https://github.com/kroxylicious/kroxylicious/tree/main/kroxylicious-openmessaging-benchmarks/QUICKSTART.md) — run the benchmarks yourself \ No newline at end of file