perf: v1.1.0-beta.9 — 3.3M H2 rps, context leak fix, zero-alloc frame fast path #108
Merged
FumingPower3925 merged 5 commits into main on Mar 25, 2026
Conversation
Tag EC2 instances with Project=celeris-mage and KeyPair=<run-key> so instances from different runs/branches/projects are distinguishable. Add cleanup scope logging to make it clear which resources are being terminated. Safety audit confirms: all instance termination uses explicit IDs tracked from launch — no tag-based discovery or bulk operations that could affect other workloads sharing the same AWS account.
…_uring H2)

9 profile-driven optimizations targeting H2 allocation hotspots and io_uring pipeline stalls:

1. io_uring: reorder H2 drain before dirty list (reduces pipeline stalls)
2. Async H2 adapter: manual HPACK content-length (~140ns/req savings)
3. H1: fused status 200 + date cached block (one append vs two)
4. acquireContext: remove redundant nil check
5. io_uring: conditional immediate Submit (skip when CQEs ready)
6. Fix H2 context pool leak: only cache Context on H1 streams. H2 streams are ephemeral — caching caused contexts to leak (never returned to pool). This was 87% of H2 allocations.
7. Pre-allocate Stream.Headers capacity for H2 (avoid first-use alloc)
8. Persistent HPACK emit function per Processor (eliminate per-request closure allocation in header decode)
9. Reorder IsCancelled after canRunInline (avoid atomic on hot inline path)

Infrastructure: three-way benchmark comparison (main vs savepoint vs current) in CloudBenchmarkSplit for incremental optimization tracking.

Cloud benchmark results (arm64 c7g.2xlarge, split server/client):

- io_uring H2: 930K → 2.73M rps (+190-244%)
- epoll H2: 2.37M → 2.95M rps (+23-27%)
- H1: ~580K → ~578K rps (stable, within noise)
- io_uring vs epoll H2 gap: -56-69% → -7.2-8.1%
- H2/H1 ratio: 470-510% (H2 is now 5x faster than H1)
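The context-pool leak fixed in item 6 can be sketched in miniature. Assuming a `sync.Pool`-backed context pool and a per-stream cache slot (the `Stream`, `acquireContext`, and `releaseContext` shapes here are illustrative, not the actual celeris API), caching on an ephemeral H2 stream strands the object:

```go
package main

import (
	"fmt"
	"sync"
)

// Context is a per-request object recycled through a pool.
type Context struct{ path string }

var ctxPool = sync.Pool{New: func() any { return &Context{} }}

// Stream is a simplified stand-in for the server's stream type.
type Stream struct {
	isH1      bool
	cachedCtx *Context // per-stream cache; only useful on persistent H1 conns
}

func acquireContext(s *Stream) *Context {
	if s.cachedCtx != nil {
		c := s.cachedCtx
		s.cachedCtx = nil
		return c
	}
	return ctxPool.Get().(*Context)
}

// releaseContext caches only on H1 streams. H2 streams are ephemeral:
// caching there strands the Context on a dying stream, so it is never
// returned to the pool. That is the leak described above.
func releaseContext(s *Stream, c *Context) {
	if s.isH1 {
		s.cachedCtx = c
		return
	}
	ctxPool.Put(c)
}

func main() {
	h2 := &Stream{isH1: false}
	c := acquireContext(h2)
	releaseContext(h2, c)
	fmt.Println("h2 context returned to pool:", h2.cachedCtx == nil)
}
```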
…ng H2)

Add a zero-allocation HEADERS frame fast path in ProcessH2 that bypasses the x/net/http2 framer for the common case: HEADERS frames with END_HEADERS set, no PADDED, no PRIORITY, not during CONTINUATION. The x/net framer allocates a *HeadersFrame struct per ReadFrame call, which was 98% of remaining H2 allocations (2.5 GB/20s at 3M rps).

The fast path reads the 9-byte frame header directly from the recv buffer, extracts streamID/flags/payload, and passes them to a new ProcessRawHeaders method on the Processor that performs all RFC 7540 validations without allocating intermediate structs. Complex frames (PADDED, PRIORITY, CONTINUATION, non-HEADERS types) fall through to the existing x/net framer path unchanged.

Cloud benchmark results (arm64 c7g.2xlarge, split server/client):

- io_uring H2: 2.90M → 3.24M rps (+11.5-12.3%)
- epoll H2: 3.09M → 3.26M rps (+4.7-5.6%)
- H1: ~566K → ~564K rps (stable)
- io_uring vs epoll H2 gap: ELIMINATED (within ±0.6%)
- H2/H1 ratio: 570-577% (H2 is ~5.7x faster than H1)
- h2spec: 146/146 on io_uring and epoll (no new failures)
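The 9-byte frame-header decode the fast path performs can be sketched directly from RFC 7540: a 24-bit length, one type byte, one flags byte, and a 31-bit stream ID with the reserved bit masked off. The constants match the spec; the function names are illustrative, not the actual ProcessH2 code:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Frame constants from RFC 7540 sections 4.1 and 6.2.
const (
	frameHeaderLen   = 9
	frameTypeHeaders = 0x1
	flagEndHeaders   = 0x4
	flagPadded       = 0x8
	flagPriority     = 0x20
)

// parseFrameHeader decodes the fixed 9-byte frame header in place,
// with no intermediate struct allocation.
func parseFrameHeader(buf []byte) (length uint32, ftype, flags byte, streamID uint32) {
	length = uint32(buf[0])<<16 | uint32(buf[1])<<8 | uint32(buf[2])
	ftype = buf[3]
	flags = buf[4]
	streamID = binary.BigEndian.Uint32(buf[5:9]) & 0x7fffffff // mask reserved bit
	return
}

// isFastPathHeaders reports whether a frame qualifies for the zero-alloc
// path: HEADERS with END_HEADERS set and neither PADDED nor PRIORITY,
// so the payload is a raw header block fragment.
func isFastPathHeaders(ftype, flags byte) bool {
	return ftype == frameTypeHeaders &&
		flags&flagEndHeaders != 0 &&
		flags&(flagPadded|flagPriority) == 0
}

func main() {
	// HEADERS frame: length=5, type=0x1, flags=END_HEADERS, stream 3.
	hdr := []byte{0x00, 0x00, 0x05, 0x01, 0x04, 0x00, 0x00, 0x00, 0x03}
	length, ftype, flags, sid := parseFrameHeader(hdr)
	fmt.Println(length, ftype, flags, sid, isFastPathHeaders(ftype, flags))
	// → 5 1 4 3 true
}
```

Anything failing `isFastPathHeaders` would fall through to the x/net framer, which is how the PR keeps the complex cases (padding, priority, CONTINUATION) on the battle-tested path.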
- Add highlights section with headline numbers (3.3M H2 rps, 590K H1 rps)
- Update benchmarks to cloud results from arm64 c7g.2xlarge
- Add multishot recv, zero-alloc HEADERS, inline H2 handlers to feature matrix
- Update methodology (wrk + h2load, 9-pass interleaved)
- Update SECURITY.md: only >= 1.1.0 is supported
Summary
Profile-driven optimization loop targeting HTTP/2 throughput and io_uring parity with epoll. Two performance commits plus one infrastructure commit.
Commit 1: Fix H2 context leak + 9 optimizations (+190% io_uring H2)
- `acquireContext` cached contexts on ephemeral H2 streams, causing them to leak (never returned to `sync.Pool`). This was 87% of all H2 allocations. Fix: only cache on H1 streams (persistent keep-alive connections). H2 inline handlers use `InlineCachedCtx` instead.
- H1: fused `cachedStatus200Date` block (one append for status + date)
- Pre-allocate `Stream.Headers` capacity for H2 (avoid first-use allocation)
- Reorder `IsCancelled` after `canRunInline` (avoid atomic on hot inline path)
- `acquireContext`: remove redundant nil check

Commit 2: Zero-alloc HEADERS fast path (+12% io_uring H2)
Bypass the x/net framer for common HEADERS frames (END_HEADERS set, no PADDED/PRIORITY, not during CONTINUATION). The framer's `*HeadersFrame` allocation was 98% of remaining H2 allocations (2.5 GB/20s at 3M rps). The fast path reads the 9-byte frame header directly, extracts streamID/flags/payload, and passes them to `ProcessRawHeaders` — all RFC 7540 validations preserved.

Infrastructure: Three-way benchmark comparison
`CloudBenchmarkSplit` now builds 3 binaries (main, HEAD savepoint, current working tree) and runs a 9-pass interleaved schedule for three-way comparison. Enables tracking both total improvement (vs main) and incremental improvement (vs last commit).

Cloud Benchmark Results (arm64 c7g.2xlarge, split server/client)
Profile Analysis (post-optimization)
Test plan
- `mage fullCompliance` — all 4 phases pass (unit tests with race detector, 9 fuzz targets, h1spec + h2spec 142/146, conformance matrix 9 engine×protocol combos, integration tests)
- `mage cloudBenchmarkSplit` — 9-pass interleaved A/B/C on arm64 c7g.2xlarge
- `mage cloudProfileSplit` — CPU + allocation profiles on 4 configs
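The 9-pass interleaved A/B/C schedule used by the benchmark targets can be sketched as a simple round-robin over the three build variants, so that clock drift or thermal throttling during the run affects all variants roughly equally. This is an illustration of the schedule shape only, not the mage target itself:

```go
package main

import "fmt"

// interleavedSchedule round-robins passes across the three build
// variants (main, HEAD savepoint, current working tree). With 9 passes
// each variant runs 3 times: main savepoint current, repeated.
func interleavedSchedule(passes int) []string {
	variants := []string{"main", "savepoint", "current"}
	out := make([]string, 0, passes)
	for i := 0; i < passes; i++ {
		out = append(out, variants[i%len(variants)])
	}
	return out
}

func main() {
	fmt.Println(interleavedSchedule(9))
}
```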