
fix: reduce buffer ring memory 8x + fix H2 stream leak on connection close#109

Open
FumingPower3925 wants to merge 10 commits into main from fix/beta10-investigation

Conversation

@FumingPower3925
Contributor

Summary

Two critical fixes for the metal benchmark regressions observed in v1.1.0-beta.9.

Fix 1: Buffer ring memory reduction (CRITICAL)

Reduce bufRingCount from 4096 to 512 per io_uring worker.

Root cause: Each worker allocates bufRingCount * BufferSize bytes of mmap'd memory for the provided buffer ring. With the Greedy preset (BufferSize=65536) and bufRingCount=4096, each worker used 256 MB. On metal instances:

| Instance       | vCPUs | Workers | Buffer Ring RSS | Observed RSS |
|----------------|-------|---------|-----------------|--------------|
| c7g.metal      | 64    | 64      | 16 GB           | 17 GB        |
| c7i.metal-24xl | 96    | 96      | 24 GB           | 25 GB        |

v1.0.0 had no provided buffer ring (it used per-connection recv buffers), which is why it showed only 370-470 MB RSS.

With bufRingCount=512, each worker uses 32 MB, reducing the total to 2-3 GB on metal. 512 concurrent multishot recvs per worker are more than sufficient.
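The arithmetic behind the table can be sanity-checked in a few lines (constant names mirror the PR description, not the actual celeris source):

```go
package main

import "fmt"

// Back-of-the-envelope arithmetic for the buffer ring sizing above.
const bufferSize = 65536 // Greedy preset BufferSize

// ringBytes returns the per-worker buffer ring allocation in bytes.
func ringBytes(bufRingCount int) int {
	return bufRingCount * bufferSize
}

func main() {
	// c7i.metal-24xl runs 96 workers.
	fmt.Printf("old: %d MB/worker, %d GB total\n",
		ringBytes(4096)>>20, 96*ringBytes(4096)>>30) // old: 256 MB/worker, 24 GB total
	fmt.Printf("new: %d MB/worker, %d GB total\n",
		ringBytes(512)>>20, 96*ringBytes(512)>>30) // new: 32 MB/worker, 3 GB total
}
```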

Fix 2: H2 stream leak on connection close

Add Manager.Close() method and call it from CloseH2().

When an H2 connection closed, CloseH2() only cleaned up the encoder and left every Stream object in the Manager.streams map referenced indefinitely. Over many connection cycles, this leaked Stream objects, headers slices, and outbound buffers.

Manager.Close() iterates the map, releases non-async streams back to the pool, and clears pending window updates.
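A minimal sketch of that cleanup, under assumed type and field names (the Manager.streams map, a sync.Pool for stream reuse, and a pending-window-update map are illustrative, not the actual celeris internals):

```go
package main

import (
	"fmt"
	"sync"
)

// Stream is a stand-in for the per-stream state described in the PR.
type Stream struct {
	async   bool
	headers [][]byte
	outbuf  []byte
}

var streamPool = sync.Pool{New: func() any { return &Stream{} }}

type Manager struct {
	streams              map[uint32]*Stream
	pendingWindowUpdates map[uint32]int32
}

// Close drops every per-stream reference held by the connection so
// nothing stays reachable after CloseH2.
func (m *Manager) Close() {
	for id, s := range m.streams {
		if !s.async {
			// Zero the large fields before pooling so the pool
			// doesn't pin old header/body allocations.
			s.headers, s.outbuf = nil, nil
			streamPool.Put(s)
		}
		delete(m.streams, id)
	}
	for id := range m.pendingWindowUpdates {
		delete(m.pendingWindowUpdates, id)
	}
}

func main() {
	m := &Manager{
		streams:              map[uint32]*Stream{1: {}, 3: {async: true}},
		pendingWindowUpdates: map[uint32]int32{1: 65535},
	}
	m.Close()
	fmt.Println(len(m.streams), len(m.pendingWindowUpdates)) // 0 0
}
```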

Impact

These fixes address the reported metal benchmark issues:

  • 25 GB RSS on x86 c7i.metal-24xl → expected ~3 GB
  • 17 GB RSS on arm64 c7g.metal → expected ~2 GB
  • Missing adaptive configs (3 configs crashed, likely OOM from combined engine memory) → should resolve
  • 5-10% H1 RPS regression on ARM64 → likely caused by TLB/cache pressure from 16+ GB mmap'd regions

Test plan

  • mage fullCompliance — all 4 phases pass
  • All tests pass with race detector
  • h2spec: 146/146 on io_uring and epoll
  • Re-run metal benchmarks from benchmarks repo to verify RSS and RPS

…close

Two critical fixes for the metal benchmark regressions:

1. Reduce bufRingCount from 4096 to 512 per worker. The previous value
   caused 16-25 GB RSS on metal instances (64-96 vCPU) because each
   worker allocated 4096 * 65536 = 256 MB of mmap'd buffer ring memory.
   With 512 buffers per worker, memory drops to 32 MB/worker (2-3 GB
   total on metal). 512 concurrent multishot recvs per worker is more
   than sufficient for any workload.

   v1.0.0 had NO provided buffer ring (it used per-connection buffers),
   which is why v1.0.0 showed 370-470 MB RSS vs beta.9's 17-25 GB.

2. Add Manager.Close() and call it from CloseH2(). When an H2 connection
   closed, all Stream objects in Manager.streams remained referenced
   indefinitely. Over many connections, this leaked Stream objects,
   headers slices, and outbound buffers. Manager.Close() iterates the
   map, releases non-async streams, and clears pending window updates.

These fixes address the reported metal benchmark issues:
- 25 GB RSS on x86 c7i.metal-24xl (96 vCPU)
- 17 GB RSS on arm64 c7g.metal (64 vCPU)
- Missing adaptive configs (OOM from combined engine memory)

Add mage target for full-matrix celeris benchmarking with three-way
version comparison (v1.0.0, HEAD, current). Tests all 54 configurations:
3 engines × 3 objectives × 3 protocols × 2 presets (Greedy + Minimal).

Server binary (test/benchmark/server) now accepts:
- PRESET env var: "greedy" or "minimal" (was hardcoded to greedy)
- WORKERS env var: explicit worker count override
- /upload route: body benchmark support (POST with request body)

The target uses wrk (H1) and h2load (H2) as load generators, with the
benchmarks repo's bench client available for richer metrics when needed.
Server binaries for published versions are built from git worktrees.
…port

Rewrite the metal benchmark target to collect RPS from wrk/h2load output
inline (same pattern as CloudBenchmarkSplit) and generate a comparison
report with per-engine, per-protocol, and per-preset breakdowns.

Results are NOT committed — they're printed and saved as a text report
to results/<timestamp>-cloud-metal-benchmark/report-<arch>.txt.

The report includes:
- Full config table with RPS for each ref + delta%
- Per-engine averages (iouring, epoll, adaptive)
- Per-protocol averages (h1, h2, hybrid)
- Per-preset averages (greedy, minimal)
- Failure list (configs that didn't start)
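The delta% column could be computed as a simple relative change against the baseline ref; the exact formula used by the target is an assumption here:

```go
package main

import "fmt"

// deltaPercent returns the relative change of current vs baseline,
// in percent. Illustrative only; the mage target's report code may differ.
func deltaPercent(baseline, current float64) float64 {
	return (current - baseline) / baseline * 100
}

func main() {
	// e.g. baseline ref at 100k RPS, current at 109k RPS
	fmt.Printf("%+.1f%%\n", deltaPercent(100000, 109000)) // +9.0%
}
```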

@FumingPower3925 force-pushed the fix/beta10-investigation branch from a5ab48f to df5abc5 on March 25, 2026 at 13:40.

Remove the Greedy/Minimal preset system and the Latency/Throughput/Balanced
objective system. Benchmark data across 54 configurations shows these
produce statistically identical results (0.4% spread between objectives,
1.00x H1 ratio between presets).

Hardcode sensible defaults from the former "Balanced" objective and "Greedy"
preset: Workers=GOMAXPROCS, BufferSize=65536, TCPNoDelay=true,
TCPQuickAck=true, SOBusyPoll=50µs, EpollTimeout=1ms, SQPollIdle=2s.

Users can still override Workers directly in Config.

Public API changes:
- Remove celeris.Objective type and Latency/Throughput/Balanced constants
- Remove Config.Objective field
- Keep Config.Workers (user-overridable)

Internal changes:
- Delete resource/objective.go and associated tests
- Simplify resource/preset.go to resolveDefaults()
- Remove ObjectiveParams from engine constructors and worker structs
- Hardcode socket options in epoll and iouring engines
- Simplify mage targets from engine×objective×protocol to engine×protocol
  (benchmark configs: 27 → 9 celeris, metal: 54 → 9)

-593 net lines removed.

@FumingPower3925 force-pushed the fix/beta10-investigation branch from df5abc5 to d056d16 on March 25, 2026 at 13:49.

Change the zero-value defaults so celeris.New(celeris.Config{Addr: ":8080"})
uses the optimal configuration out of the box:

- Engine: Adaptive (was IOUring). Adaptive starts with epoll — which has
  9% better H2 throughput than io_uring on current kernels due to its
  single-pass read→process→write model — and can switch to io_uring if
  telemetry indicates it would perform better for the workload.

- Protocol: Auto (was HTTP1). Auto-detect H1 vs H2 from the first bytes
  on the wire, supporting both protocols on the same port with negligible
  overhead for H1 connections (~50 ns amortized detection per connection).
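First-bytes detection can be sketched by matching the fixed HTTP/2 client connection preface (RFC 9113); how celeris actually wires this into its accept path is not shown in this PR:

```go
package main

import (
	"bytes"
	"fmt"
)

// h2Preface is the fixed string every HTTP/2 connection must start with.
var h2Preface = []byte("PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n")

// looksLikeH2 reports whether the first bytes read from a connection
// begin the HTTP/2 client connection preface; H1 requests instead
// start with a method token and request line.
func looksLikeH2(first []byte) bool {
	return bytes.HasPrefix(first, h2Preface)
}

func main() {
	fmt.Println(looksLikeH2(h2Preface))                    // true
	fmt.Println(looksLikeH2([]byte("GET / HTTP/1.1\r\n"))) // false
}
```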

Implementation: add sentinel zero values (engineDefault, protocolDefault)
to the iota enums so Config{} can be distinguished from explicit
Config{Engine: IOUring}. Defaults are resolved in Config.WithDefaults():
Adaptive on Linux, Std on other platforms.
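A sketch of that sentinel pattern under assumed names (the enum values and the WithDefaults body are illustrative, not the exact celeris code):

```go
package main

import (
	"fmt"
	"runtime"
)

type Engine int

const (
	engineDefault Engine = iota // zero value: "not set by the user"
	IOUring
	Epoll
	Adaptive
	Std
)

type Config struct {
	Addr   string
	Engine Engine
}

// WithDefaults resolves the zero value to a platform-appropriate engine
// while leaving explicit choices (Config{Engine: IOUring}) untouched.
func (c Config) WithDefaults() Config {
	if c.Engine == engineDefault {
		if runtime.GOOS == "linux" {
			c.Engine = Adaptive
		} else {
			c.Engine = Std
		}
	}
	return c
}

func main() {
	fmt.Println(Config{Addr: ":8080"}.WithDefaults().Engine != engineDefault) // true
}
```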

Also flips Adaptive to start with epoll as primary (was io_uring),
matching benchmark data showing epoll's H2 advantage.

Dedicated mage target that stress-tests the adaptive engine under
synthetic load patterns for each protocol (h1, h2, hybrid):

1. Steady-state warmup (30s, 256 conns) — baseline RPS
2. High concurrency burst (15s, 4096 conns) — stress test
3. Low concurrency recovery (15s, 32 conns) — drain behavior
4. Steady state again (30s) — stability check
5. Idle period (60s) then reload — eval cycle after silence
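The five-phase schedule above could be encoded as data driving the load generator, for example (a sketch; the struct and field names are illustrative, and the actual mage target may drive wrk/h2load differently):

```go
package main

import (
	"fmt"
	"time"
)

// phase describes one step of the synthetic load pattern.
type phase struct {
	name  string
	dur   time.Duration
	conns int // 0 = idle, no load generator running
}

var phases = []phase{
	{"steady-state warmup", 30 * time.Second, 256},
	{"high-concurrency burst", 15 * time.Second, 4096},
	{"low-concurrency recovery", 15 * time.Second, 32},
	{"steady state again", 30 * time.Second, 256},
	{"idle, then reload", 60 * time.Second, 0},
}

func main() {
	for _, p := range phases {
		fmt.Printf("%-26s %4s %5d conns\n", p.name, p.dur, p.conns)
	}
}
```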

Captures server logs to extract switch events, scoring decisions,
oscillation locks, and engine type transitions. Reports switch count,
timing, and RPS at each phase.

Also fixes adaptive engine comments and H2 initial engine selection
after the epoll-primary swap.

@FumingPower3925 force-pushed the fix/beta10-investigation branch from 6f353a1 to 1b4dcb0 on March 25, 2026 at 14:57.

Upgrade CloudIOUringSweep to test on both architectures with:
- Large instances: c7g.16xlarge (64 vCPU arm64), c7i.16xlarge (64 vCPU x86)
- Connection range: 16 to 65536
- Both single-worker and default (GOMAXPROCS) modes
- H1 and H2 protocols
- Full report with per-config tables and best/worst summary

The goal: find any workload where io_uring outperforms epoll to inform
the adaptive engine's switching criteria.

@FumingPower3925 force-pushed the fix/beta10-investigation branch from 762f0df to 82d9583 on March 25, 2026 at 15:35.