fix: reduce buffer ring memory 8x + fix H2 stream leak on connection close #109
Open
FumingPower3925 wants to merge 10 commits into main from
Conversation
…close

Two critical fixes for the metal benchmark regressions:

1. Reduce bufRingCount from 4096 to 512 per worker. The previous value caused 16-25 GB RSS on metal instances (64-96 vCPU) because each worker allocated 4096 * 65536 = 256 MB of mmap'd buffer ring memory. With 512 buffers per worker, memory drops to 32 MB/worker (2-3 GB total on metal). 512 concurrent multishot recvs per worker is more than sufficient for any workload. v1.0.0 had NO provided buffer ring (it used per-connection buffers), which is why v1.0.0 showed 370-470 MB RSS vs beta.9's 17-25 GB.

2. Add Manager.Close() and call it from CloseH2(). When an H2 connection closed, all Stream objects in Manager.streams remained referenced indefinitely. Over many connections, this leaked Stream objects, headers slices, and outbound buffers. Manager.Close() iterates the map, releases non-async streams, and clears pending window updates.

These fixes address the reported metal benchmark issues:
- 25 GB RSS on x86 c7i.metal-24xl (96 vCPU)
- 17 GB RSS on arm64 c7g.metal (64 vCPU)
- Missing adaptive configs (OOM from combined engine memory)
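The sizing arithmetic behind fix 1 can be checked with a short calculation. The worker count of 96 below is an assumption (one io_uring worker per vCPU on the c7i.metal-24xl), as the commit message implies:

```go
package main

import "fmt"

func main() {
	const bufferSize = 65536 // BufferSize used for each buffer in the ring

	for _, ringCount := range []int{4096, 512} {
		perWorker := ringCount * bufferSize // mmap'd buffer ring bytes per worker
		total := perWorker * 96             // assuming 96 workers on a 96-vCPU metal host
		fmt.Printf("bufRingCount=%d: %d MB/worker, %d GB total\n",
			ringCount, perWorker/(1<<20), total/(1<<30))
	}
}
```

This prints 256 MB/worker (24 GB total) for 4096 buffers and 32 MB/worker (3 GB total) for 512, consistent with the reported 17-25 GB RSS before the fix and 2-3 GB after.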
Add mage target for full-matrix celeris benchmarking with three-way version comparison (v1.0.0, HEAD, current). Tests all 54 configurations: 3 engines × 3 objectives × 3 protocols × 2 presets (Greedy + Minimal).

Server binary (test/benchmark/server) now accepts:
- PRESET env var: "greedy" or "minimal" (was hardcoded to greedy)
- WORKERS env var: explicit worker count override
- /upload route: body benchmark support (POST with request body)

The target uses wrk (H1) and h2load (H2) as load generators, with the benchmarks repo's bench client available for richer metrics when needed. Server binaries for published versions are built from git worktrees.
…port

Rewrite the metal benchmark target to collect RPS from wrk/h2load output inline (same pattern as CloudBenchmarkSplit) and generate a comparison report with per-engine, per-protocol, and per-preset breakdowns. Results are NOT committed: they're printed and saved as a text report to results/<timestamp>-cloud-metal-benchmark/report-<arch>.txt.

The report includes:
- Full config table with RPS for each ref + delta%
- Per-engine averages (iouring, epoll, adaptive)
- Per-protocol averages (h1, h2, hybrid)
- Per-preset averages (greedy, minimal)
- Failure list (configs that didn't start)
Force-pushed from a5ab48f to df5abc5.
Remove the Greedy/Minimal preset system and the Latency/Throughput/Balanced objective system. Benchmark data across 54 configurations shows these produce statistically identical results (0.4% spread between objectives, 1.00x H1 ratio between presets).

Hardcode sensible defaults from the former "Balanced" objective and "Greedy" preset: Workers=GOMAXPROCS, BufferSize=65536, TCPNoDelay=true, TCPQuickAck=true, SOBusyPoll=50µs, EpollTimeout=1ms, SQPollIdle=2s. Users can still override Workers directly in Config.

Public API changes:
- Remove celeris.Objective type and Latency/Throughput/Balanced constants
- Remove Config.Objective field
- Keep Config.Workers (user-overridable)

Internal changes:
- Delete resource/objective.go and associated tests
- Simplify resource/preset.go to resolveDefaults()
- Remove ObjectiveParams from engine constructors and worker structs
- Hardcode socket options in epoll and iouring engines
- Simplify mage targets from engine×objective×protocol to engine×protocol (benchmark configs: 27 → 9 celeris, metal: 54 → 9)

-593 net lines removed.
Force-pushed from df5abc5 to d056d16.
Change the zero-value defaults so celeris.New(celeris.Config{Addr: ":8080"})
uses the optimal configuration out of the box:
- Engine: Adaptive (was IOUring). Adaptive starts with epoll — which has
9% better H2 throughput than io_uring on current kernels due to its
single-pass read→process→write model — and can switch to io_uring if
telemetry indicates it would perform better for the workload.
- Protocol: Auto (was HTTP1). Auto-detect H1 vs H2 from the first bytes
on the wire, supporting both protocols on the same port with zero
overhead for H1 connections (~50ns amortized detection per connection).
Implementation: add sentinel zero values (engineDefault, protocolDefault)
to the iota enums so Config{} can be distinguished from explicit
Config{Engine: IOUring}. Defaults are resolved in Config.WithDefaults():
Adaptive on Linux, Std on other platforms.
Also flip Adaptive to start with epoll as primary (was io_uring),
matching benchmark data showing epoll's H2 advantage.
Dedicated mage target that stress-tests the adaptive engine under synthetic load patterns for each protocol (h1, h2, hybrid):

1. Steady-state warmup (30s, 256 conns): baseline RPS
2. High concurrency burst (15s, 4096 conns): stress test
3. Low concurrency recovery (15s, 32 conns): drain behavior
4. Steady state again (30s): stability check
5. Idle period (60s) then reload: eval cycle after silence

Captures server logs to extract switch events, scoring decisions, oscillation locks, and engine type transitions. Reports switch count, timing, and RPS at each phase. Also fixes adaptive engine comments and H2 initial engine selection after the epoll-primary swap.
Force-pushed from 6f353a1 to 1b4dcb0.
Upgrade CloudIOUringSweep to test on both architectures with:
- Large instances: c7g.16xlarge (64 vCPU arm64), c7i.16xlarge (64 vCPU x86)
- Connection range: 16 to 65536
- Both single-worker and default (GOMAXPROCS) modes
- H1 and H2 protocols
- Full report with per-config tables and best/worst summary

The goal: find any workload where io_uring outperforms epoll to inform the adaptive engine's switching criteria.
Force-pushed from 762f0df to 82d9583.
Summary
Two critical fixes for the metal benchmark regressions observed in v1.1.0-beta.9.
Fix 1: Buffer ring memory reduction (CRITICAL)
Reduce bufRingCount from 4096 to 512 per io_uring worker.

Root cause: each worker allocates bufRingCount * BufferSize bytes of mmap'd memory for the provided buffer ring. With the Greedy preset (BufferSize=65536) and bufRingCount=4096, each worker used 256 MB. v1.0.0 had no provided buffer ring (it used per-connection recv buffers), which is why it showed only 370-470 MB RSS.

With bufRingCount=512, each worker uses 32 MB, reducing the total to 2-3 GB on metal. 512 concurrent multishot recvs per worker is more than sufficient.

Fix 2: H2 stream leak on connection close
Add a Manager.Close() method and call it from CloseH2().

When an H2 connection closed, CloseH2() only cleaned up the encoder but left all Stream objects in the Manager.streams map referenced indefinitely. Over many connection cycles, this leaked stream objects, headers slices, and outbound buffers. Manager.Close() iterates the map, releases non-async streams back to the pool, and clears pending window updates.

Impact
These fixes address the reported metal benchmark issues:
- 25 GB RSS on x86 c7i.metal-24xl (96 vCPU)
- 17 GB RSS on arm64 c7g.metal (64 vCPU)
- Missing adaptive configs (OOM from combined engine memory)
Test plan
mage fullCompliance: all 4 phases pass