
fix: reduce buffer ring memory 8x + fix H2 stream leak on connection close#109

Open
FumingPower3925 wants to merge 10 commits into main from fix/beta10-investigation

Conversation

@FumingPower3925
Contributor

Summary

Two critical fixes for the metal benchmark regressions observed in v1.1.0-beta.9.

Fix 1: Buffer ring memory reduction (CRITICAL)

Reduce bufRingCount from 4096 to 512 per io_uring worker.

Root cause: Each worker allocates bufRingCount * BufferSize bytes of mmap'd memory for the provided buffer ring. With the Greedy preset (BufferSize=65536) and bufRingCount=4096, each worker used 256 MB. On metal instances:

| Instance       | vCPUs | Workers | Buffer Ring RSS | Observed RSS |
|----------------|-------|---------|-----------------|--------------|
| c7g.metal      | 64    | 64      | 16 GB           | 17 GB        |
| c7i.metal-24xl | 96    | 96      | 24 GB           | 25 GB        |

v1.0.0 had no provided buffer ring (it used per-connection recv buffers), which is why it showed only 370-470 MB RSS.

With bufRingCount=512, each worker uses 32 MB, reducing the total to 2-3 GB on metal. 512 concurrent multishot recvs per worker are more than sufficient.
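The arithmetic behind the table can be sanity-checked in a few lines (constant names mirror the PR description, not the actual celeris source):

```go
package main

import "fmt"

// Back-of-the-envelope arithmetic for the buffer ring sizing above.
const bufferSize = 65536 // Greedy preset BufferSize

// ringBytes returns the per-worker buffer ring allocation in bytes.
func ringBytes(bufRingCount int) int {
	return bufRingCount * bufferSize
}

func main() {
	// c7i.metal-24xl runs 96 workers.
	fmt.Printf("old: %d MB/worker, %d GB total\n",
		ringBytes(4096)>>20, 96*ringBytes(4096)>>30) // old: 256 MB/worker, 24 GB total
	fmt.Printf("new: %d MB/worker, %d GB total\n",
		ringBytes(512)>>20, 96*ringBytes(512)>>30) // new: 32 MB/worker, 3 GB total
}
```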

Fix 2: H2 stream leak on connection close

Add Manager.Close() method and call it from CloseH2().

When an H2 connection closed, CloseH2() only cleaned up the encoder and left every Stream object in the Manager.streams map referenced indefinitely. Over many connection cycles, this leaked Stream objects, headers slices, and outbound buffers.

Manager.Close() iterates the map, releases non-async streams back to the pool, and clears pending window updates.
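A minimal sketch of that cleanup, under assumed type and field names (the Manager.streams map, a sync.Pool for stream reuse, and a pending-window-update map are illustrative, not the actual celeris internals):

```go
package main

import (
	"fmt"
	"sync"
)

// Stream is a stand-in for the per-stream state described in the PR.
type Stream struct {
	async   bool
	headers [][]byte
	outbuf  []byte
}

var streamPool = sync.Pool{New: func() any { return &Stream{} }}

type Manager struct {
	streams              map[uint32]*Stream
	pendingWindowUpdates map[uint32]int32
}

// Close drops every per-stream reference held by the connection so
// nothing stays reachable after CloseH2.
func (m *Manager) Close() {
	for id, s := range m.streams {
		if !s.async {
			// Zero the large fields before pooling so the pool
			// doesn't pin old header/body allocations.
			s.headers, s.outbuf = nil, nil
			streamPool.Put(s)
		}
		delete(m.streams, id)
	}
	for id := range m.pendingWindowUpdates {
		delete(m.pendingWindowUpdates, id)
	}
}

func main() {
	m := &Manager{
		streams:              map[uint32]*Stream{1: {}, 3: {async: true}},
		pendingWindowUpdates: map[uint32]int32{1: 65535},
	}
	m.Close()
	fmt.Println(len(m.streams), len(m.pendingWindowUpdates)) // 0 0
}
```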

Impact

These fixes address the reported metal benchmark issues:

  • 25 GB RSS on x86 c7i.metal-24xl → expected ~3 GB
  • 17 GB RSS on arm64 c7g.metal → expected ~2 GB
  • Missing adaptive configs (3 configs crashed, likely OOM from combined engine memory) → should resolve
  • 5-10% H1 RPS regression on ARM64 → likely caused by TLB/cache pressure from 16+ GB mmap'd regions

Test plan

  • mage fullCompliance — all 4 phases pass
  • All tests pass with race detector
  • h2spec: 146/146 on io_uring and epoll
  • Re-run metal benchmarks from benchmarks repo to verify RSS and RPS

…close

Two critical fixes for the metal benchmark regressions:

1. Reduce bufRingCount from 4096 to 512 per worker. The previous value
   caused 16-25 GB RSS on metal instances (64-96 vCPU) because each
   worker allocated 4096 * 65536 = 256 MB of mmap'd buffer ring memory.
   With 512 buffers per worker, memory drops to 32 MB/worker (2-3 GB
   total on metal). 512 concurrent multishot recvs per worker is more
   than sufficient for any workload.

   v1.0.0 had NO provided buffer ring (it used per-connection buffers),
   which is why v1.0.0 showed 370-470 MB RSS vs beta.9's 17-25 GB.

2. Add Manager.Close() and call it from CloseH2(). When an H2 connection
   closed, all Stream objects in Manager.streams remained referenced
   indefinitely. Over many connections, this leaked Stream objects,
   headers slices, and outbound buffers. Manager.Close() iterates the
   map, releases non-async streams, and clears pending window updates.

These fixes address the reported metal benchmark issues:
- 25 GB RSS on x86 c7i.metal-24xl (96 vCPU)
- 17 GB RSS on arm64 c7g.metal (64 vCPU)
- Missing adaptive configs (OOM from combined engine memory)

Add mage target for full-matrix celeris benchmarking with three-way
version comparison (v1.0.0, HEAD, current). Tests all 54 configurations:
3 engines × 3 objectives × 3 protocols × 2 presets (Greedy + Minimal).

Server binary (test/benchmark/server) now accepts:
- PRESET env var: "greedy" or "minimal" (was hardcoded to greedy)
- WORKERS env var: explicit worker count override
- /upload route: body benchmark support (POST with request body)

The target uses wrk (H1) and h2load (H2) as load generators, with the
benchmarks repo's bench client available for richer metrics when needed.
Server binaries for published versions are built from git worktrees.
…port

Rewrite the metal benchmark target to collect RPS from wrk/h2load output
inline (same pattern as CloudBenchmarkSplit) and generate a comparison
report with per-engine, per-protocol, and per-preset breakdowns.

Results are NOT committed — they're printed and saved as a text report
to results/<timestamp>-cloud-metal-benchmark/report-<arch>.txt.

The report includes:
- Full config table with RPS for each ref + delta%
- Per-engine averages (iouring, epoll, adaptive)
- Per-protocol averages (h1, h2, hybrid)
- Per-preset averages (greedy, minimal)
- Failure list (configs that didn't start)
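The delta% column could be computed as a simple relative change against the baseline ref; the exact formula used by the target is an assumption here:

```go
package main

import "fmt"

// deltaPercent returns the relative change of current vs baseline,
// in percent. Illustrative only; the mage target's report code may differ.
func deltaPercent(baseline, current float64) float64 {
	return (current - baseline) / baseline * 100
}

func main() {
	// e.g. baseline ref at 100k RPS, current at 109k RPS
	fmt.Printf("%+.1f%%\n", deltaPercent(100000, 109000)) // +9.0%
}
```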

@FumingPower3925 force-pushed the fix/beta10-investigation branch from a5ab48f to df5abc5 on March 25, 2026 at 13:40.

Remove the Greedy/Minimal preset system and the Latency/Throughput/Balanced
objective system. Benchmark data across 54 configurations shows these
produce statistically identical results (0.4% spread between objectives,
1.00x H1 ratio between presets).

Hardcode sensible defaults from the former "Balanced" objective and "Greedy"
preset: Workers=GOMAXPROCS, BufferSize=65536, TCPNoDelay=true,
TCPQuickAck=true, SOBusyPoll=50µs, EpollTimeout=1ms, SQPollIdle=2s.

Users can still override Workers directly in Config.

Public API changes:
- Remove celeris.Objective type and Latency/Throughput/Balanced constants
- Remove Config.Objective field
- Keep Config.Workers (user-overridable)

Internal changes:
- Delete resource/objective.go and associated tests
- Simplify resource/preset.go to resolveDefaults()
- Remove ObjectiveParams from engine constructors and worker structs
- Hardcode socket options in epoll and iouring engines
- Simplify mage targets from engine×objective×protocol to engine×protocol
  (benchmark configs: 27 → 9 celeris, metal: 54 → 9)

-593 net lines removed.

@FumingPower3925 force-pushed the fix/beta10-investigation branch from df5abc5 to d056d16 on March 25, 2026 at 13:49.

Change the zero-value defaults so celeris.New(celeris.Config{Addr: ":8080"})
uses the optimal configuration out of the box:

- Engine: Adaptive (was IOUring). Adaptive starts with epoll — which has
  9% better H2 throughput than io_uring on current kernels due to its
  single-pass read→process→write model — and can switch to io_uring if
  telemetry indicates it would perform better for the workload.

- Protocol: Auto (was HTTP1). Auto-detect H1 vs H2 from the first bytes
  on the wire, supporting both protocols on the same port with negligible
  overhead for H1 connections (~50 ns amortized detection per connection).
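First-bytes detection can be sketched by matching the fixed HTTP/2 client connection preface (RFC 9113); how celeris actually wires this into its accept path is not shown in this PR:

```go
package main

import (
	"bytes"
	"fmt"
)

// h2Preface is the fixed string every HTTP/2 connection must start with.
var h2Preface = []byte("PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n")

// looksLikeH2 reports whether the first bytes read from a connection
// begin the HTTP/2 client connection preface; H1 requests instead
// start with a method token and request line.
func looksLikeH2(first []byte) bool {
	return bytes.HasPrefix(first, h2Preface)
}

func main() {
	fmt.Println(looksLikeH2(h2Preface))                    // true
	fmt.Println(looksLikeH2([]byte("GET / HTTP/1.1\r\n"))) // false
}
```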

Implementation: add sentinel zero values (engineDefault, protocolDefault)
to the iota enums so Config{} can be distinguished from explicit
Config{Engine: IOUring}. Defaults are resolved in Config.WithDefaults():
Adaptive on Linux, Std on other platforms.
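A sketch of that sentinel pattern under assumed names (the enum values and the WithDefaults body are illustrative, not the exact celeris code):

```go
package main

import (
	"fmt"
	"runtime"
)

type Engine int

const (
	engineDefault Engine = iota // zero value: "not set by the user"
	IOUring
	Epoll
	Adaptive
	Std
)

type Config struct {
	Addr   string
	Engine Engine
}

// WithDefaults resolves the zero value to a platform-appropriate engine
// while leaving explicit choices (Config{Engine: IOUring}) untouched.
func (c Config) WithDefaults() Config {
	if c.Engine == engineDefault {
		if runtime.GOOS == "linux" {
			c.Engine = Adaptive
		} else {
			c.Engine = Std
		}
	}
	return c
}

func main() {
	fmt.Println(Config{Addr: ":8080"}.WithDefaults().Engine != engineDefault) // true
}
```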

Also flips Adaptive to start with epoll as primary (was io_uring),
matching benchmark data showing epoll's H2 advantage.

Dedicated mage target that stress-tests the adaptive engine under
synthetic load patterns for each protocol (h1, h2, hybrid):

1. Steady-state warmup (30s, 256 conns) — baseline RPS
2. High concurrency burst (15s, 4096 conns) — stress test
3. Low concurrency recovery (15s, 32 conns) — drain behavior
4. Steady state again (30s) — stability check
5. Idle period (60s) then reload — eval cycle after silence
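The five-phase schedule above could be encoded as data driving the load generator, for example (a sketch; the struct and field names are illustrative, and the actual mage target may drive wrk/h2load differently):

```go
package main

import (
	"fmt"
	"time"
)

// phase describes one step of the synthetic load pattern.
type phase struct {
	name  string
	dur   time.Duration
	conns int // 0 = idle, no load generator running
}

var phases = []phase{
	{"steady-state warmup", 30 * time.Second, 256},
	{"high-concurrency burst", 15 * time.Second, 4096},
	{"low-concurrency recovery", 15 * time.Second, 32},
	{"steady state again", 30 * time.Second, 256},
	{"idle, then reload", 60 * time.Second, 0},
}

func main() {
	for _, p := range phases {
		fmt.Printf("%-26s %4s %5d conns\n", p.name, p.dur, p.conns)
	}
}
```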

Captures server logs to extract switch events, scoring decisions,
oscillation locks, and engine type transitions. Reports switch count,
timing, and RPS at each phase.

Also fixes adaptive engine comments and H2 initial engine selection
after the epoll-primary swap.

@FumingPower3925 force-pushed the fix/beta10-investigation branch from 6f353a1 to 1b4dcb0 on March 25, 2026 at 14:57.

Upgrade CloudIOUringSweep to test on both architectures with:
- Large instances: c7g.16xlarge (64 vCPU arm64), c7i.16xlarge (64 vCPU x86)
- Connection range: 16 to 65536
- Both single-worker and default (GOMAXPROCS) modes
- H1 and H2 protocols
- Full report with per-config tables and best/worst summary

The goal: find any workload where io_uring outperforms epoll to inform
the adaptive engine's switching criteria.

@FumingPower3925 force-pushed the fix/beta10-investigation branch from 762f0df to 82d9583 on March 25, 2026 at 15:35.