fix(http): unblock request reader after Zig 0.16 migration #10

Open
Kures wants to merge 2 commits into nullclaw:main from Kures:fix/http-read-blocking

Kures commented May 12, 2026

PR5 — fix(http): unblock request reader after Zig 0.16 migration

Branch: fix/http-read-blocking
Fixes: every HTTP endpoint times out on main since 54455f9 (PR #5, Zig 0.16 merge, 2026-04-17). The published 2026.3.2 image is unaffected because it predates that merge.

Why

After the Zig 0.16 migration landed, anyone building from main HEAD
finds that curl http://localhost:8080/health hangs until its timeout.
The TCP handshake succeeds and the request is eventually read and
handled, but the response only reaches the wire 3-5 seconds after the
request, by which time the client has already given up.

The unit-test suite (zig build test) stayed green at 340/340 because
tests exercise api.handleRequest directly and never go through the
live accept-read-write loop in src/main.zig.

Root cause

Not the writer chain. The reader.

readHttpRequest calls reader.interface.readSliceShort(&chunk) with
chunk: [4096]u8. Per the Zig 0.16 documentation:

Returns the number of bytes read, which is less than buffer.len if
and only if the stream reached the end.

So readSliceShort blocks until the destination buffer is fully
filled or the stream EOFs. A typical GET /health is ~84 bytes, so
the reader sat waiting for another 4012 bytes that would never come.
The connection only progressed when curl gave up after its timeout
and FIN'd the socket — at which point readSliceShort returned the 84
bytes it had been holding, the handler ran, and the response was sent
into a socket the client had already closed.

tcpdump on the loopback interface confirmed the timeline (request
ACK'd at t=0, response packet leaving the server at t≈5s, RST from
the client immediately after).
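The stall described above is not specific to Zig; it follows from any "fill the buffer or hit EOF" read on a socket whose peer keeps its write side open. The following is a minimal sketch in Python (chosen so it does not depend on Zig 0.16 API details). The helper names read_until_full and read_once are hypothetical stand-ins for the two read styles, and the 0.5 s timeout is an arbitrary value used only so the demo terminates.

```python
import socket


def read_until_full(sock, size):
    """Keep receiving until `size` bytes arrive or the peer closes (EOF).

    This imitates a fill-the-buffer read: a short count is only
    possible at end of stream.
    """
    buf = bytearray()
    while len(buf) < size:
        part = sock.recv(size - len(buf))
        if not part:  # EOF is the only way out with a short count
            break
        buf += part
    return bytes(buf)


def read_once(sock, size):
    """Return whatever a single recv call delivers, up to `size` bytes."""
    return sock.recv(size)


def demo():
    request = b"GET /health HTTP/1.1\r\nHost: localhost:8080\r\n\r\n"
    results = {}

    # Case 1: a single recv returns the short request right away.
    client, server = socket.socketpair()
    server.settimeout(0.5)
    client.sendall(request)  # write side stays open, no FIN is sent
    results["once"] = read_once(server, 4096)
    client.close()
    server.close()

    # Case 2: the fill-the-buffer read waits for bytes that never come,
    # so the timeout fires instead of the call returning.
    client, server = socket.socketpair()
    server.settimeout(0.5)
    client.sendall(request)
    try:
        read_until_full(server, 4096)
        results["full"] = "returned"
    except socket.timeout:
        results["full"] = "timed out"
    client.close()
    server.close()
    return results
```

Calling demo() should show the single-recv read handing back the request bytes, while the fill-the-buffer read only ends because of the timeout, which is the same shape as curl giving up in the reproducer.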

Fix

Replace readSliceShort with readVec. readVec performs at most
one recv syscall per call and returns whatever the kernel has —
typically the entire HTTP request in one shot. The surrounding
while loop already handles fragmented requests correctly.
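To illustrate why a single-recv read combined with an outer loop copes with fragmented requests, here is a rough Python sketch of that pattern. It is not the project's readHttpRequest: the function name, the header-terminator check, and the max_size cap are all assumptions made for the example, and returning None on EOF is only loosely analogous to the error.EndOfStream arm in the diff.

```python
import socket

HEADER_END = b"\r\n\r\n"


def read_http_request(sock, chunk_size=4096, max_size=65536):
    """Accumulate single-recv chunks until the header terminator appears.

    Returns the bytes read, or None if the peer closes the connection
    before a complete header block has arrived.
    """
    buf = bytearray()
    while HEADER_END not in buf:
        if len(buf) >= max_size:
            raise ValueError("request headers too large")
        # One recv per iteration: returns whatever is available now,
        # which may be a fragment or the whole request.
        part = sock.recv(chunk_size)
        if not part:  # peer closed before the headers were complete
            return None
        buf += part
    return bytes(buf)
```

Each iteration returns as soon as some bytes are available, so a request that arrives in one segment completes in one pass and a request split across segments simply takes more passes.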

Diff (src/main.zig, in readHttpRequest):

-        const n = try reader.interface.readSliceShort(&chunk);
+        var data: [1][]u8 = .{&chunk};
+        const n = reader.interface.readVec(&data) catch |err| switch (err) {
+            error.EndOfStream => return null,
+            else => |e| return e,
+        };

Five added lines, one deleted. No other code changes.

Verification

Reproducer from the issue draft (build main from source, run with
the unmodified ziglang/zig:0.16.0 toolchain in Docker, hit
localhost:8080 with curl):

Before
  $ curl -m 5 -i http://localhost:8080/health
  curl: (28) Operation timed out after 5001 ms with 0 bytes received

After
  $ curl -m 5 -i http://localhost:8080/health
  HTTP/1.1 200 OK
  Content-Type: application/json
  X-Request-Id: ab01511c-3cc9-4114-bec3-dd5c060b4147
  Content-Length: 70
  Connection: close

  {"status":"ok","version":"2026.3.2","active_runs":0,"total_workers":0}

Per-request timing measured in the accept loop:

Phase              Before     After
accept             0 ms       0 ms
readHttpRequest    5002 ms    1 ms
handler + write    <1 ms      <1 ms

/health and /metrics both respond in roughly 2 ms end-to-end.
zig build test --summary all is still 340/340 passing — the fix
does not regress any covered behaviour.

Closing the test-coverage gap

The reason this regression slipped through was that none of the 340
unit tests exercised the live accept-read-write loop — they all call
api.handleRequest directly. So a second commit on this branch
(e7fc120) adds an in-process integration test that:

  • binds 127.0.0.1:0, connects a libc client socket from the same
    thread (kernel's listen backlog handles the lack of a separate
    thread),
  • sends a small HTTP request with no FIN (write side stays open),
  • calls readHttpRequest on the accepted Stream,
  • asserts the call returns in under 200 ms with the request parsed
    correctly.

SO_RCVTIMEO of 2 s on the server-side socket converts a regressed
"hangs forever" into a clean failure rather than a hung CI run.
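The shape of the test described above can be sketched in Python as follows. This is an illustration of the technique, not the Zig test from e7fc120: read_request_once is a hypothetical stand-in for readHttpRequest, run_no_fin_read_test is an invented name, and settimeout plays the role that SO_RCVTIMEO plays in the real test.

```python
import socket
import time


def read_request_once(conn, chunk_size=4096):
    """Hypothetical stand-in for the reader under test (single-recv loop)."""
    buf = bytearray()
    while b"\r\n\r\n" not in buf:
        part = conn.recv(chunk_size)
        if not part:
            return None
        buf += part
    return bytes(buf)


def run_no_fin_read_test(read_fn=read_request_once, budget_s=0.2, timeout_s=2.0):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))  # port 0: the kernel picks a free port
        listener.listen(1)
        # Same-thread connect: it completes against the listen backlog
        # before accept() is ever called.
        client.connect(listener.getsockname())
        # Send the request but keep the write side open (no FIN).
        client.sendall(b"GET /health HTTP/1.1\r\nHost: localhost\r\n\r\n")
        conn, _ = listener.accept()
        try:
            # A regressed reader raises a timeout here instead of hanging.
            conn.settimeout(timeout_s)
            start = time.monotonic()
            request = read_fn(conn)
            elapsed = time.monotonic() - start
        finally:
            conn.close()
    finally:
        client.close()
        listener.close()
    assert request is not None and request.startswith(b"GET /health")
    assert elapsed < budget_s, f"read took {elapsed:.3f} s"
    return elapsed
```

Passing a fill-the-buffer reader as read_fn should make the call raise a timeout after roughly timeout_s rather than block indefinitely, which is the "clean failure" behaviour the receive timeout is there to provide.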

Verified the test catches the regression by temporarily reverting
readVec to readSliceShort:

Configuration                 Result
readSliceShort (regressed)    340 pass, 1 crash (the new test) within ~2 s
readVec (fixed)               341/341 in 1 s

The test is skipped on Windows because it reaches directly into libc
socket primitives. A platform-portable equivalent would need a
Winsock branch, which we do not need for the regression we are
guarding here.

Why I'm filing this as a PR

The original issue draft hypothesised about writer.flush() swallowing
the bytes and listed two failed workarounds. Those guesses were on the
wrong layer — the writer chain is fine, it just never gets a chance
to run because the reader stalls upstream. Once the reader is fixed,
the unmodified writer code from the Zig 0.16 migration delivers the
response correctly.

Kures added 2 commits May 8, 2026 20:52
Every HTTP request was timing out because readSliceShort blocks until
the destination buffer is fully filled or the stream reaches EOF. The
read chunk is 4096 bytes, but a typical request (e.g. GET /health) is
~84 bytes — so the reader sat waiting for 4 KB of input that never
came. After the curl client gave up and FIN'd the connection, the
reader would finally return; by then the response was useless.

Switch to readVec, which performs a single recv per call and returns
whatever the kernel had — typically the entire request in one shot.

Effect:
- Before: every endpoint hangs until client timeout, then ~5s after
  the request the response is sent into a closed socket.
- After: /health and /metrics respond in <2 ms.

zig build test still 340/340 because the test harness invokes
api.handleRequest directly and never exercises this read path. The
right structural follow-up is a black-box integration test that
spawns the binary and asserts a real HTTP response, but that's out of
scope for this fix.
…ession

The prior commit fixed the symptom (readVec instead of readSliceShort
in readHttpRequest) but left the structural gap that allowed the
regression to slip through: nothing in the unit-test suite exercises
the live accept-read-write loop. All 340 tests invoke api.handleRequest
directly, so a stall in the read step is invisible to them.

This test binds 127.0.0.1:0, connects from a libc client socket on
the same thread (kernel's listen backlog handles the race), sends a
short HTTP request without ever closing the write side, and calls
readHttpRequest on the accepted Stream. The assertion is wall-clock
under 200 ms — comfortable headroom over loopback variance, far below
the ~5 s the regression would take to surface (the read had to wait
for the buffer to fill or for an EOF from the client; neither
happens here).

A SO_RCVTIMEO of 2 s on the server-side socket converts a regressed
"hangs forever" into a clean test failure rather than a hung CI run.

Verified two ways:
- with the fix in place: 341/341 pass
- with readVec reverted to readSliceShort: this test fails (340 pass,
  1 crash) within the SO_RCVTIMEO window

Skipped on Windows because the test reaches into libc socket
primitives directly; the platform-portable equivalent would need a
Winsock branch we do not need for the regression we are guarding.