fix: add keepalive ping to POST SSE response streams#898
fix: add keepalive ping to POST SSE response streams#898whoiskatrin merged 2 commits intocloudflare:mainfrom
Conversation
The GET SSE handler sets up a 30-second keepalive interval that writes "event: ping" to prevent proxies from closing idle connections. The POST SSE handler was missing this keepalive, causing responses to be silently dropped during long-running tool calls. This adds the same keepalive pattern to the POST handler and includes tests verifying both GET and POST streams have keepalive, and that the interval is properly cleaned up. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🦋 Changeset detectedLatest commit: bbafdfc The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
/bonk review this PR |
There was a problem hiding this comment.
Review Summary
This is a clean, well-motivated bug fix. The problem is real and clearly articulated — the GET SSE handler has a 30-second keepalive ping to prevent proxies/load balancers from closing idle connections, but the POST SSE handler was missing it entirely. This causes POST response streams (e.g., tools/call) to be silently dropped during long-running operations.
Code change
The fix at packages/agents/src/mcp/worker-transport.ts:636-651 is correct:
- The
setIntervalplacement is right — it's set up immediately after creating the writer/encoder and before the stream mapping, exactly mirroring the GET handler pattern (line 350). - Cleanup is correct —
clearInterval(keepAlive)is called in thecleanup()closure, which is invoked via:closeSSEStream()(line 815) — explicit stream closesend()when all responses are ready (line 929) — normal completionclose()(line 793) — transport teardown
- The catch block in the interval handles the case where
writer.write()throws after the stream is already closed, preventing the interval from running forever. This matches the GET handler exactly.
Tests
The three test cases are well-structured:
- GET keepalive exists (regression guard for existing behavior)
- POST keepalive exists (the actual fix verification)
- POST keepalive cleanup on
closeSSEStream(verifies no leaked intervals)
The tests use vi.spyOn(globalThis, "setInterval") to verify the interval is created with the correct 30000ms delay, which is a reasonable approach for this kind of infrastructure behavior.
Changeset
Correctly included as a patch bump for agents.
One minor observation (non-blocking)
The clearIntervalSpy assertion in the cleanup test (expect(clearIntervalSpy).toHaveBeenCalled()) doesn't verify it was called with the specific interval ID from the keepalive. Since clearInterval may also be called from other internal cleanup paths, this assertion could theoretically pass for the wrong reason. A stricter assertion would capture the interval ID from the setInterval spy and assert clearIntervalSpy was called with that specific ID. This is minor — the test is still valuable as-is.
LGTM. This is a minimal, targeted fix that mirrors an existing pattern and addresses a real production issue affecting MCP clients.
|
Review submitted. Here's a summary of my findings: PR #898 — Approved This is a correct, minimal bug fix that adds a 30-second keepalive ping ( The code change is 9 lines, correctly placed, with proper cleanup in all three paths ( I noted one minor non-blocking observation: the cleanup test could be stricter by asserting |
commit: |
mattzcarey
left a comment
There was a problem hiding this comment.
LGTM not sure how I missed that
|
This could be the cause of #858 |
|
@bonk can you make sure check script passes. Make the smallest change possible and commit it. |
|
/bonk fix the npm run check issue |
There was a problem hiding this comment.
Review of PR #898: fix: add keepalive ping to POST SSE response streams
Summary
The core fix is correct and well-motivated. The POST SSE handler was missing the keepalive ping that the GET handler already has, causing proxy/LB connection drops during long-running tool calls. The 9-line change to worker-transport.ts correctly mirrors the existing GET handler pattern.
CI Failure
The check job is failing due to a TypeScript error in the new test file:
packages/agents/src/tests/mcp/transports/post-keepalive.test.ts(77,5): error TS2322:
Type 'MockInstance<{ (handler: TimerHandler, timeout?: number | undefined, ...arguments: any[]): number; ...'
is not assignable to type 'MockInstance<(this: unknown, ...args: unknown[]) => unknown>'.
The issue is at line 18/77 in the test file. The variable is declared as:
let setIntervalSpy: ReturnType<typeof vi.spyOn>;But vi.spyOn(globalThis, "setInterval") returns a more specific MockInstance type (matching setInterval's overloaded signatures in @cloudflare/workers-types), which doesn't assignably match the generic ReturnType<typeof vi.spyOn>.
Fix: Use the inferred type directly instead of the generic ReturnType:
// Option A: let TypeScript infer from the assignment (move init into declaration)
let setIntervalSpy = vi.spyOn(globalThis, "setInterval");
// Option B: explicitly type it to allow any mock instance
// biome-ignore lint: complex overload type
let setIntervalSpy: ReturnType<typeof vi.spyOn<typeof globalThis, "setInterval">>;The simplest fix is to initialize the spy at declaration time and restore it in afterEach, or use a type assertion at the assignment site:
setIntervalSpy = vi.spyOn(globalThis, "setInterval") as typeof setIntervalSpy;Code change review (worker-transport.ts)
The production code change at lines 636-651 is correct:
- The
setIntervalplacement is right — immediately after creating the writer/encoder, before the stream mapping, matching the GET handler pattern at line 350. - Cleanup is properly wired —
clearInterval(keepAlive)is called in thecleanup()closure, which is invoked from all three termination paths:closeSSEStream()(line 815),send()when all responses are ready (line 929), andclose()(line 793). - The
catchblock in the interval callback correctly handles writer errors after stream close, preventing runaway intervals.
Minor test observation (non-blocking)
In the cleanup test (line 194), expect(clearIntervalSpy).toHaveBeenCalled() doesn't verify it was called with the specific interval ID from the keepalive. A slightly stricter assertion would capture the return value from the setInterval spy and verify clearInterval was called with that ID. This is minor and the test is still valuable as-is.
Verdict
The fix itself is good — please fix the TypeScript error in the test to get CI green.
| * from closing idle connections during long-running operations. | ||
| */ | ||
| describe("WorkerTransport POST stream keepalive", () => { | ||
| let setIntervalSpy: ReturnType<typeof vi.spyOn>; |
There was a problem hiding this comment.
This type annotation causes the CI typecheck failure (TS2322). vi.spyOn(globalThis, "setInterval") returns a MockInstance with setInterval's overloaded signatures from @cloudflare/workers-types, which is not assignable to the generic ReturnType<typeof vi.spyOn>.
The simplest fix — initialize at declaration and re-spy in beforeEach:
| let setIntervalSpy: ReturnType<typeof vi.spyOn>; | |
| let setIntervalSpy = vi.spyOn(globalThis, "setInterval"); |
Then update beforeEach to restore and re-create:
beforeEach(() => {
setIntervalSpy.mockRestore();
setIntervalSpy = vi.spyOn(globalThis, "setInterval");
});This lets TypeScript infer the correct type from the vi.spyOn call and avoids the generic ReturnType mismatch.
|
Review posted successfully. Here's a summary of what I did: PR #898 Review — Posted a full review + inline suggestion on the CI-failing line. Key findings:
|
|
/bonk fix the build |
There was a problem hiding this comment.
Review Summary
Good fix addressing a real production issue. The POST SSE handler was missing the keepalive ping that the GET handler already has at line 350, causing proxy/LB connection drops during long-running MCP tool calls. The 9-line production code change is correct and minimal.
Production code (worker-transport.ts) — Looks good
The setInterval at line 636 is placed correctly — after creating the writer/encoder, before the stream mapping registration. The cleanup path is properly wired: clearInterval(keepAlive) is called in the cleanup() closure, which is invoked from all three termination paths:
closeSSEStream()(line 815) — explicit stream closesend()when all responses are ready (line 929) — normal completionclose()(line 793) — transport teardown
The catch block in the interval callback correctly handles writer errors after stream close, preventing runaway intervals. This exactly mirrors the existing GET handler pattern.
CI failure — TypeScript error in test file
The check job fails with TS2322 at line 77 of post-keepalive.test.ts. The variable setIntervalSpy is declared as ReturnType<typeof vi.spyOn> (line 18), but vi.spyOn(globalThis, "setInterval") returns a MockInstance with setInterval's overloaded signatures from @cloudflare/workers-types/experimental, which isn't assignable to the generic return type.
Note that the same ReturnType<typeof vi.spyOn> pattern works in client-connection.test.ts for console.error because console.error has a simpler type signature. The setInterval overloads from @cloudflare/workers-types make the types incompatible.
The simplest fix is to initialize the spy at declaration time so TypeScript infers the correct type, then re-spy in beforeEach. See inline suggestion below.
Test improvement suggestion (non-blocking)
The cleanup test at line 194 (expect(clearIntervalSpy).toHaveBeenCalled()) doesn't verify it was called with the specific interval ID from the keepalive. A stricter assertion would capture the return value from the setInterval spy and verify clearInterval was called with that ID. This is minor — the test is still valuable as-is.
Changeset
Correctly included as a patch bump for agents.
packages/agents/src/tests/mcp/transports/post-keepalive.test.ts
Outdated
Show resolved
Hide resolved
|
Review posted on PR #898. Here's a summary of what I found: Production code change ( CI is failing — The |
Co-authored-by: ask-bonk[bot] <249159057+ask-bonk[bot]@users.noreply.github.com>
Summary
The GET SSE handler in
WorkerTransportsets up a 30-second keepalive interval that writesevent: pingto prevent proxies and infrastructure from closing idle connections. The POST SSE handler was missing this keepalive, causing POST response streams to be silently dropped during long-running operations like MCPtools/call.This is a one-line logical fix (adding the same
setIntervalpattern already used in the GET handler) plus proper cleanup in the stream mapping.Problem
When a client sends a POST request that returns an SSE stream (e.g.,
tools/call), the response stream has no keepalive pings. During long-running tool executions (even just a few seconds), proxies, load balancers, and infrastructure can close the idle SSE connection, causing the client to never receive the response.GET handler (line 350) — has keepalive:
POST handler (line 636) — was missing keepalive entirely.
Impact
This affects any MCP server using
WorkerTransportwith Streamable HTTP transport. Clients report errors like:Cloudflare Workers analytics show these as
responseStreamDisconnectedwith short wall times (1-10s), confirming the SSE stream dies during tool execution rather than timing out.Fix
Added the same keepalive
setIntervalto the POST handler, withclearIntervalin the cleanup function. The diff is minimal — it mirrors the existing GET handler pattern exactly.Test plan
post-keepalive.test.tswith 3 tests:npx vitest run --project workers)npm run formatpasses (oxfmt)🤖 Generated with Claude Code