Skip to content

fix: add keepalive ping to POST SSE response streams#898

Merged
whoiskatrin merged 2 commits intocloudflare:mainfrom
jvg123:fix/post-stream-keepalive
Feb 13, 2026
Merged

fix: add keepalive ping to POST SSE response streams#898
whoiskatrin merged 2 commits intocloudflare:mainfrom
jvg123:fix/post-stream-keepalive

Conversation

@jvg123
Copy link
Contributor

@jvg123 jvg123 commented Feb 12, 2026

Summary

The GET SSE handler in WorkerTransport sets up a 30-second keepalive interval that writes event: ping to prevent proxies and infrastructure from closing idle connections. The POST SSE handler was missing this keepalive, causing POST response streams to be silently dropped during long-running operations like MCP tools/call.

This is a one-line logical fix (adding the same setInterval pattern already used in the GET handler) plus proper cleanup in the stream mapping.

Problem

When a client sends a POST request that returns an SSE stream (e.g., tools/call), the response stream has no keepalive pings. During long-running tool executions (even just a few seconds), proxies, load balancers, and infrastructure can close the idle SSE connection, causing the client to never receive the response.

GET handler (line 350) — has keepalive:

const keepAlive = setInterval(() => {
  try {
    writer.write(encoder.encode("event: ping\ndata: \n\n"));
  } catch {
    clearInterval(keepAlive);
  }
}, 30000);

POST handler (line 636) — was missing keepalive entirely.

Impact

This affects any MCP server using WorkerTransport with Streamable HTTP transport. Clients report errors like:

  • Claude: "No result received from client-side tool execution"
  • ChatGPT: "TypeError: fetch failed"
  • MCP SDK: "MCP error -32001: Request Timeout"

Cloudflare Workers analytics show these as responseStreamDisconnected with short wall times (1-10s), confirming the SSE stream dies during tool execution rather than timing out.

Fix

Added the same keepalive setInterval to the POST handler, with clearInterval in the cleanup function. The diff is minimal — it mirrors the existing GET handler pattern exactly.

Test plan

  • Added post-keepalive.test.ts with 3 tests:
    • GET SSE stream sets up keepalive interval (existing behavior, passes before and after)
    • POST SSE stream sets up keepalive interval (failed before fix, passes after)
    • POST keepalive cleanup clears the interval on stream close
  • All 467 existing tests still pass (npx vitest run --project workers)
  • npm run format passes (oxfmt)

🤖 Generated with Claude Code

The GET SSE handler sets up a 30-second keepalive interval that writes
"event: ping" to prevent proxies from closing idle connections. The POST
SSE handler was missing this keepalive, causing responses to be silently
dropped during long-running tool calls.

This adds the same keepalive pattern to the POST handler and includes
tests verifying both GET and POST streams have keepalive, and that the
interval is properly cleaned up.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@changeset-bot
Copy link

changeset-bot bot commented Feb 12, 2026

🦋 Changeset detected

Latest commit: bbafdfc

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
agents Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@whoiskatrin
Copy link
Contributor

/bonk review this PR

Copy link
Contributor

@ask-bonk ask-bonk bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

This is a clean, well-motivated bug fix. The problem is real and clearly articulated — the GET SSE handler has a 30-second keepalive ping to prevent proxies/load balancers from closing idle connections, but the POST SSE handler was missing it entirely. This causes POST response streams (e.g., tools/call) to be silently dropped during long-running operations.

Code change

The fix at packages/agents/src/mcp/worker-transport.ts:636-651 is correct:

  1. The setInterval placement is right — it's set up immediately after creating the writer/encoder and before the stream mapping, exactly mirroring the GET handler pattern (line 350).
  2. Cleanup is correctclearInterval(keepAlive) is called in the cleanup() closure, which is invoked via:
    • closeSSEStream() (line 815) — explicit stream close
    • send() when all responses are ready (line 929) — normal completion
    • close() (line 793) — transport teardown
  3. The catch block in the interval handles the case where writer.write() throws after the stream is already closed, preventing the interval from running forever. This matches the GET handler exactly.

Tests

The three test cases are well-structured:

  • GET keepalive exists (regression guard for existing behavior)
  • POST keepalive exists (the actual fix verification)
  • POST keepalive cleanup on closeSSEStream (verifies no leaked intervals)

The tests use vi.spyOn(globalThis, "setInterval") to verify the interval is created with the correct 30000ms delay, which is a reasonable approach for this kind of infrastructure behavior.

Changeset

Correctly included as a patch bump for agents.

One minor observation (non-blocking)

The clearIntervalSpy assertion in the cleanup test (expect(clearIntervalSpy).toHaveBeenCalled()) doesn't verify it was called with the specific interval ID from the keepalive. Since clearInterval may also be called from other internal cleanup paths, this assertion could theoretically pass for the wrong reason. A stricter assertion would capture the interval ID from the setInterval spy and assert clearIntervalSpy was called with that specific ID. This is minor — the test is still valuable as-is.

LGTM. This is a minimal, targeted fix that mirrors an existing pattern and addresses a real production issue affecting MCP clients.

@ask-bonk
Copy link
Contributor

ask-bonk bot commented Feb 13, 2026

Review submitted. Here's a summary of my findings:

PR #898 — Approved

This is a correct, minimal bug fix that adds a 30-second keepalive ping (event: ping) to POST SSE response streams in WorkerTransport, mirroring the existing pattern already used in the GET handler. The fix addresses a real production issue where proxies and load balancers silently drop idle POST SSE connections during long-running MCP tool calls.

The code change is 9 lines, correctly placed, with proper cleanup in all three paths (closeSSEStream, send completion, and close). Tests are well-structured and the changeset is included as a patch bump.

I noted one minor non-blocking observation: the cleanup test could be stricter by asserting clearInterval was called with the specific interval ID rather than just asserting it was called at all.

github run

@pkg-pr-new
Copy link

pkg-pr-new bot commented Feb 13, 2026

Open in StackBlitz

npm i https://pkg.pr.new/cloudflare/agents@898

commit: bbafdfc

Copy link
Contributor

@mattzcarey mattzcarey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM not sure how I missed that

@mattzcarey
Copy link
Contributor

This could be the cause of #858

@mattzcarey
Copy link
Contributor

@bonk can you make sure check script passes. Make the smallest change possible and commit it.

@whoiskatrin
Copy link
Contributor

/bonk fix the npm run check issue

Copy link
Contributor

@ask-bonk ask-bonk bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of PR #898: fix: add keepalive ping to POST SSE response streams

Summary

The core fix is correct and well-motivated. The POST SSE handler was missing the keepalive ping that the GET handler already has, causing proxy/LB connection drops during long-running tool calls. The 9-line change to worker-transport.ts correctly mirrors the existing GET handler pattern.

CI Failure

The check job is failing due to a TypeScript error in the new test file:

packages/agents/src/tests/mcp/transports/post-keepalive.test.ts(77,5): error TS2322:
Type 'MockInstance<{ (handler: TimerHandler, timeout?: number | undefined, ...arguments: any[]): number; ...' 
is not assignable to type 'MockInstance<(this: unknown, ...args: unknown[]) => unknown>'.

The issue is at line 18/77 in the test file. The variable is declared as:

let setIntervalSpy: ReturnType<typeof vi.spyOn>;

But vi.spyOn(globalThis, "setInterval") returns a more specific MockInstance type (matching setInterval's overloaded signatures in @cloudflare/workers-types), which doesn't assignably match the generic ReturnType<typeof vi.spyOn>.

Fix: Use the inferred type directly instead of the generic ReturnType:

// Option A: let TypeScript infer from the assignment (move init into declaration)
let setIntervalSpy = vi.spyOn(globalThis, "setInterval");

// Option B: explicitly type it to allow any mock instance
// biome-ignore lint: complex overload type
let setIntervalSpy: ReturnType<typeof vi.spyOn<typeof globalThis, "setInterval">>;

The simplest fix is to initialize the spy at declaration time and restore it in afterEach, or use a type assertion at the assignment site:

setIntervalSpy = vi.spyOn(globalThis, "setInterval") as typeof setIntervalSpy;

Code change review (worker-transport.ts)

The production code change at lines 636-651 is correct:

  1. The setInterval placement is right — immediately after creating the writer/encoder, before the stream mapping, matching the GET handler pattern at line 350.
  2. Cleanup is properly wired — clearInterval(keepAlive) is called in the cleanup() closure, which is invoked from all three termination paths: closeSSEStream() (line 815), send() when all responses are ready (line 929), and close() (line 793).
  3. The catch block in the interval callback correctly handles writer errors after stream close, preventing runaway intervals.

Minor test observation (non-blocking)

In the cleanup test (line 194), expect(clearIntervalSpy).toHaveBeenCalled() doesn't verify it was called with the specific interval ID from the keepalive. A slightly stricter assertion would capture the return value from the setInterval spy and verify clearInterval was called with that ID. This is minor and the test is still valuable as-is.

Verdict

The fix itself is good — please fix the TypeScript error in the test to get CI green.

* from closing idle connections during long-running operations.
*/
describe("WorkerTransport POST stream keepalive", () => {
let setIntervalSpy: ReturnType<typeof vi.spyOn>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type annotation causes the CI typecheck failure (TS2322). vi.spyOn(globalThis, "setInterval") returns a MockInstance with setInterval's overloaded signatures from @cloudflare/workers-types, which is not assignable to the generic ReturnType<typeof vi.spyOn>.

The simplest fix — initialize at declaration and re-spy in beforeEach:

Suggested change
let setIntervalSpy: ReturnType<typeof vi.spyOn>;
let setIntervalSpy = vi.spyOn(globalThis, "setInterval");

Then update beforeEach to restore and re-create:

beforeEach(() => {
  setIntervalSpy.mockRestore();
  setIntervalSpy = vi.spyOn(globalThis, "setInterval");
});

This lets TypeScript infer the correct type from the vi.spyOn call and avoids the generic ReturnType mismatch.

@ask-bonk
Copy link
Contributor

ask-bonk bot commented Feb 13, 2026

Review posted successfully. Here's a summary of what I did:

PR #898 Review — Posted a full review + inline suggestion on the CI-failing line.

Key findings:

  1. Production code change is correct — The 9-line addition to worker-transport.ts:636-651 correctly adds a 30s keepalive ping to POST SSE streams, matching the existing GET handler pattern. Cleanup is properly wired through all three termination paths (closeSSEStream, send completion, close).

  2. CI is failing due to a TypeScript error in the test file at line 18. The variable setIntervalSpy is typed as ReturnType<typeof vi.spyOn>, but vi.spyOn(globalThis, "setInterval") returns a more specific MockInstance type from @cloudflare/workers-types that isn't assignable to the generic type. I posted an inline suggestion with a fix: initialize the spy at declaration time so TypeScript can infer the correct type.

  3. Minor non-blocking observation — The cleanup test could assert that clearInterval was called with the specific interval ID rather than just asserting it was called at all.

github run

@whoiskatrin
Copy link
Contributor

/bonk fix the build

Copy link
Contributor

@ask-bonk ask-bonk bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Good fix addressing a real production issue. The POST SSE handler was missing the keepalive ping that the GET handler already has at line 350, causing proxy/LB connection drops during long-running MCP tool calls. The 9-line production code change is correct and minimal.

Production code (worker-transport.ts) — Looks good

The setInterval at line 636 is placed correctly — after creating the writer/encoder, before the stream mapping registration. The cleanup path is properly wired: clearInterval(keepAlive) is called in the cleanup() closure, which is invoked from all three termination paths:

  • closeSSEStream() (line 815) — explicit stream close
  • send() when all responses are ready (line 929) — normal completion
  • close() (line 793) — transport teardown

The catch block in the interval callback correctly handles writer errors after stream close, preventing runaway intervals. This exactly mirrors the existing GET handler pattern.

CI failure — TypeScript error in test file

The check job fails with TS2322 at line 77 of post-keepalive.test.ts. The variable setIntervalSpy is declared as ReturnType<typeof vi.spyOn> (line 18), but vi.spyOn(globalThis, "setInterval") returns a MockInstance with setInterval's overloaded signatures from @cloudflare/workers-types/experimental, which isn't assignable to the generic return type.

Note that the same ReturnType<typeof vi.spyOn> pattern works in client-connection.test.ts for console.error because console.error has a simpler type signature. The setInterval overloads from @cloudflare/workers-types make the types incompatible.

The simplest fix is to initialize the spy at declaration time so TypeScript infers the correct type, then re-spy in beforeEach. See inline suggestion below.

Test improvement suggestion (non-blocking)

The cleanup test at line 194 (expect(clearIntervalSpy).toHaveBeenCalled()) doesn't verify it was called with the specific interval ID from the keepalive. A stricter assertion would capture the return value from the setInterval spy and verify clearInterval was called with that ID. This is minor — the test is still valuable as-is.

Changeset

Correctly included as a patch bump for agents.

@ask-bonk
Copy link
Contributor

ask-bonk bot commented Feb 13, 2026

Review posted on PR #898. Here's a summary of what I found:

Production code change (worker-transport.ts:636-651) — Correct and minimal. Adds a 30s keepalive ping to POST SSE streams, mirroring the existing GET handler pattern at line 350. Cleanup is properly wired through all three termination paths (closeSSEStream, send, close).

CI is failing — The check job fails with a TS2322 TypeScript error at line 77 of post-keepalive.test.ts. The setIntervalSpy variable is typed as ReturnType<typeof vi.spyOn>, but vi.spyOn(globalThis, "setInterval") returns a more specific MockInstance type due to setInterval's overloaded signatures in @cloudflare/workers-types/experimental. I posted an inline suggestion: initialize the spy at declaration time (let setIntervalSpy = vi.spyOn(...)) so TypeScript can infer the correct type, then restore and re-create in beforeEach.

github run

Co-authored-by: ask-bonk[bot] <249159057+ask-bonk[bot]@users.noreply.github.com>
@whoiskatrin whoiskatrin merged commit cd2d34f into cloudflare:main Feb 13, 2026
4 checks passed
@github-actions github-actions bot mentioned this pull request Feb 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants