StreamableHTTPClientTransport: 2-retry SSE reconnect ceiling + silent-success after exhaustion #2098

## Summary

`StreamableHTTPClientTransport` gives up on the GET-SSE response stream after only **2 reconnect retries** (the hard-coded `DEFAULT_STREAMABLE_HTTP_RECONNECTION_OPTIONS.maxRetries: 2`), then leaves the transport in a broken state where `POST` requests succeed at the server but their **JSON-RPC responses can never be delivered to the client** because the SSE channel is dead. Every subsequent tool call hits the client's request `timeoutMs` and surfaces as `The operation timed out.`, even though the server processed the request fine.

This is a **silent-success failure mode** — the user sees timeouts and the server sees successes. Restarting either side (or sending a SIGHUP / `systemctl restart`) is the only recovery, because the dead transport never auto-reconnects.

Witnessed in production on 2026-05-14 against an MCP server fronted by Cloudflare Tunnel + nginx. Cloudflare Tunnel's default ~100s SSE idle timeout dropped the response stream while a tool call was in flight; the SDK's 2-retry SSE reopen failed (with empty error context, see Bug 2 below); and from that point on every subsequent POST silently broke.

## Source

`packages/client/src/client/streamableHttp.ts`, lines 21-26 of latest `main`:

```ts
const DEFAULT_STREAMABLE_HTTP_RECONNECTION_OPTIONS: StreamableHTTPReconnectionOptions = {
    initialReconnectionDelay: 1000,
    maxReconnectionDelay: 30_000,
    reconnectionDelayGrowFactor: 1.5,
    maxRetries: 2  // ← only 2 retries before permanent give-up
};
```

After `maxRetries` is exhausted, `_scheduleReconnection` stops scheduling but the transport's POST path keeps working — sending requests into a void.

## Smoking-gun log (Claude Code 2.1.140 + this SDK)

```
03:57:37  HTTP connection dropped after 191s uptime
03:58:06  Connection error: Streamable HTTP error: Failed to open SSE stream: <none>
03:58:06  Connection error: Failed to reconnect SSE stream: Streamable HTTP error: Failed to open SSE stream: <none>
03:58:06  Terminal connection error 1/3
04:00:12  Connection error: Streamable HTTP error: Failed to open SSE stream: <none>
04:00:12  Connection error: Failed to reconnect SSE stream: ...
04:00:12  Connection error: Maximum reconnection attempts (2) exceeded.
04:00:12  SSE GET-stream reconnection exhausted; leaving transport up (POST still works)
[every subsequent tool call hits client timeoutMs because POSTs land server-side
 fine but responses can't be delivered through the dead SSE channel]
```

(The `leaving transport up (POST still works)` line is from Claude Code's wrapper, but the underlying give-up decision and the missing reconnect-on-next-POST behaviour are in this SDK.)

## Bug 1 — `maxRetries: 2` is too low for production

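A single intermediary blip (Cloudflare Tunnel idle timeout, nginx upstream timeout, mobile NAT eviction, Wi-Fi handoff) often needs more than 2 attempts to recover, and with the current defaults those 2 retries are spent after only 1s + 1.5s = 2.5s of backoff. A more reasonable default would be `maxRetries: 10` with the existing 1.5× backoff capped at 30s. Assuming each delay is `initialReconnectionDelay × 1.5^attempt` capped at `maxReconnectionDelay`, that is about 105 seconds of cumulative backoff delay, plus the time each attempt takes to fail (the log above shows up to about two minutes between failures). That gives several minutes of patient retries before giving up, enough to ride out most outage windows.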
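For anyone who wants to check the backoff arithmetic, here is a minimal sketch. It mirrors the option shape quoted in the Source section and assumes the delay before retry `n` (0-based) is `initialReconnectionDelay × reconnectionDelayGrowFactor^n`, capped at `maxReconnectionDelay`. The helper names `backoffSchedule` and `totalBackoffMs` are made up for this example and are not SDK functions. It counts only the waits between attempts; each attempt's own time-to-fail comes on top.

```typescript
// Local mirror of the option shape quoted in the Source section (delays in ms).
interface ReconnectionOptions {
    initialReconnectionDelay: number;
    maxReconnectionDelay: number;
    reconnectionDelayGrowFactor: number;
    maxRetries: number;
}

// ASSUMPTION: delay before retry n (0-based) = initial * factor^n, capped at max.
// Hypothetical helper, not part of the SDK.
function backoffSchedule(opts: ReconnectionOptions): number[] {
    const delays: number[] = [];
    for (let attempt = 0; attempt < opts.maxRetries; attempt++) {
        const raw = opts.initialReconnectionDelay * Math.pow(opts.reconnectionDelayGrowFactor, attempt);
        delays.push(Math.min(raw, opts.maxReconnectionDelay));
    }
    return delays;
}

// Sum of all waits between attempts; excludes the time each attempt takes to fail.
function totalBackoffMs(opts: ReconnectionOptions): number {
    return backoffSchedule(opts).reduce((sum, delay) => sum + delay, 0);
}

const current: ReconnectionOptions = {
    initialReconnectionDelay: 1000,
    maxReconnectionDelay: 30_000,
    reconnectionDelayGrowFactor: 1.5,
    maxRetries: 2
};
const proposed: ReconnectionOptions = { ...current, maxRetries: 10 };

console.log(`maxRetries=2:  ${(totalBackoffMs(current) / 1000).toFixed(1)}s of backoff`);
console.log(`maxRetries=10: ${(totalBackoffMs(proposed) / 1000).toFixed(1)}s of backoff`);
```

Under that assumption, only the tenth delay hits the 30s cap, so raising `maxRetries` further adds a flat 30s of patience per extra retry.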

## Bug 2 — Empty error context

In `Failed to open SSE stream: <none>`, the literal string `<none>` (or the underlying empty `response.statusText`) is captured instead of the real error. Anyone debugging this can't tell whether it was a network failure, a 5xx, an auth issue, or a closed socket. Stringify with something like `err?.message || err?.statusText || err?.name || "unknown"`. Note `||` rather than `??`: an empty `statusText` is not nullish, so `??` would pass the empty string straight through.
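A rough sketch of what a more informative stringifier could look like. `describeError` is a hypothetical helper, not an SDK function, and the field names it probes (`status`, `statusText`, `message`, `name`) are simply the ones you would expect on a `Response` or an `Error`. It skips empty strings explicitly, so an empty `statusText` cannot win over a usable status code or message.

```typescript
// Hypothetical helper for illustration; not part of the SDK.
// Accepts anything that might be thrown or returned: Error, Response-like object, string.
function describeError(err: unknown): string {
    if (typeof err === "string" && err.length > 0) {
        return err;
    }
    if (typeof err === "object" && err !== null) {
        const e = err as { status?: unknown; statusText?: unknown; message?: unknown; name?: unknown };
        const parts: string[] = [];
        // A numeric status is useful even when statusText is empty.
        if (typeof e.status === "number") parts.push(`HTTP ${e.status}`);
        if (typeof e.statusText === "string" && e.statusText.length > 0) parts.push(e.statusText);
        if (typeof e.message === "string" && e.message.length > 0) parts.push(e.message);
        // Fall back to the error class name only when nothing better is available.
        if (parts.length === 0 && typeof e.name === "string" && e.name.length > 0) parts.push(e.name);
        if (parts.length > 0) return parts.join(" ");
    }
    return "unknown";
}

console.log(describeError({ status: 502, statusText: "" }));
console.log(describeError(new Error("socket hang up")));
```

Including the numeric status alone would already distinguish a 401 from a 502 from a socket-level failure in the log above.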

## Bug 3 — Silent-success after exhaustion (the worst part)

After the SSE GET stream is permanently dead, the transport should EITHER:

- **(A) Reset itself** — mark `_sessionId = undefined`, tear down state, so the next POST attempt re-establishes a fresh transport from scratch. Caller sees one failed call (clear error: "transport reset"), then everything works.
- **(B) Fail-fast on subsequent POSTs** — surface a clear `TransportClosed` error so callers can decide to reconnect or surface to the user. Better than the current "POST returns 202 but the response never arrives" pattern.
- **(C) Keep retrying SSE indefinitely** with the existing exponential backoff (1s, 1.5s, 2.25s, ... growing 1.5× per attempt, capped at 30s). Eventually the intermediary settles and a reopen succeeds.

The current behaviour (POSTs work, responses silently lost) is the worst of all worlds.
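To make option (B) concrete, here is a minimal, self-contained sketch of the fail-fast idea. Everything in it is hypothetical: `TransportClosedError`, `ResponseChannelGuard`, and `guardedSend` are names invented for this example, and the real transport would need to wire `markDead` into the place where reconnection currently gives up. It is meant to show the shape of the behaviour, not a proposed patch.

```typescript
// Hypothetical error type for illustration; not an SDK export.
class TransportClosedError extends Error {
    constructor(reason: string) {
        super(`Transport closed: ${reason}`);
        this.name = "TransportClosedError";
    }
}

// Tracks whether the response (SSE) channel is known to be dead.
class ResponseChannelGuard {
    private deadReason: string | undefined = undefined;

    // Would be called where reconnection gives up today (retries exhausted).
    markDead(reason: string): void {
        this.deadReason = reason;
    }

    // Would be called once a fresh SSE stream has been opened successfully.
    markAlive(): void {
        this.deadReason = undefined;
    }

    // Throws synchronously when responses could not be delivered.
    assertUsable(): void {
        if (this.deadReason !== undefined) {
            throw new TransportClosedError(this.deadReason);
        }
    }
}

// Wraps any POST-like send so it rejects immediately instead of timing out later.
async function guardedSend<T>(
    guard: ResponseChannelGuard,
    post: (message: T) => Promise<void>,
    message: T
): Promise<void> {
    guard.assertUsable();
    await post(message);
}
```

With something like this in place, a caller would get a named error on the very next request after exhaustion rather than waiting out `timeoutMs`, and could decide whether to rebuild the transport or surface the failure to the user.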

## Reproducer

1. Start any MCP server with the StreamableHTTP transport behind a proxy whose SSE idle timeout is shorter than the gap between your tool calls (Cloudflare Tunnel ~100s default; nginx with `proxy_read_timeout` shorter than your spacing; corporate proxies typically 60-120s).
2. Connect a TS-SDK-based client (Claude Code, Inspector, custom).
3. Stay idle longer than the proxy's SSE timeout.
4. Try to call a tool. The POST lands at the server (you'll see it in the nginx access log returning 200/202) and the server processes it, but the client times out at `timeoutMs` (default 60s).
5. Observe `Maximum reconnection attempts (2) exceeded` in the client transport log. This is the trigger.
6. Every subsequent tool call has the same outcome until restart.

## Server-side workaround (already deployed in our environment)

Add an SSE keepalive heartbeat on the server: write `: keepalive\n\n` (an SSE comment line; see "Interpreting an event stream", §9.2.6 of the WHATWG HTML Standard) every ~25s on the GET response stream so the intermediary never sees the connection as idle. SSE clients ignore comment lines per the spec, so this cannot corrupt a real notification message.

We shipped this in our `mcp-core` wrapper (PR for context: github.com/CloudIngenium/Knowledge-Hub/pull/698). 25s is well under all common intermediary idle timeouts (Cloudflare Tunnel ~100s, nginx default 60s, corporate proxies typically 60-120s) and costs 13 bytes per interval. `StreamableHTTPServerTransport` could do this too, but the **client-side bug remains**: any other cause of a dropped stream (TCP RST, client suspend/resume, network reachability flap) still triggers the same unrecoverable wedge.
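For reference, the heartbeat amounts to very little code. The sketch below is not the code from our wrapper; `SseWritable`, `writeKeepalive`, and `startSseKeepalive` are illustrative names, and the stream interface is deliberately minimal (anything with a `write(string)` method, which is the shape an HTTP response object typically has).

```typescript
// Minimal shape of a writable SSE response stream (illustrative assumption).
interface SseWritable {
    write(chunk: string): boolean;
}

// An SSE comment line followed by a blank line; clients ignore it.
const KEEPALIVE_FRAME = ": keepalive\n\n";

// Writes a single keepalive comment to the stream.
function writeKeepalive(stream: SseWritable): void {
    stream.write(KEEPALIVE_FRAME);
}

// Starts a periodic heartbeat and returns a function that stops it.
// 25s matches the interval described above; adjust to the shortest idle timeout in the path.
function startSseKeepalive(stream: SseWritable, intervalMs: number = 25_000): () => void {
    const timer: ReturnType<typeof setInterval> = setInterval(() => writeKeepalive(stream), intervalMs);
    return () => clearInterval(timer);
}
```

The returned stop function should be called when the GET stream closes (for example from the response's close handler), otherwise the timer keeps writing to a finished response.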

## Related issues

- #812 — same idle-timeout pattern, closed without a maintainer-led fix (a comment recommended SSE keepalive + #849, but #849 is unrelated to this and was itself closed unmerged).
- #1771 — adjacent: notifications silently dropped when no GET SSE channel; tracks the same "no-one-is-listening" class of bug.
- #949 — adjacent: open issue about "TypeError: terminated" on SSE disconnect, possibly the root cause that triggers the empty-error in this report.

## Environment

- `@modelcontextprotocol/sdk` v1.26.0 (also confirmed present at v1.29.0 / current `main`).
- Claude Code 2.1.140 (claude-vscode wrapper around this SDK).
- Node 24.3.0, Linux x86_64.
- MCP server: behind nginx + Cloudflare Tunnel + Azure AD JWT auth.
