fix: cancel in-flight handlers when transport closes in server.run() #2306
Conversation
When the transport closes (stdin EOF, client disconnect) while a request handler is still running, `server.run()`'s task group joins on the handler instead of cancelling it. The handler eventually finishes, tries to send its response through a write stream that `_receive_loop` already closed, and `server.run()` crashes with `ClosedResourceError` wrapped in a triple-nested `ExceptionGroup`.

The fix cancels the task group when the `incoming_messages` loop ends. Handlers receive `CancelledError` and can clean up in `finally` blocks.

The existing `CancelledError` catch in `_handle_request` (added for `CancelledNotification` handling in #1153) now distinguishes the two cancellation sources: `responder.cancel()` already sent an error response and we skip the duplicate; transport-close cancellation is re-raised so the task group swallows it.

Github-Issue: #526
No bugs found, but this touches core server infrastructure (the main server.run() loop and the CancelledError handling in _handle_request), with subtle cancel-scope semantics distinguishing two cancellation sources — worth a human confirming the anyio scope behavior is correct.
Extended reasoning...
Overview
This PR fixes a real bug where server.run() would crash with ClosedResourceError when the transport closes while a handler is still running. It makes three changes: (1) adds tg.cancel_scope.cancel() in a finally block in run() so in-flight handlers are cancelled instead of joined, (2) distinguishes CancelledNotification cancellation from TG/transport-close cancellation in _handle_request via message.cancelled (which checks the responder's own cancel scope, not the TG's outer scope), and (3) removes a # pragma: no branch that is no longer valid since _completed can now be False when a handler is cancelled before calling respond().
Security risks
No security risks identified. The change is purely about lifecycle/cleanup behavior.
Level of scrutiny
This modifies the core server.run() message processing loop and the cancellation handling in _handle_request — both critical code paths for every MCP server. The PR author notes this also affects streamable HTTP when terminate() fires mid-request. The cancel-scope distinction (responder's _cancel_scope.cancel_called vs the TG's outer cancel scope) is correct based on anyio semantics, but a human should verify this understanding since getting it wrong could cause handlers to swallow cancellation they should propagate, or vice versa.
Other factors
The PR includes a well-designed test that drives server.run() with raw memory streams to avoid InMemoryTransport masking the bug. The test verifies both that the handler gets cancelled and that server.run() returns cleanly. No CODEOWNERS file exists. The PR description is thorough and references related issues across multiple SDK implementations. No bugs were found by automated analysis.
Two additional races in the same transport-close window as the previous commit, both triggered when handlers are blocked on server-to-client requests (sampling, roots, elicitation) at the moment the transport closes:

1. `_receive_loop`'s `finally` iterates `_response_streams.items()` with await checkpoints inside the loop. The woken handler's `send_request` `finally` pops from that dict before the iterator's next `__next__()`, raising `RuntimeError: dictionary changed size during iteration`. Fix: snapshot with `list()` before iterating.
2. The woken handler's `send_request` raises `MCPError` (CONNECTION_CLOSED), which `_handle_request` catches and converts to an error response. It then falls through to `message.respond()` against a write stream that `_receive_loop` already closed. Fix: catch `ClosedResourceError` and drop the response.

Both reproduce deterministically with two handlers blocked on `list_roots()` when `to_server` is closed. A single test covers both: it fails 20/20 with either fix reverted and passes 50/50 with both.
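Race 1 reduces to a plain-Python pitfall, no anyio needed to reproduce the dict behavior (names are illustrative, not the SDK's):

```python
def close_all_unsafe(streams):
    # Iterating the live dict: any pop during the loop breaks iteration.
    for request_id, stream in streams.items():
        # In the real code, a woken handler pops its own entry at an
        # await checkpoint here; we simulate that concurrent pop inline.
        streams.pop("req-2", None)

def close_all_safe(streams):
    # Snapshot first, as the fix does, so concurrent pops are harmless.
    for request_id, stream in list(streams.items()):
        streams.pop("req-2", None)

unsafe_failed = False
try:
    close_all_unsafe({"req-1": object(), "req-2": object()})
except RuntimeError:
    unsafe_failed = True  # "dictionary changed size during iteration"

close_all_safe({"req-1": object(), "req-2": object()})  # completes cleanly
```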
src/mcp/server/lowlevel/server.py
```python
except anyio.ClosedResourceError:
    # Transport closed between handler unblocking and respond. Happens
    # when _receive_loop's finally wakes a handler blocked on
    # send_request: the handler runs to respond() before run()'s TG
    # cancel fires, but after _receive_loop closed _write_stream.
    logger.debug("Response for %s dropped - transport closed", message.request_id)
```
🟡 The except anyio.ClosedResourceError at line 495 should also catch anyio.BrokenResourceError. When terminate() in the streamable HTTP transport closes _write_stream_reader (receive end, line 787) before _receive_loop exits its async with self._write_stream block (which closes the send end), a handler that reaches respond() in this window gets BrokenResourceError rather than ClosedResourceError. Fix: except (anyio.BrokenResourceError, anyio.ClosedResourceError):.
Extended reasoning...
What the bug is
The new defensive except anyio.ClosedResourceError around message.respond(response) at line 495 only catches ClosedResourceError, but MemoryObjectSendStream.send() can also raise BrokenResourceError when the receive end of the stream has been closed while the send end remains open. These are semantically different exceptions in anyio: ClosedResourceError means you closed this end while BrokenResourceError means the other end was closed.
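The two exception types are easy to confirm against anyio's memory object streams (a standalone check, independent of the SDK):

```python
import anyio

outcomes = {}

async def main():
    # Closing our own (send) end: send() raises ClosedResourceError.
    send1, recv1 = anyio.create_memory_object_stream(1)
    await send1.aclose()
    try:
        await send1.send("x")
    except anyio.ClosedResourceError:
        outcomes["own_end_closed"] = "ClosedResourceError"

    # Closing the peer (receive) end: send() raises BrokenResourceError.
    send2, recv2 = anyio.create_memory_object_stream(1)
    await recv2.aclose()
    try:
        await send2.send("x")
    except anyio.BrokenResourceError:
        outcomes["peer_end_closed"] = "BrokenResourceError"

anyio.run(main)
```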
The specific code path
In the streamable HTTP transport, terminate() (line 763) closes streams in this order: (1) _read_stream_writer (line 783), causing _receive_loop to see EOF; (2) _write_stream_reader (line 787), closing the receive end of the write stream; (3) _write_stream (line 789), closing the send end. Meanwhile, _receive_loop's finally block (session.py line 418-431) sends CONNECTION_CLOSED errors to pending response streams, waking handlers blocked on send_request(). The woken handler catches the MCPError and falls through to respond() which calls _write_stream.send().
Step-by-step proof
1. A handler is blocked in `send_request()` (e.g., `ctx.session.list_roots()`).
2. `terminate()` fires from a DELETE request, closes `_read_stream_writer`.
3. `_receive_loop` sees EOF, enters its finally block, sends CONNECTION_CLOSED to the response stream, waking the handler.
4. `terminate()` continues and closes `_write_stream_reader` (receive end) at line 787.
5. The handler wakes, catches `MCPError` from the CONNECTION_CLOSED response, falls through to `respond()` at line 495.
6. `respond()` calls `_send_response()` then `_write_stream.send()`. Inside `send()`, anyio calls `await checkpoint()` then `send_nowait()`.
7. `send_nowait()` checks `not self._state.open_receive_channels` and finds it True (receive end closed at step 4), raising `BrokenResourceError`.
8. The `except anyio.ClosedResourceError` at line 495 does not catch `BrokenResourceError`. The exception propagates uncaught, crashing `server.run()` with an `ExceptionGroup`.
Addressing the refutation
The refutation argues no transport closes the receive end independently during the handler window. This is incorrect for the terminate() path: terminate() is called from an HTTP DELETE handler running in a separate task (the ASGI scope), so it runs concurrently with _receive_loop and the handlers. Steps 4 and 5-6 above interleave at any checkpoint. The refutation correctly notes that the cleanup at line 1050 runs after server.run() returns, but terminate() at line 787 is a different code path that runs from the DELETE handler during server.run(). The refutation also notes the streamable_http.py:1018 pattern is on a different stream (request_streams), which is true, but the exception semantics are the same: the codebase acknowledges that send() on a memory stream can raise either exception type.
Impact and fix
The practical likelihood is low because the race window is narrow: the handler must reach _write_stream.send()'s send_nowait() after terminate() closes the receive end but before _receive_loop exits its async with (which closes the send end). However, the consequence (server crash with ExceptionGroup) is exactly what this PR aims to prevent, so the defense should be complete. The fix is trivially adding BrokenResourceError to the except clause: except (anyio.BrokenResourceError, anyio.ClosedResourceError):.
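A minimal sketch of the complete defense, assuming anyio memory streams (`safe_respond` is a hypothetical helper, not the SDK's code):

```python
import anyio

async def safe_respond(write_stream, response):
    """Send a response, dropping it if the transport is gone.

    Catches both closure modes: ClosedResourceError (our send end was
    closed first, the stdio path) and BrokenResourceError (the peer
    closed the receive end first, the terminate() path).
    """
    try:
        await write_stream.send(response)
        return True
    except (anyio.BrokenResourceError, anyio.ClosedResourceError):
        return False

delivered = {}

async def main():
    send, recv = anyio.create_memory_object_stream(1)
    await recv.aclose()  # peer end closed, as terminate() does mid-window
    delivered["result"] = await safe_respond(send, {"id": 1})

anyio.run(main)
```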
Python 3.14's compiler attributes the async trampoline's CLEANUP_THROW instructions (for the try-body's await) to the next physical line of code, which was the else body. coverage.py traced a phantom line event there, tripping strict-no-cover even though the else never runs. Moving the try/respond after the if/else avoids the misattribution and also deduplicates the two respond() calls.
streamable_http's terminate() closes _write_stream_reader (the receive end) before _write_stream (the send end). A handler reaching respond() between those two closes gets BrokenResourceError (peer end closed) rather than ClosedResourceError (our end closed). The stdio path only ever hits ClosedResourceError because _receive_loop's async-with closes the send end.
When the transport closes mid-request, `server.run()` now cancels in-flight handlers instead of waiting for them.

Motivation and Context

Discovered while investigating #2231. That issue's stated symptom (server survives parent death with no tool running) does not reproduce — stdin EOF propagates cleanly through `anyio.wrap_file` + `TextIOWrapper` on all tested Python versions (3.10–3.14). But the investigation surfaced a real adjacent bug:

When stdin closes while a tool handler is running, `server.run()`'s task group joins on the handler rather than cancelling it. The handler eventually finishes, tries to send its response through a `_write_stream` that `_receive_loop` already closed, and `server.run()` crashes with `ClosedResourceError` wrapped in a triple-nested `ExceptionGroup`.

In practice: a stdio server with a 3-second tool gets SIGTERM'd at 2s by the client's graceful-shutdown escalation (`PROCESS_TERMINATION_TIMEOUT`), so `finally` blocks never run. With this fix, the handler gets `CancelledError` immediately and cleanup runs.

This is the same bug class as #2257 (merged last week), which fixed one specific write-after-close in `_handle_message`. This closes the general case.

The existing `CancelledError` catch in `_handle_request` (added in #1153 for `CancelledNotification`) now distinguishes the two cancellation sources via `message.cancelled` — which checks the responder's own cancel scope, set only by `responder.cancel()`. TG cancellation is an outer scope and doesn't set it.

May help some #526 cases, but the long-lived orphans reported there are more consistent with the client never closing stdin (a client-side bug).

Go SDK has the same wait-then-crash behavior (`conn.go:167`); C# fixed it in modelcontextprotocol/csharp-sdk#226.

How Has This Been Tested?

`tests/server/test_cancel_handling.py` drives `server.run()` with raw memory streams (can't use `InMemoryTransport` — it wraps `server.run()` in its own `finally: tg.cancel_scope.cancel()`, which masks the bug). Fails with `TimeoutError` on main, passes with the fix. `test_server_remains_functional_after_cancel` covers the `CancelledNotification` path unchanged. Also tested manually (`kill -9` parent with tool in progress).

Breaking Changes

None. Handlers already needed to be cancel-safe for `CancelledNotification`.

Types of changes

Checklist

Additional context

`session.py:108` pragma removed: `RequestResponder.__exit__` now sees `_completed == False` when a handler is cancelled before reaching `respond()`.

Note for reviewers: this also affects streamable HTTP when `terminate()` fires mid-request (stateless mode after `handle_request()` returns early, or session DELETE). In-flight tools are now cancelled rather than drained. Since they were previously crashing on the closed write stream anyway, this is strictly better.