Fix _Once.ensure() to propagate handshake failure to concurrent waiters #3397

Open
bysiber wants to merge 1 commit into python-trio:main from bysiber:fix/ssl-once-ensure-hang-on-failure

Conversation


@bysiber bysiber commented Feb 20, 2026

Problem

_Once.ensure() in SSLStream can leave concurrent waiters hanging forever when the handshake fails.

When two tasks share an SSLStream (one sending, one receiving), both call ensure() to lazily perform the TLS handshake. The first task sets started = True and begins the handshake. The second task sees started = True, finds _done not yet set, and enters _done.wait().

If the handshake fails (certificate error, connection reset, etc.), the exception propagates to the first task — but _done.set() is never called. The second task is stuck forever in _done.wait(): the Event will never be signalled, and started is permanently True, so re-entry won't help either.

Reproduction scenario

  1. Task A calls send_all() → enters ensure(), starts handshake
  2. Task B calls receive_some() → enters ensure(), waits on _done
  3. Handshake fails (remote peer rejects cert)
  4. Task A gets BrokenResourceError — correct
  5. Task B hangs indefinitely
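
The scenario above can be sketched with a minimal stdlib asyncio analogue. This is illustrative only: trio's real _Once lives in src/trio/_ssl.py and uses trio primitives, and the class and helper names here are made up for the demo. The timeout exists solely to prove the hang; without it, task B would block forever.

```python
import asyncio

class OnceBuggy:
    """Asyncio sketch of the current _Once behaviour (names are illustrative)."""

    def __init__(self, afn):
        self._afn = afn
        self.started = False
        self._done = asyncio.Event()

    async def ensure(self):
        if not self.started:
            self.started = True
            await self._afn()        # if this raises, _done.set() is skipped
            self._done.set()
        else:
            await self._done.wait()  # second caller parks here forever

async def failing_handshake():
    # Stands in for a TLS handshake rejected by the peer
    raise ConnectionResetError("peer rejected certificate")

async def main():
    once = OnceBuggy(failing_handshake)

    async def task_a():  # the sender: starts the handshake
        try:
            await once.ensure()
        except ConnectionResetError:
            return "A: got error"

    async def task_b():  # the receiver: waits on _done
        try:
            # _done is never set, so without this timeout B would hang forever
            await asyncio.wait_for(once.ensure(), timeout=0.1)
        except asyncio.TimeoutError:
            return "B: still hung after 0.1s"

    return await asyncio.gather(task_a(), task_b())

print(asyncio.run(main()))  # ['A: got error', 'B: still hung after 0.1s']
```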

Fix

Store the exception on failure and still signal _done, so that concurrent waiters (and any future callers) wake up and receive a BrokenResourceError chained from the original handshake exception.
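
A minimal asyncio sketch of this shape follows. Hedged liberally: the actual PR stores the exception object and raises BrokenResourceError chained from it, while this standalone variant records only a failure flag and raises a plain RuntimeError so it runs without trio installed.

```python
import asyncio

class OnceFixed:
    """Asyncio sketch of the fix: always signal _done, remember the failure."""

    def __init__(self, afn):
        self._afn = afn
        self.started = False
        self._done = asyncio.Event()
        self._failed = False  # the PR stores the exception itself; a flag avoids that

    async def ensure(self):
        if not self.started:
            self.started = True
            try:
                await self._afn()
            except BaseException:
                self._failed = True
                raise                 # first caller still sees the real exception
            finally:
                self._done.set()      # wake waiters on success *and* failure
        else:
            await self._done.wait()
            if self._failed:
                # trio would raise BrokenResourceError chained from the original
                # handshake exception; RuntimeError keeps the sketch stdlib-only
                raise RuntimeError("one-shot initialization failed")

async def failing_handshake():
    raise ConnectionResetError("peer rejected certificate")

async def main():
    once = OnceFixed(failing_handshake)
    results = []

    async def task_a():
        try:
            await once.ensure()
        except ConnectionResetError:
            results.append("A: original error")

    async def task_b():
        try:
            await once.ensure()   # wakes immediately instead of hanging
        except RuntimeError:
            results.append("B: woken with error")

    await asyncio.gather(task_a(), task_b())
    return results

print(asyncio.run(main()))  # ['A: original error', 'B: woken with error']
```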


codecov bot commented Feb 20, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 99.97942%. Comparing base (2fd138e) to head (aa3256a).

Files with missing lines Patch % Lines
src/trio/_ssl.py 66.66667% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@                 Coverage Diff                  @@
##                 main       #3397         +/-   ##
====================================================
- Coverage   100.00000%   99.97942%   -0.02058%     
====================================================
  Files             128         128                 
  Lines           19424       19434         +10     
  Branches         1318        1320          +2     
====================================================
+ Hits            19424       19430          +6     
- Misses              0           2          +2     
- Partials            0           2          +2     
Files with missing lines Coverage Δ
src/trio/_ssl.py 98.14815% <66.66667%> (-1.85185%) ⬇️

Contributor

@A5rocks A5rocks left a comment


Could you create an issue before making a PR fixing a bug like this? I'd like to confirm that this is a real problem first!

try:
    await self._afn(*self._args)
except BaseException as exc:
    self._failure = exc

This is a bad idea, because a) the stack frames for exc will be mutated, so you will have weird stack traces for raise ... from self._failure and b) this will lead to a refcycle I believe.
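
The refcycle concern can be demonstrated with a small stdlib-only snippet (Holder is a made-up class, not trio code): the stored exception's traceback keeps the catching frame alive, and that frame's locals still include self, closing the loop self → exception → traceback → frame → self.

```python
class Holder:
    # Made-up class mirroring the pattern under review: the instance
    # stores an exception raised inside one of its own methods.
    def run(self):
        try:
            raise ValueError("handshake failed")
        except ValueError as exc:
            self._failure = exc   # same `self._failure = exc` pattern as the patch

h = Holder()
h.run()

# The traceback attached to the stored exception still references the
# frame of run(), and that frame's locals still contain `self`:
frame = h._failure.__traceback__.tb_frame
print(frame.f_locals["self"] is h)  # True -> h is part of a reference cycle
```

The cycle is collectable by the garbage collector, but it delays cleanup and keeps every local in the handshake frames alive until a gc pass runs.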


2 participants