🐛 fix(run): break deadlock in execution interrupt chain #3869
Draft
gaborbernat wants to merge 1 commit into tox-dev:main from
300cef3 to 1939c55
rahuldevikar (Collaborator) approved these changes on Mar 9, 2026
We should do similar change to
On Windows CI (~1/40 runs), a subprocess can hang indefinitely during environment setup — either in virtualenv's interpreter discovery or during package installation/provisioning. This created an unbreakable deadlock: thread.join() blocked the main thread so signals couldn't be delivered, as_completed() blocked the interrupt thread so it couldn't check the interrupt event, and executor.shutdown(wait=True) prevented done.set() from ever firing. Replace the blocking as_completed() with a polling _next_completed() that checks the interrupt event every second, make the interrupt thread a daemon so the process can exit if it's stuck, use timeout loops for thread.join() so signals can be delivered, and skip waiting for stuck workers on shutdown when interrupted. This affected 18 flaky timeouts across 9 different tests in the last 30 days (89% Windows, 11% macOS).
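The timeout loop for `thread.join()` described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `join_in_slices` and its `poll` parameter are hypothetical.

```python
import threading
import time


def join_in_slices(thread: threading.Thread, poll: float = 1.0) -> None:
    """Join ``thread`` in short slices instead of one unbounded join.

    Each ``join(timeout=poll)`` call returns after at most ``poll`` seconds,
    so the calling thread regains control between attempts.
    """
    while thread.is_alive():
        thread.join(timeout=poll)


if __name__ == "__main__":
    # daemon=True means this thread would not keep the process alive on exit.
    worker = threading.Thread(target=time.sleep, args=(0.2,), daemon=True)
    worker.start()
    join_in_slices(worker, poll=0.05)
```

The loop exits once the thread has finished, so the behaviour for a healthy thread is the same as a plain `join()`.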
On Windows CI (~1/40 runs), a subprocess can hang indefinitely during environment setup, either in virtualenv's interpreter discovery (Pattern A) or during package installation/provisioning (Pattern B). Analysis of the last 30 days of CI revealed 18 flaky timeout failures across 9 different tests, 89% on `windows-2025` and 11% on `macos-15`. 🪟 The affected tests are not specific: any test that runs tox in-process where a subprocess hangs triggers the same deadlock.

The root cause is an unbreakable deadlock chain in `common.py`. `thread.join()` blocks the main thread indefinitely so signals from `pytest-timeout` can never be delivered. `as_completed()` blocks the `tox-interrupt` thread so it can never check the `interrupt` event. And `executor.shutdown(wait=True)` prevents `done.set()` from firing even after an interrupt is acknowledged. For Pattern B, `tox_env.interrupt()` would kill the hung subprocess since it's tracked in `_execute_statuses`, but it can never fire because `KeyboardInterrupt` can't reach the blocked main thread.

Changes:

- `thread.join(timeout=1)` loop
- `_next_completed` with interrupt check
- `executor.shutdown(wait=not interrupted)`
- `daemon=True` on thread
- `done.wait(timeout=5)`

⏱️ The blocking `as_completed()` is replaced with a polling `_next_completed()` that checks the `interrupt` event every second via `concurrent.futures.wait(timeout=1, return_when=FIRST_COMPLETED)`. The interrupt thread is made daemon so the process can exit if it's stuck. `thread.join()` uses a timeout loop so signals can be delivered on Windows (where `lock.acquire()` without timeout ignores signals). The interrupt handler gets bounded waits so cleanup doesn't hang forever.

For Pattern A, the upstream fix is in tox-dev/python-discovery#42 which adds a 5s timeout to `process.communicate()` in `_run_subprocess`. Together, these changes eliminate both hang patterns.
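For reviewers, here is a rough sketch of the polling idea, assuming only the standard library. It is not the actual `_next_completed()` from this PR: the names `next_completed` and `run_all` and the `poll` parameter are hypothetical, chosen for illustration.

```python
import threading
from concurrent.futures import FIRST_COMPLETED, Future, ThreadPoolExecutor, wait
from typing import Callable, List, Optional, Set


def next_completed(
    pending: Set[Future], interrupt: threading.Event, poll: float = 1.0
) -> Optional[Future]:
    """Return one finished future, or None when interrupted or nothing is left.

    Unlike a blocking as_completed(), this wakes up every ``poll`` seconds
    to re-check the interrupt event.
    """
    while pending and not interrupt.is_set():
        done, _ = wait(pending, timeout=poll, return_when=FIRST_COMPLETED)
        if done:
            future = next(iter(done))
            pending.discard(future)
            return future
    return None


def run_all(
    jobs: List[Callable[[], object]], interrupt: threading.Event, poll: float = 1.0
) -> List[object]:
    """Run jobs on a pool and collect results until done or interrupted."""
    executor = ThreadPoolExecutor(max_workers=2)
    results: List[object] = []
    try:
        pending = {executor.submit(job) for job in jobs}
        while (future := next_completed(pending, interrupt, poll)) is not None:
            results.append(future.result())
    finally:
        # Do not wait for possibly stuck workers once an interrupt was requested.
        executor.shutdown(wait=not interrupt.is_set())
    return results
```

With the interrupt event unset this behaves like draining `as_completed()`. Once the event is set, the loop returns within roughly one `poll` interval and the shutdown no longer blocks on unfinished workers.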