Skip to content

🐛 fix(run): break deadlock in execution interrupt chain#3869

Draft
gaborbernat wants to merge 1 commit intotox-dev:mainfrom
gaborbernat:flaky
Draft

🐛 fix(run): break deadlock in execution interrupt chain#3869
gaborbernat wants to merge 1 commit intotox-dev:mainfrom
gaborbernat:flaky

Conversation

@gaborbernat
Copy link
Member

On Windows CI (~1/40 runs), a subprocess can hang indefinitely during environment setup — either in virtualenv's interpreter discovery (Pattern A) or during package installation/provisioning (Pattern B). Analysis of the last 30 days of CI revealed 18 flaky timeout failures across 9 different tests, 89% on windows-2025 and 11% on macos-15. 🪟 The affected tests are not specific — any test that runs tox in-process where a subprocess hangs triggers the same deadlock.

The root cause is an unbreakable deadlock chain in common.py. thread.join() blocks the main thread indefinitely so signals from pytest-timeout can never be delivered. as_completed() blocks the tox-interrupt thread so it can never check the interrupt event. And executor.shutdown(wait=True) prevents done.set() from firing even after an interrupt is acknowledged. For Pattern B, tox_env.interrupt() would kill the hung subprocess since it's tracked in _execute_statuses, but it can never fire because KeyboardInterrupt can't reach the blocked main thread.

Fix Pattern A (discovery) Pattern B (install/provision)
thread.join(timeout=1) loop Allows signal delivery Allows signal delivery
_next_completed with interrupt check Breaks out of wait Breaks out of wait
executor.shutdown(wait=not interrupted) Skips stuck worker Skips stuck worker
daemon=True on thread Process can exit Process can exit
done.wait(timeout=5) Bounded cleanup Bounded cleanup

⏱️ The blocking as_completed() is replaced with a polling _next_completed() that checks the interrupt event every second via concurrent.futures.wait(timeout=1, return_when=FIRST_COMPLETED). The interrupt thread is made daemon so the process can exit if it's stuck. thread.join() uses a timeout loop so signals can be delivered on Windows (where lock.acquire() without timeout ignores signals). The interrupt handler gets bounded waits so cleanup doesn't hang forever.

For Pattern A, the upstream fix is in tox-dev/python-discovery#42 which adds a 5s timeout to process.communicate() in _run_subprocess. Together, these changes eliminate both hang patterns.

@gaborbernat gaborbernat added bug:normal affects many people or has quite an impact os:windows os:macOS labels Mar 7, 2026
@gaborbernat gaborbernat marked this pull request as draft March 7, 2026 03:08
@gaborbernat gaborbernat force-pushed the flaky branch 3 times, most recently from 300cef3 to 1939c55 Compare March 9, 2026 15:56
@rahuldevikar
Copy link
Collaborator

We should do similar change to LocalSubprocessExecuteStatus.wait() in __init__.py so it periodically yeilds control and checks for interrupts instead of blocking indefinitely on WaitForSingleObject

On Windows CI (~1/40 runs), a subprocess can hang indefinitely during
environment setup — either in virtualenv's interpreter discovery or
during package installation/provisioning. This created an unbreakable
deadlock: thread.join() blocked the main thread so signals couldn't be
delivered, as_completed() blocked the interrupt thread so it couldn't
check the interrupt event, and executor.shutdown(wait=True) prevented
done.set() from ever firing.

Replace the blocking as_completed() with a polling _next_completed()
that checks the interrupt event every second, make the interrupt thread
a daemon so the process can exit if it's stuck, use timeout loops for
thread.join() so signals can be delivered, and skip waiting for stuck
workers on shutdown when interrupted. This affected 18 flaky timeouts
across 9 different tests in the last 30 days (89% Windows, 11% macOS).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug:normal affects many people or has quite an impact os:macOS os:windows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants