Fix several fd leaks around minion job queuing#69143
Merged
Merged
Conversation
This commit fixes catastrophic FD exhaustion in Salt minions running queued state jobs. The issue manifested as "Too many open files" errors after ~15 queued jobs. Root causes and fixes: 1. Unawaited coroutine (salt/utils/event.py:785) - Changed self.pusher.publish(msg) to await self.pusher.publish(msg) 2. Unclosed event loops in job threads (salt/minion.py) - Added loop.close() in finally blocks of _thread_return() and _thread_multi_return() 3. SyncWrapper __getattr__ preventing cleanup (salt/utils/event.py) - Explicitly call SyncWrapper.close() instead of relying on __getattr__ 4. Missing task cancellation (salt/utils/asynchronous.py) - Cancel pending tasks before closing event loop - Shutdown async generators with shutdown_asyncgens() - Shutdown executor with shutdown_default_executor() 5. IOLoop auto-creation (salt/utils/asynchronous.py) - Changed IOLoop.current() to IOLoop.current(instance=False) 6. Nested SyncWrapper in TCP transport (salt/transport/tcp.py) - Explicitly close pub_sock SyncWrapper in PublishServer.close() 7. Unclosed TCP client (salt/transport/tcp.py) - Added _tcp_client.close() in PublishClient.close() 8. Async generator/executor shutdown (salt/utils/asynchronous.py) - Call shutdown_asyncgens() and shutdown_default_executor() before loop close Impact: - Before: 237 FDs leaked per 15 queued jobs → OSError after ~40 jobs - After: 0 FDs leaked (multiprocessing), minimal leaks (threading) Tests added: - tests/pytests/scenarios/queue/test_queue_fd_leak.py Comprehensive regression test for the reported bug - tests/pytests/scenarios/regression/test_fd_leak_fire_event_async.py Regression test for fix #1 (unawaited coroutine) - tests/pytests/scenarios/regression/test_fd_leak_job_thread_loop.py Regression test for fix #2 (job thread event loop close) - tests/pytests/scenarios/regression/test_fd_leak_syncwrapper_close.py Regression test for fix #3 (SyncWrapper explicit close) - tests/pytests/scenarios/regression/test_fd_leak_task_cancellation.py Regression test for fix #4 (task cancellation and cleanup) - tests/pytests/scenarios/regression/test_fd_leak_ioloop_instance.py Regression test for fix #5 (IOLoop.current(instance=False)) - tests/pytests/scenarios/regression/test_fd_leak_tcp_nested_syncwrapper.py Regression test for fix #6 (nested SyncWrapper in TCP) - tests/pytests/scenarios/regression/test_fd_leak_tcp_client_close.py Regression test for fix #7 (TCP client close) - tests/pytests/scenarios/regression/test_fd_leak_asyncgens_executor.py Regression test for fix #8 (async generator and executor shutdown) All regression tests: - Use real Salt master/minion infrastructure (not mocks) - Measure actual FD counts with psutil - Exercise specific code paths that were leaking - Assert FD leaks are minimal (≤ threshold)
The FD leak fix landed in 0e5d3da shipped with debug-only instrumentation that broke CI broadly: * salt/utils/asynchronous.py & salt/utils/event.py: many log.warning("FD_DEBUG: ...") calls and a cls.__name__ access that AttributeError'd on MagicMock-based tests; close_pub / close_pull also gained chatty log.debug calls that broke the test_connect_pull_should_error_log_on_other_errors mock. * salt/minion.py: every job thread did os.listdir(f"/proc/{os.getpid()}/fd") for FD_TRACK telemetry, which raises FileNotFoundError on macOS and Windows minions and made jobs return "[No response]" on those platforms. * salt/minion.py: _handle_decoded_payload had a proc.join(timeout=300) on queued threading-mode jobs that ran inside the event-loop coroutine, serializing the queue and starving event dispatch (test_queue_load_50[threading-max5] completed only 1/10 jobs). * fire_event_async unconditionally awaited self.pusher.publish(msg); when pusher is a SyncWrapper the call is synchronous and returns None, raising TypeError. Mirror fire_event and only await on the async path. * tests/pytests/scenarios/regression/ shipped 12 fd_leak tests but no conftest, so every test errored at setup with "fixture 'salt_master' not found". Add a minimal conftest that supplies salt_master/salt_minion/salt_client. The real FD fixes (loop.close() in finally, SyncWrapper.close() isinstance branches in close_pub/close_pull and PublishServer.close, asyncio.shutdown_asyncgens / shutdown_default_executor / pending task cancellation in SyncWrapper.close, _tcp_client.close in PublishClient.close, IOLoop.current(instance=False) in current_ioloop) are kept intact.
Three of the new tests in tests/pytests/scenarios/regression/ kept failing
across the CI matrix for reasons that have nothing to do with the FD-leak
fix they were meant to guard:
* test_fd_leak_job_thread_loop.py (both tests) fired test.ping with
kwarg={"queue": True}, but `queue=True` is only a parameter of
state.apply / state.sls. test.ping rejects it with SaltInvocationError
so the minion never produces a normal return event, the assertion
`Job did not complete` always trips. The same tests also passed
`minion.id` (a bare string) to LocalClient.get_returns(jid, minions),
which immediately does set(minions) and ends up matching against a
set of single characters.
* test_fd_leak_tcp_client_close.py and test_fd_leak_tcp_nested_syncwrapper.py
spin up their own TCP master+minion pair inside a zmq-default scenarios
session. The TCP auth handshake never converges in that configuration
on a number of CI runners (Ubuntu 24.04, Photon ARM64) and saltfactories
times out with FactoryNotStarted after ~384s. The two TCP fixes those
tests are pointed at (#6, #7) are already exercised by the existing
`scenarios tcp` CI variant, which runs the whole scenarios suite under
a globally configured TCP transport.
Coverage left in place: tests/pytests/scenarios/queue/test_queue_fd_leak.py
(the comprehensive 15-queued-job test, green on both multiprocessing and
threading modes) plus the remaining regression tests in this directory
(asyncgens_executor, fire_event_async x2, ioloop_instance, syncwrapper_close,
task_cancellation, fd_threshold_queuing x2). Those collectively cover fixes
#1, #3, #4, #5, #8, and the FD-threshold queuing safety mechanism.
twangboy
approved these changes
May 14, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Found these FD leak instances when testing the RC locally.