
Fix several fd leaks around minion job queuing #69143

Merged

dwoz merged 6 commits into saltstack:3008.x from dwoz:bugfix/minion_queue on May 14, 2026
Conversation

Contributor
@dwoz dwoz commented May 14, 2026

Found these FD leak instances when testing the RC locally.

dwoz added 3 commits May 12, 2026 18:36
This commit fixes catastrophic FD exhaustion in Salt minions running
queued state jobs. The issue manifested as "Too many open files" errors
after ~15 queued jobs.

Root causes and fixes:

1. Unawaited coroutine (salt/utils/event.py:785)
   - Changed self.pusher.publish(msg) to await self.pusher.publish(msg)

2. Unclosed event loops in job threads (salt/minion.py)
   - Added loop.close() in finally blocks of _thread_return() and _thread_multi_return()

3. SyncWrapper __getattr__ preventing cleanup (salt/utils/event.py)
   - Explicitly call SyncWrapper.close() instead of relying on __getattr__

4. Missing task cancellation (salt/utils/asynchronous.py)
   - Cancel pending tasks before closing event loop
   - Shutdown async generators with shutdown_asyncgens()
   - Shutdown executor with shutdown_default_executor()

5. IOLoop auto-creation (salt/utils/asynchronous.py)
   - Changed IOLoop.current() to IOLoop.current(instance=False)

6. Nested SyncWrapper in TCP transport (salt/transport/tcp.py)
   - Explicitly close pub_sock SyncWrapper in PublishServer.close()

7. Unclosed TCP client (salt/transport/tcp.py)
   - Added _tcp_client.close() in PublishClient.close()

8. Async generator/executor shutdown (salt/utils/asynchronous.py)
   - Call shutdown_asyncgens() and shutdown_default_executor() before loop close
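Fixes 2, 4, and 8 all concern tearing down a per-thread event loop completely. The sketch below is illustrative only (not Salt's actual code, and the function name is invented): it shows the cleanup order the commit describes, assuming Python 3.9+ for `shutdown_default_executor()`.

```python
import asyncio

def run_job_in_thread(coro_factory):
    # Illustrative sketch of fixes 2, 4, and 8: give each job thread its
    # own loop and tear it down fully so no file descriptors outlive the job.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        return loop.run_until_complete(coro_factory())
    finally:
        # Fix 4: cancel any tasks still pending on this loop, then let
        # them observe the cancellation.
        pending = asyncio.all_tasks(loop)
        for task in pending:
            task.cancel()

        async def _drain():
            if pending:
                await asyncio.gather(*pending, return_exceptions=True)
            # Fix 8: flush async generators and the default executor
            # before closing the loop, or their FDs leak.
            await loop.shutdown_asyncgens()
            await loop.shutdown_default_executor()

        loop.run_until_complete(_drain())
        # Fix 2: close the loop itself (frees its self-pipe/selector FDs).
        loop.close()
        asyncio.set_event_loop(None)
```

The ordering matters: tasks must be cancelled and generators flushed while the loop can still run, i.e. before `loop.close()`.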

Impact:
- Before: 237 FDs leaked per 15 queued jobs → OSError after ~40 jobs
- After: 0 FDs leaked (multiprocessing), minimal leaks (threading)

Tests added:
- tests/pytests/scenarios/queue/test_queue_fd_leak.py
  Comprehensive regression test for the reported bug

- tests/pytests/scenarios/regression/test_fd_leak_fire_event_async.py
  Regression test for fix #1 (unawaited coroutine)

- tests/pytests/scenarios/regression/test_fd_leak_job_thread_loop.py
  Regression test for fix #2 (job thread event loop close)

- tests/pytests/scenarios/regression/test_fd_leak_syncwrapper_close.py
  Regression test for fix #3 (SyncWrapper explicit close)

- tests/pytests/scenarios/regression/test_fd_leak_task_cancellation.py
  Regression test for fix #4 (task cancellation and cleanup)

- tests/pytests/scenarios/regression/test_fd_leak_ioloop_instance.py
  Regression test for fix #5 (IOLoop.current(instance=False))

- tests/pytests/scenarios/regression/test_fd_leak_tcp_nested_syncwrapper.py
  Regression test for fix #6 (nested SyncWrapper in TCP)

- tests/pytests/scenarios/regression/test_fd_leak_tcp_client_close.py
  Regression test for fix #7 (TCP client close)

- tests/pytests/scenarios/regression/test_fd_leak_asyncgens_executor.py
  Regression test for fix #8 (async generator and executor shutdown)

All regression tests:
- Use real Salt master/minion infrastructure (not mocks)
- Measure actual FD counts with psutil
- Exercise specific code paths that were leaking
- Assert FD leaks are minimal (≤ threshold)
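The test pattern described above can be sketched roughly as follows. This is a simplified reconstruction, not the tests' actual code: the helper names are invented, and it assumes psutil is available (`num_fds()` on POSIX, `num_handles()` on Windows, avoiding the /proc-only approach).

```python
import os
import psutil

def count_open_fds():
    # Cross-platform open-FD/handle count for the current process.
    proc = psutil.Process(os.getpid())
    if hasattr(proc, "num_fds"):  # POSIX
        return proc.num_fds()
    return proc.num_handles()  # Windows

def assert_no_fd_leak(workload, iterations=15, threshold=5):
    # Run the suspect code path repeatedly and assert the FD count
    # stays within a small tolerance of the baseline.
    workload()  # warm-up: let lazily created singletons allocate FDs
    before = count_open_fds()
    for _ in range(iterations):
        workload()
    leaked = count_open_fds() - before
    assert leaked <= threshold, f"leaked {leaked} FDs over {iterations} runs"
```

A workload that opens and closes a socket should pass; one that keeps sockets alive across iterations should trip the assertion.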
The FD leak fix that landed in 0e5d3da shipped with debug-only
instrumentation that broke CI broadly:

  * salt/utils/asynchronous.py & salt/utils/event.py: many
    log.warning("FD_DEBUG: ...") calls and a cls.__name__ access
    that AttributeError'd on MagicMock-based tests; close_pub /
    close_pull also gained chatty log.debug calls that broke the
    test_connect_pull_should_error_log_on_other_errors mock.
  * salt/minion.py: every job thread did
    os.listdir(f"/proc/{os.getpid()}/fd") for FD_TRACK telemetry,
    which raises FileNotFoundError on macOS and Windows minions
    and made jobs return "[No response]" on those platforms.
  * salt/minion.py: _handle_decoded_payload had a proc.join(timeout=300)
    on queued threading-mode jobs that ran inside the event-loop
    coroutine, serializing the queue and starving event dispatch
    (test_queue_load_50[threading-max5] completed only 1/10 jobs).
  * fire_event_async unconditionally awaited self.pusher.publish(msg);
    when pusher is a SyncWrapper the call is synchronous and returns
    None, raising TypeError. Mirror fire_event and only await on the
    async path.
  * tests/pytests/scenarios/regression/ shipped 12 fd_leak tests but
    no conftest, so every test errored at setup with
    "fixture 'salt_master' not found". Add a minimal conftest that
    supplies salt_master/salt_minion/salt_client.
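The fire_event_async fix in the fourth bullet can be sketched as below. This is a simplified stand-in for the real method (the class and attribute names are reduced to essentials): detect whether `publish()` returned a coroutine and only await on that path, mirroring fire_event.

```python
import asyncio

async def fire_event_async(pusher, msg):
    # pusher.publish() may be a coroutine function (async transport) or a
    # synchronous SyncWrapper method that returns None. Awaiting None
    # raises TypeError, so only await when we actually got a coroutine.
    ret = pusher.publish(msg)
    if asyncio.iscoroutine(ret):
        await ret
```

With this guard, the same call site works whether the pusher was built on the async or the sync wrapper path.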

The real FD fixes (loop.close() in finally, SyncWrapper.close()
isinstance branches in close_pub/close_pull and PublishServer.close,
asyncio.shutdown_asyncgens / shutdown_default_executor /
pending task cancellation in SyncWrapper.close, _tcp_client.close
in PublishClient.close, IOLoop.current(instance=False) in
current_ioloop) are kept intact.
Three of the new tests in tests/pytests/scenarios/regression/ kept failing
across the CI matrix for reasons that have nothing to do with the FD-leak
fix they were meant to guard:

  * test_fd_leak_job_thread_loop.py (both tests) fired test.ping with
    kwarg={"queue": True}, but `queue=True` is only a parameter of
    state.apply / state.sls. test.ping rejects it with SaltInvocationError,
    so the minion never produces a normal return event and the
    `Job did not complete` assertion always trips. The same tests also
    passed `minion.id` (a bare string) to LocalClient.get_returns(jid,
    minions), which immediately does set(minions) and ends up matching
    against a set of single characters.

  * test_fd_leak_tcp_client_close.py and test_fd_leak_tcp_nested_syncwrapper.py
    spin up their own TCP master+minion pair inside a zmq-default scenarios
    session. The TCP auth handshake never converges in that configuration
    on a number of CI runners (Ubuntu 24.04, Photon ARM64) and saltfactories
    times out with FactoryNotStarted after ~384s. The two TCP fixes those
    tests are pointed at (#6, #7) are already exercised by the existing
    `scenarios tcp` CI variant, which runs the whole scenarios suite under
    a globally configured TCP transport.
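The set(minions) pitfall in the first bullet is easy to reproduce in isolation (minion id below is made up for illustration): iterating a string yields its characters, so a bare id must be wrapped in a list before it reaches an API that builds a set from its argument.

```python
# Passing a bare minion id string where a list is expected turns it
# into a set of single characters, so no real minion id ever matches.
minion_id = "web-01"
assert set(minion_id) == {"w", "e", "b", "-", "0", "1"}

# Correct call shape: wrap the id in a list first.
assert set([minion_id]) == {"web-01"}
```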

Coverage left in place: tests/pytests/scenarios/queue/test_queue_fd_leak.py
(the comprehensive 15-queued-job test, green on both multiprocessing and
threading modes) plus the remaining regression tests in this directory
(asyncgens_executor, fire_event_async x2, ioloop_instance, syncwrapper_close,
task_cancellation, fd_threshold_queuing x2). Those collectively cover fixes
#1, #3, #4, #5, #8, and the FD-threshold queuing safety mechanism.
@dwoz dwoz merged commit a3abd65 into saltstack:3008.x May 14, 2026
911 of 917 checks passed

Labels

test:full Run the full test suite
