fix(shim): resolve intermittent deadlocks and hangs in monitor module #458

Closed
novahe wants to merge 1 commit into containerd:main from novahe:optimize-shim-monitor

@novahe novahe commented Mar 14, 2026

Description

This PR addresses intermittent hangs and potential deadlocks in the containerd-shim monitor module by optimizing the locking mechanism and event distribution logic.

The Problem

The previous implementation suffered from two main issues:

  1. Async Lock in Destructors: The global MONITOR used tokio::sync::Mutex. When a Subscription was dropped, it tried to acquire this lock to unsubscribe. If the drop happened outside an active Tokio
    runtime (e.g., during shim shutdown or on a blocking thread), the process could panic or hang.
  2. Reaper Thread Blocking: Exit events were sent over bounded channels, so the "Reaper" thread (responsible for SIGCHLD and waitpid) could block whenever a subscriber was slow to consume events. While
    blocked, it could not collect other process exits, which led to zombie processes and a total shim hang under heavy workloads.
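
To make the second issue concrete, here is a minimal sketch using only the standard library. It is not the shim's code: the function names and the channel capacity of 1 are assumptions chosen for illustration. It uses try_send to show that a send on a full bounded channel would block, and contrasts that with an unbounded channel that simply queues events.

```rust
use std::sync::mpsc::{channel, sync_channel, TrySendError};

/// Hypothetical helper: returns true if a second send on a full bounded
/// channel would have to block. The capacity of 1 is an assumption.
pub fn bounded_send_would_block() -> bool {
    // `_rx` stays alive but never reads, like a slow subscriber.
    let (tx, _rx) = sync_channel::<i32>(1);

    // The first send fits in the buffer and returns immediately.
    tx.send(100).unwrap();

    // The buffer is now full. A plain `send` would block the calling thread
    // here; `try_send` reports the same condition without blocking.
    matches!(tx.try_send(101), Err(TrySendError::Full(_)))
}

/// Hypothetical helper: sends `n` events on an unbounded channel while the
/// receiver is idle and returns how many were queued.
pub fn unbounded_sends_queued(n: usize) -> usize {
    let (tx, rx) = channel::<i32>();
    for pid in 0..n as i32 {
        // `send` on an unbounded std channel never waits for the receiver.
        tx.send(pid).unwrap();
    }
    // Drain whatever is pending without blocking.
    rx.try_iter().count()
}
```

In this sketch a producer that called send instead of try_send on the full bounded channel would stall until the subscriber read an event, which is the kind of stall described for the Reaper thread. The unbounded variant avoids the stall at the cost of memory growing if a subscriber never drains its queue.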

The Solution

  1. Synchronous Locking: Replaced tokio::sync::Mutex with std::sync::Mutex for the global MONITOR singleton. This ensures that Subscription::drop can safely and synchronously unregister itself in any
    execution context without relying on the async runtime.
  2. Unbounded Channels: Switched to unbounded channels (tokio::sync::mpsc::unbounded_channel for async and std::sync::mpsc::channel for sync). Since PID exit events are critical and must not be lost,
    unbounded channels ensure that the producer (the Reaper thread) is never blocked by subscriber backpressure.
  3. Strict FIFO Ordering: With unbounded channels, send is a non-blocking in-memory operation, so the notification logic could be moved back inside the lock. This ensures strict FIFO
    ordering of events and guarantees that no events are delivered after a successful unsubscription.
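
The three points above can be sketched together with the standard library alone. This is an illustrative sketch under stated assumptions, not the PR's implementation: ExitEvent, Monitor, Subscription, subscribe, notify and subscriber_count are hypothetical names, and an Arc<Mutex<Monitor>> stands in for the global MONITOR singleton so the example stays self-contained.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex};

/// Hypothetical exit event; the real shim type may carry more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct ExitEvent {
    pub pid: i32,
    pub exit_code: i32,
}

/// Hypothetical monitor state guarded by a std::sync::Mutex.
#[derive(Default)]
pub struct Monitor {
    next_id: u64,
    subscribers: HashMap<u64, Sender<ExitEvent>>,
}

/// Hypothetical subscription handle that unregisters itself on drop.
pub struct Subscription {
    id: u64,
    monitor: Arc<Mutex<Monitor>>,
    pub rx: Receiver<ExitEvent>,
}

pub fn subscribe(monitor: &Arc<Mutex<Monitor>>) -> Subscription {
    // Unbounded channel: the sender never waits for the receiver.
    let (tx, rx) = channel();
    let mut guard = monitor.lock().unwrap();
    let id = guard.next_id;
    guard.next_id += 1;
    guard.subscribers.insert(id, tx);
    Subscription { id, monitor: Arc::clone(monitor), rx }
}

pub fn notify(monitor: &Arc<Mutex<Monitor>>, event: ExitEvent) {
    // Sending while holding the lock is acceptable here because `send` on an
    // unbounded channel does not block. It also means a subscriber removed
    // under the same lock cannot receive a later event.
    let guard = monitor.lock().unwrap();
    for tx in guard.subscribers.values() {
        // An error only means the receiver is gone; ignore it in this sketch.
        let _ = tx.send(event.clone());
    }
}

impl Drop for Subscription {
    fn drop(&mut self) {
        // A synchronous lock works in any context, with or without an async
        // runtime. A poisoned lock is skipped rather than panicking in drop.
        if let Ok(mut guard) = self.monitor.lock() {
            guard.subscribers.remove(&self.id);
        }
    }
}

pub fn subscriber_count(monitor: &Arc<Mutex<Monitor>>) -> usize {
    monitor.lock().unwrap().subscribers.len()
}
```

A caller would create the monitor with Arc::new(Mutex::new(Monitor::default())), call subscribe, and read events from the rx field. Because each subscriber has its own queue and notify holds the lock for the whole fan-out, every subscriber sees events in the order notify was called.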

This commit addresses several deadlock and hang scenarios in the monitor module:
- Replaced tokio::sync::Mutex with std::sync::Mutex to allow safe Drop implementation.
- Switched to unbounded channels to prevent Reaper thread blocking on subscriber backpressure.
- Restored in-lock notification to ensure strict FIFO ordering and unsubscription consistency.
@github-actions github-actions bot added the C-shim Containerd shim label Mar 14, 2026
@novahe novahe closed this Mar 14, 2026