Skip to content

Probe device for skip on main thread#497

Closed
kapilsh wants to merge 1 commit intopytorch:mainfrom
kapilsh:export-D95934130
Closed

Probe device for skip on main thread#497
kapilsh wants to merge 1 commit intopytorch:mainfrom
kapilsh:export-D95934130

Conversation

@kapilsh
Copy link
Copy Markdown

@kapilsh kapilsh commented Mar 10, 2026

Summary:
See CI signals on PR: #494

Runner: ubuntu-latest is a standard GitHub-hosted runner, which is a vanilla VM with no InfiniBand/RDMA hardware

Build flags: -DUSE_IBVERBS=ON: compiles IBVERBS support and installs libibverbs-dev headers for compilation only

gloo_test crashes with terminate called recursively (exit code 134) on CI runners that compile with IBVERBS/TLS support but lack the corresponding hardware at runtime (e.g., GitHub Actions ubuntu-latest with -DUSE_IBVERBS=ON).

Root cause: BaseTest::spawn() calls GTEST_SKIP() from worker threads when createDevice() returns nullptr for an unavailable transport. GTest assertion/skip macros are not thread-safe — concurrent calls from multiple threads race on GTest's internalTestPartResultReporterInterface, corrupting state. This leads to an exception during stack unwinding, triggering recursive std::terminate().

The crash manifests at the first IBVERBS test case (AllgatherRing/AllgatherTest.VarNumPointer/360) because all prior transport tests (TCP, TCP_LAZY, TCP_TLS) succeed, and IBVERBS is the first transport where createDevice() returns nullptr on a machine without RDMA hardware.

Fix: Probe transport availability from the main test thread before spawning workers. If the transport is unavailable, GTEST_SKIP() is called from the main thread (where it is safe) and the test returns early. Per-thread device creation is preserved for socket address isolation, with a silent early return as a defensive fallback.

Differential Revision: D95934130

@meta-cla meta-cla Bot added the CLA Signed label Mar 10, 2026
@meta-codesync
Copy link
Copy Markdown

meta-codesync Bot commented Mar 10, 2026

@kapilsh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D95934130.

kapilsh added a commit to kapilsh/gloo that referenced this pull request Mar 10, 2026
…ytorch#497)

Summary:

`gloo_test` crashes with `terminate called recursively` (exit code 134)
on CI runners that compile with IBVERBS support but lack RDMA hardware
at runtime (e.g., GitHub Actions ubuntu-latest with `-DUSE_IBVERBS=ON`).

There are two interacting bugs:

1. **GTEST_SKIP() called from worker threads**: `BaseTest::spawn()` runs
   `contextSize` threads via `spawnThreads()`, and each thread calls
   `GTEST_SKIP()` when `createDevice()` returns nullptr. GTest
   assertion/skip macros are not thread-safe — concurrent calls race on
   GTest's internal `TestPartResultReporterInterface`, corrupting state.

2. **IBVERBS Device destructor throws**: On CI runners with software RDMA
   devices (rxe/siw), `CreateDevice(IBVERBS)` succeeds but device
   teardown fails. `Device::~Device()` uses `GLOO_ENFORCE_EQ` which
   throws `EnforceNotMet` from a destructor — the compiler warns about
   this (`'throw' will always call 'terminate' [-Wterminate]`). This
   makes a "probe and discard" approach unsafe.

Fix: Replace the in-thread `GTEST_SKIP()` with a `std::atomic<bool>`
flag. Worker threads set the flag and return early when the transport is
unavailable. After all threads join, the main thread checks the flag and
calls `GTEST_SKIP()` — which is safe because it runs single-threaded.
This avoids both the GTest data race and any device lifecycle side
effects.

Differential Revision: D95934130
…estructor

Summary:
`gloo_test` crashes with `terminate called recursively` (exit code 134)
on CI runners that compile with IBVERBS support but lack RDMA hardware
at runtime (e.g., GitHub Actions ubuntu-latest with `-DUSE_IBVERBS=ON`).

There are two interacting bugs:

1. **GTEST_SKIP() called from worker threads** (`base_test.h`):
   `BaseTest::spawn()` runs `contextSize` threads via `spawnThreads()`,
   and each thread calls `GTEST_SKIP()` when `createDevice()` returns
   nullptr. GTest assertion/skip macros are not thread-safe — concurrent
   calls race on GTest's internal `TestPartResultReporterInterface`,
   corrupting state.

2. **IBVERBS Device destructor throws** (`ibverbs/device.cc`):
   On CI runners with software RDMA devices (rxe/siw),
   `CreateDevice(IBVERBS)` succeeds but device teardown can fail.
   `Device::~Device()` uses `GLOO_ENFORCE_EQ` which throws
   `EnforceNotMet` from a destructor. In C++11+ this is guaranteed to
   call `std::terminate` — the compiler warns: `'throw' will always
   call 'terminate' [-Wterminate]`.

Fixes:
- **base_test.h**: Replace the in-thread `GTEST_SKIP()` with a
  `std::atomic<bool>` flag. Worker threads set the flag and return
  early when the transport is unavailable. After all threads join, the
  main thread checks the flag and calls `GTEST_SKIP()` — safe because
  it runs single-threaded.
- **ibverbs/device.cc**: Replace `GLOO_ENFORCE_EQ` (which throws) in
  `Device::~Device()` with `GLOO_ERROR` (which logs). Destructors must
  never throw.

Differential Revision: D95934130
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant