Closed
Conversation
78035f9 to
2eb21de
Compare
kapilsh
added a commit
to kapilsh/gloo
that referenced
this pull request
Mar 10, 2026
…ytorch#497) Summary: `gloo_test` crashes with `terminate called recursively` (exit code 134) on CI runners that compile with IBVERBS support but lack RDMA hardware at runtime (e.g., GitHub Actions ubuntu-latest with `-DUSE_IBVERBS=ON`). There are two interacting bugs: 1. **GTEST_SKIP() called from worker threads**: `BaseTest::spawn()` runs `contextSize` threads via `spawnThreads()`, and each thread calls `GTEST_SKIP()` when `createDevice()` returns nullptr. GTest assertion/skip macros are not thread-safe — concurrent calls race on GTest's internal `TestPartResultReporterInterface`, corrupting state. 2. **IBVERBS Device destructor throws**: On CI runners with software RDMA devices (rxe/siw), `CreateDevice(IBVERBS)` succeeds but device teardown fails. `Device::~Device()` uses `GLOO_ENFORCE_EQ` which throws `EnforceNotMet` from a destructor — the compiler warns about this (`'throw' will always call 'terminate' [-Wterminate]`). This makes a "probe and discard" approach unsafe. Fix: Replace the in-thread `GTEST_SKIP()` with a `std::atomic<bool>` flag. Worker threads set the flag and return early when the transport is unavailable. After all threads join, the main thread checks the flag and calls `GTEST_SKIP()` — which is safe because it runs single-threaded. This avoids both the GTest data race and any device lifecycle side effects. Differential Revision: D95934130
2eb21de to
eb3891e
Compare
…estructor Summary: `gloo_test` crashes with `terminate called recursively` (exit code 134) on CI runners that compile with IBVERBS support but lack RDMA hardware at runtime (e.g., GitHub Actions ubuntu-latest with `-DUSE_IBVERBS=ON`). There are two interacting bugs: 1. **GTEST_SKIP() called from worker threads** (`base_test.h`): `BaseTest::spawn()` runs `contextSize` threads via `spawnThreads()`, and each thread calls `GTEST_SKIP()` when `createDevice()` returns nullptr. GTest assertion/skip macros are not thread-safe — concurrent calls race on GTest's internal `TestPartResultReporterInterface`, corrupting state. 2. **IBVERBS Device destructor throws** (`ibverbs/device.cc`): On CI runners with software RDMA devices (rxe/siw), `CreateDevice(IBVERBS)` succeeds but device teardown can fail. `Device::~Device()` uses `GLOO_ENFORCE_EQ` which throws `EnforceNotMet` from a destructor. In C++11+ this is guaranteed to call `std::terminate` — the compiler warns: `'throw' will always call 'terminate' [-Wterminate]`. Fixes: - **base_test.h**: Replace the in-thread `GTEST_SKIP()` with a `std::atomic<bool>` flag. Worker threads set the flag and return early when the transport is unavailable. After all threads join, the main thread checks the flag and calls `GTEST_SKIP()` — safe because it runs single-threaded. - **ibverbs/device.cc**: Replace `GLOO_ENFORCE_EQ` (which throws) in `Device::~Device()` with `GLOO_ERROR` (which logs). Destructors must never throw. Differential Revision: D95934130
eb3891e to
40ea929
Compare
2 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
See CI signals on PR: #494
Runner: ubuntu-latest is a standard GitHub-hosted runner, which is a vanilla VM with no InfiniBand/RDMA hardware
Build flags: -DUSE_IBVERBS=ON: compiles IBVERBS support and installs libibverbs-dev headers for compilation only
gloo_testcrashes withterminate called recursively(exit code 134) on CI runners that compile with IBVERBS/TLS support but lack the corresponding hardware at runtime (e.g., GitHub Actions ubuntu-latest with-DUSE_IBVERBS=ON).Root cause:
BaseTest::spawn()callsGTEST_SKIP()from worker threads whencreateDevice()returns nullptr for an unavailable transport. GTest assertion/skip macros are not thread-safe — concurrent calls from multiple threads race on GTest's internalTestPartResultReporterInterface, corrupting state. This leads to an exception during stack unwinding, triggering recursivestd::terminate().The crash manifests at the first IBVERBS test case (
AllgatherRing/AllgatherTest.VarNumPointer/360) because all prior transport tests (TCP, TCP_LAZY, TCP_TLS) succeed, and IBVERBS is the first transport wherecreateDevice()returns nullptr on a machine without RDMA hardware.Fix: Probe transport availability from the main test thread before spawning workers. If the transport is unavailable,
GTEST_SKIP()is called from the main thread (where it is safe) and the test returns early. Per-thread device creation is preserved for socket address isolation, with a silent early return as a defensive fallback.Differential Revision: D95934130