Fix [[nodiscard]] build errors and BUCK deps across comms, gloo, caffe2 #494

Open
gyllstromk wants to merge 1 commit into pytorch:main from gyllstromk:export-D93759269

@gyllstromk (Contributor)

Summary:
ROCm 7.0+ HIP headers annotate API functions (hipStreamDestroy,
hipMemcpyAsync, hipStreamSynchronize, hipSetDevice, hipGetDevice, hipFree,
hipHostUnregister, hipDeviceEnablePeerAccess, cuGetErrorString) with
[[nodiscard]]. Combined with -Werror, this causes build failures wherever
return values are discarded.
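
As a rough illustration of the failure mode described above, the sketch below uses a hypothetical stand-in (`fakeSetDevice`, `FakeStatus`) rather than the real HIP API, so it compiles without ROCm headers. It assumes only standard C++17 `[[nodiscard]]` behavior: a call whose result is dropped is diagnosed, and `-Werror` turns that diagnostic into a build failure.

```cpp
#include <cassert>

// Hypothetical stand-in for a HIP-style status code (not the real hipError_t).
enum class FakeStatus { Success = 0, InvalidDevice = 1 };

// Hypothetical stand-in for an API function annotated with [[nodiscard]],
// in the way the summary describes for hipSetDevice in ROCm 7.0+ headers.
[[nodiscard]] inline FakeStatus fakeSetDevice(int device) {
  return device >= 0 ? FakeStatus::Success : FakeStatus::InvalidDevice;
}

// Consuming the result compiles cleanly, even with -Werror.
inline bool selectDevice(int device) {
  // A bare call such as `fakeSetDevice(device);` would be diagnosed as
  // ignoring the return value of a 'nodiscard' function.
  return fakeSetDevice(device) == FakeStatus::Success;
}
```

Here `selectDevice(0)` reports success and `selectDevice(-1)` reports failure; the names are illustrative only.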

Originally discovered while building with ROCm 7.2 headers, but confirmed to
also affect ROCm 7.0 builds (reported independently by yvliu and hqguo).
The [[nodiscard]] attribute is present in both ROCm 7.0 and 7.2 HIP
headers, so the fix is the same for both versions.

Changes:

  • Add (void) casts to suppress [[nodiscard]] warnings across comms/
    (tcp_devmem, ctran, rcclx), gloo/, and caffe2/ (nativert) — 12 C++ files
  • Fix BUCK dependency issues in comms/tcp_devmem/nccl (replace devmgr-client
    with common:common) and comms/tcp_devmem/unpack (explicit glog dep path)
    that surface when building these targets under ROCm constraints

The (void) casts are no-ops on CUDA and older ROCm — safe to land
regardless of ROCm version.

Reviewed By: bbeckca

Differential Revision: D93759269

@meta-cla meta-cla bot added the CLA Signed label Mar 6, 2026
@weifengpy weifengpy requested a review from d4l3k March 6, 2026 20:45
@gyllstromk gyllstromk force-pushed the export-D93759269 branch 2 times, most recently from 63c8173 to 0a5ee91 Compare March 6, 2026 23:59
meta-codesync bot commented Mar 6, 2026

@gyllstromk has exported this pull request. If you are a Meta employee, you can view the originating Diff in D93759269.

pytorch-bot bot pushed a commit to pytorch/pytorch that referenced this pull request Mar 9, 2026
…omms, gloo, caffe2 (#176671)

Summary:
X-link: pytorch/gloo#494

X-link: meta-pytorch/torchcomms#960


Test Plan:
## Reproducing the [[nodiscard]] build errors (ROCm 7.0+)

ROCm 7.0 HIP headers annotate CUDA-mapped API functions with
`[[nodiscard]]` (e.g. `hipStreamDestroy`, `hipSetDevice`, `hipFree`).
With `-Werror` enabled, any call site that discards the return value
fails to compile:

```
error: ignoring return value of function declared with 'nodiscard'
attribute [-Werror,-Wunused-result]
    cudaSetDevice(device);
    ^~~~~~~~~~~~~~~~~~~~~~
```

To reproduce, build any affected target with ROCm >= 7.0. Example using
the gloo HIP collectives (which call `cudaSetDevice` and
`cudaDeviceEnablePeerAccess` without checking the return value):

```bash
hipcc -std=c++17 -Werror \
  -I<pytorch_root> -I<rocm_7.0_path>/include \
  -c gloo/cuda_collectives_native.h
# → error: ignoring return value ... [-Werror,-Wunused-result]
```

The fix adds `(void)` casts to explicitly discard the return value,
which is the standard C++ pattern for suppressing `[[nodiscard]]`
warnings. The casts are no-ops on CUDA and older ROCm versions.
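
A minimal sketch of the `(void)` cast pattern follows. It again uses hypothetical stand-ins (`fakeStreamDestroy`, `callCount`, `cleanup`) instead of real HIP calls, so it is not the code from this change; it only shows that the cast silences the diagnostic while the call itself still runs.

```cpp
#include <cassert>

// Hypothetical stand-in for a HIP-style status code (not the real hipError_t).
enum class FakeStatus { Success = 0, Failure = 1 };

// Call counter so the example can show that the call still executes.
inline int& callCount() {
  static int n = 0;
  return n;
}

// Hypothetical stand-in for a [[nodiscard]] API such as hipStreamDestroy.
[[nodiscard]] inline FakeStatus fakeStreamDestroy(int /*stream*/) {
  ++callCount();
  return FakeStatus::Success;
}

inline void cleanup(int stream) {
  // Explicitly discard the status. The function is still invoked;
  // only the nodiscard diagnostic is suppressed.
  (void)fakeStreamDestroy(stream);
}
```

The cast discards the status on purpose, so it suits call sites (for example, teardown paths) where a failure cannot be acted on anyway.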

## Verification (fbcode, mode/amd-gpu with ROCm 7.0 headers)

```bash
# ctran (ibutils.cc, LogInit.cc)
buck2 build mode/amd-gpu fbcode//comms/ctran/backends/ib:ib
buck2 build mode/amd-gpu fbcode//comms/ctran/utils:utils

# rcclx (cudawrap.cc, register.cc, RcclxScubaLogger.h)
buck2 build mode/amd-gpu fbcode//comms/rcclx:rcclx-dev

# tcp_devmem (batch_unpack_producer.cc, shared_region.cc)
buck2 build mode/amd-gpu fbcode//comms/tcp_devmem/unpack:batch_unpack_producer
buck2 build mode/amd-gpu fbcode//comms/tcp_devmem/common:common

# gloo (cuda_collectives_native.h)
buck2 build mode/amd-gpu fbcode//gloo:gloo_gpu_hip
```

All targets build successfully. The BUCK dep fixes for
`comms/tcp_devmem/nccl` (nccl-gen, nccl-sim) and
`comms/tcp_devmem/unpack` resolve link-time errors that surface when
building under ROCm constraints.

Reviewed By: bbeckca

Differential Revision: D93759269
gyllstromk added a commit to gyllstromk/gloo that referenced this pull request Mar 9, 2026
…e2 (pytorch#494)

gyllstromk added a commit to gyllstromk/torchcomms that referenced this pull request Mar 9, 2026
…e2 (meta-pytorch#960)

gyllstromk added a commit to gyllstromk/pytorch that referenced this pull request Mar 9, 2026
…omms, gloo, caffe2 (pytorch#176671)

kapilsh added a commit to kapilsh/gloo that referenced this pull request Mar 10, 2026
Summary:
See CI signals on PR: pytorch#494

Runner: ubuntu-latest is a standard GitHub-hosted runner, which is a vanilla VM with no InfiniBand/RDMA hardware
 
Build flags: -DUSE_IBVERBS=ON: compiles IBVERBS support and installs libibverbs-dev headers for compilation only

`gloo_test` crashes with `terminate called recursively` (exit code 134) on CI runners that compile with IBVERBS/TLS support but lack the corresponding hardware at runtime (e.g., GitHub Actions ubuntu-latest with `-DUSE_IBVERBS=ON`).

**Root cause:** `BaseTest::spawn()` calls `GTEST_SKIP()` from worker threads when `createDevice()` returns nullptr for an unavailable transport. GTest assertion/skip macros are not thread-safe — concurrent calls from multiple threads race on GTest's internal `TestPartResultReporterInterface`, corrupting state. This leads to an exception during stack unwinding, triggering recursive `std::terminate()`.

The crash manifests at the first IBVERBS test case (`AllgatherRing/AllgatherTest.VarNumPointer/360`) because all prior transport tests (TCP, TCP_LAZY, TCP_TLS) succeed, and IBVERBS is the first transport where `createDevice()` returns nullptr on a machine without RDMA hardware.

**Fix:** Probe transport availability from the main test thread before spawning workers. If the transport is unavailable, `GTEST_SKIP()` is called from the main thread (where it is safe) and the test returns early. Per-thread device creation is preserved for socket address isolation, with a silent early return as a defensive fallback.
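
The probe-then-spawn shape described in the fix could look roughly like the sketch below. It is not the gloo test code: `FakeDevice`, `Outcome`, `spawnWorkers`, and this boolean-driven `createDevice` are hypothetical stand-ins, and a returned `Outcome::Skipped` takes the place of `GTEST_SKIP()` so the example builds without GoogleTest.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical stand-in for a transport device.
struct FakeDevice {};

// Hypothetical factory: returns nullptr when the transport is unavailable.
inline std::shared_ptr<FakeDevice> createDevice(bool transportAvailable) {
  return transportAvailable ? std::make_shared<FakeDevice>() : nullptr;
}

enum class Outcome { Ran, Skipped };

inline Outcome spawnWorkers(bool transportAvailable, int numWorkers,
                            std::atomic<int>& ran) {
  // Probe once on the calling (main) thread. A real test would call
  // GTEST_SKIP() here, before any worker thread exists.
  if (!createDevice(transportAvailable)) {
    return Outcome::Skipped;
  }

  std::vector<std::thread> workers;
  for (int i = 0; i < numWorkers; ++i) {
    workers.emplace_back([&ran, transportAvailable] {
      // Each worker still creates its own device.
      auto device = createDevice(transportAvailable);
      if (!device) {
        return;  // Defensive fallback: return silently, no test macros.
      }
      ran.fetch_add(1);
    });
  }
  for (auto& t : workers) {
    t.join();
  }
  return Outcome::Ran;
}
```

With the transport unavailable, no worker thread is started and the skip decision is made on a single thread; with it available, every worker runs.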

Differential Revision: D95934130