Fix [[nodiscard]] build errors and BUCK deps across comms, gloo, caffe2 #494
Open
gyllstromk wants to merge 1 commit into pytorch:main
63c8173 to
0a5ee91
Compare
@gyllstromk has exported this pull request. If you are a Meta employee, you can view the originating Diff in D93759269.
0a5ee91 to
046964b
Compare
pytorch-bot pushed a commit to pytorch/pytorch that referenced this pull request on Mar 9, 2026

…omms, gloo, caffe2 (#176671)

Summary:
X-link: pytorch/gloo#494
X-link: meta-pytorch/torchcomms#960

ROCm 7.0+ HIP headers annotate API functions (hipStreamDestroy, hipMemcpyAsync, hipStreamSynchronize, hipSetDevice, hipGetDevice, hipFree, hipHostUnregister, hipDeviceEnablePeerAccess, cuGetErrorString) with [[nodiscard]]. Combined with -Werror, this causes build failures wherever return values are discarded.

Originally discovered building with ROCm 7.2 headers, but confirmed to also affect ROCm 7.0 builds (reported independently by yvliu and hqguo). The [[nodiscard]] attribute is present in both ROCm 7.0 and 7.2 HIP headers, so the fix is the same for both versions.

Changes:
- Add (void) casts to suppress [[nodiscard]] warnings across comms/ (tcp_devmem, ctran, rcclx), gloo/, and caffe2/ (nativert): 12 C++ files
- Fix BUCK dependency issues in comms/tcp_devmem/nccl (replace devmgr-client with common:common) and comms/tcp_devmem/unpack (explicit glog dep path) that surface when building these targets under ROCm constraints

The (void) casts are no-ops on CUDA and older ROCm, so they are safe to land regardless of ROCm version.

Test Plan:

## Reproducing the [[nodiscard]] build errors (ROCm 7.0+)

ROCm 7.0 HIP headers annotate CUDA-mapped API functions with `[[nodiscard]]` (e.g. `hipStreamDestroy`, `hipSetDevice`, `hipFree`). With `-Werror` enabled, any call site that discards the return value fails to compile:

```
error: ignoring return value of function declared with 'nodiscard' attribute [-Werror,-Wunused-result]
    cudaSetDevice(device);
    ^~~~~~~~~~~~~~~~~~~~~~
```

To reproduce, build any affected target with ROCm >= 7.0. Example using the gloo HIP collectives (which calls `cudaSetDevice` and `cudaDeviceEnablePeerAccess` without checking the return value):

```bash
hipcc -std=c++17 -Werror \
  -I<pytorch_root> -I<rocm_7.0_path>/include \
  -c gloo/cuda_collectives_native.h
# → error: ignoring return value ... [-Werror,-Wunused-result]
```

The fix adds `(void)` casts to explicitly discard the return value, which is the standard C++ pattern for suppressing `[[nodiscard]]` warnings. The casts are no-ops on CUDA and older ROCm versions.

## Verification (fbcode, mode/amd-gpu with ROCm 7.0 headers)

```bash
# ctran (ibutils.cc, LogInit.cc)
buck2 build mode/amd-gpu fbcode//comms/ctran/backends/ib:ib
buck2 build mode/amd-gpu fbcode//comms/ctran/utils:utils

# rcclx (cudawrap.cc, register.cc, RcclxScubaLogger.h)
buck2 build mode/amd-gpu fbcode//comms/rcclx:rcclx-dev

# tcp_devmem (batch_unpack_producer.cc, shared_region.cc)
buck2 build mode/amd-gpu fbcode//comms/tcp_devmem/unpack:batch_unpack_producer
buck2 build mode/amd-gpu fbcode//comms/tcp_devmem/common:common

# gloo (cuda_collectives_native.h)
buck2 build mode/amd-gpu fbcode//gloo:gloo_gpu_hip
```

All targets build successfully. The BUCK dep fixes for `comms/tcp_devmem/nccl` (nccl-gen, nccl-sim) and `comms/tcp_devmem/unpack` resolve link-time errors that surface when building under ROCm constraints.

Reviewed By: bbeckca

Differential Revision: D93759269
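The `(void)` cast pattern described in this commit message can be illustrated without any ROCm headers. The following is a minimal sketch, not code from the PR: `fake_set_device`, `select_device_best_effort`, and `select_device_checked` are hypothetical names, and `fake_set_device` only stands in for a `[[nodiscard]]`-annotated runtime call such as `hipSetDevice`.

```cpp
// Hypothetical stand-in for a [[nodiscard]]-annotated runtime call.
// This is NOT the real HIP API; it returns 0 on success, nonzero on failure.
[[nodiscard]] inline int fake_set_device(int device) {
  return device >= 0 ? 0 : 1;
}

// Best-effort call site: the status is deliberately ignored.
inline void select_device_best_effort(int device) {
  // Writing `fake_set_device(device);` here would emit an unused-result
  // diagnostic, which becomes a hard error under -Werror.
  (void)fake_set_device(device);  // explicit discard, no diagnostic
}

// Call site where failure matters: inspect the status instead of discarding it.
inline bool select_device_checked(int device) {
  const int status = fake_set_device(device);
  return status == 0;
}
```

The cast only silences the diagnostic; it does not handle the error. Where a failed call would leave the program in a bad state, checking the status as in `select_device_checked` is likely the better choice.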
046964b to
d0f6d20
Compare
d0f6d20 to
1704537
Compare
1704537 to
17f8ded
Compare
17f8ded to
80cf7d5
Compare
80cf7d5 to
eabb08c
Compare
eabb08c to
eac7298
Compare
kapilsh added a commit to kapilsh/gloo that referenced this pull request Mar 10, 2026
Summary: See CI signals on PR: pytorch#494

Runner: ubuntu-latest is a standard GitHub-hosted runner, which is a vanilla VM with no InfiniBand/RDMA hardware

Build flags: -DUSE_IBVERBS=ON: compiles IBVERBS support and installs libibverbs-dev headers for compilation only

`gloo_test` crashes with `terminate called recursively` (exit code 134) on CI runners that compile with IBVERBS/TLS support but lack the corresponding hardware at runtime (e.g., GitHub Actions ubuntu-latest with `-DUSE_IBVERBS=ON`).

**Root cause:** `BaseTest::spawn()` calls `GTEST_SKIP()` from worker threads when `createDevice()` returns nullptr for an unavailable transport. GTest assertion/skip macros are not thread-safe: concurrent calls from multiple threads race on GTest's internal `TestPartResultReporterInterface`, corrupting state. This leads to an exception during stack unwinding, triggering recursive `std::terminate()`. The crash manifests at the first IBVERBS test case (`AllgatherRing/AllgatherTest.VarNumPointer/360`) because all prior transport tests (TCP, TCP_LAZY, TCP_TLS) succeed, and IBVERBS is the first transport where `createDevice()` returns nullptr on a machine without RDMA hardware.

**Fix:** Probe transport availability from the main test thread before spawning workers. If the transport is unavailable, `GTEST_SKIP()` is called from the main thread (where it is safe) and the test returns early. Per-thread device creation is preserved for socket address isolation, with a silent early return as a defensive fallback.

Differential Revision: D95934130
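The probe-before-spawn approach described in the fix above can be sketched without GTest as follows. This is an illustration under assumptions, not the code from the referenced commit: `FakeDevice`, `createFakeDevice`, `Outcome`, and `spawnWorkers` are hypothetical names, and the `kSkipped` return marks the point where a real test would call `GTEST_SKIP()` on the main thread.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <string>
#include <thread>
#include <vector>

// Hypothetical device type for this sketch only.
struct FakeDevice {
  std::string transport;
};

// Hypothetical factory: returns nullptr when the transport has no usable
// hardware, mirroring the createDevice() behavior described above.
inline std::shared_ptr<FakeDevice> createFakeDevice(const std::string& transport) {
  if (transport == "IBVERBS") {
    return nullptr;  // pretend this machine has no RDMA hardware
  }
  return std::make_shared<FakeDevice>(FakeDevice{transport});
}

enum class Outcome { kRan, kSkipped };

inline Outcome spawnWorkers(const std::string& transport,
                            int numWorkers,
                            std::atomic<int>& completed) {
  assert(numWorkers > 0);

  // Probe once on the calling (main) thread. A GTest-based test would
  // call GTEST_SKIP() here, where only one thread touches the reporter.
  if (createFakeDevice(transport) == nullptr) {
    return Outcome::kSkipped;
  }

  std::vector<std::thread> workers;
  workers.reserve(static_cast<std::size_t>(numWorkers));
  for (int i = 0; i < numWorkers; ++i) {
    workers.emplace_back([&transport, &completed] {
      // Each worker still creates its own device.
      auto device = createFakeDevice(transport);
      if (!device) {
        return;  // silent defensive fallback; no skip macro off the main thread
      }
      completed.fetch_add(1);
    });
  }
  for (auto& worker : workers) {
    worker.join();
  }
  return Outcome::kRan;
}
```

Calling `spawnWorkers("TCP", 4, counter)` runs all four workers, while `spawnWorkers("IBVERBS", 4, counter)` returns `Outcome::kSkipped` before any thread is created. The design choice is that the skip decision is made exactly once, on one thread, so worker threads never report a result concurrently.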
Summary:
ROCm 7.0+ HIP headers annotate API functions (hipStreamDestroy,
hipMemcpyAsync, hipStreamSynchronize, hipSetDevice, hipGetDevice, hipFree,
hipHostUnregister, hipDeviceEnablePeerAccess, cuGetErrorString) with
[[nodiscard]]. Combined with -Werror, this causes build failures wherever
return values are discarded.
Originally discovered building with ROCm 7.2 headers, but confirmed to
also affect ROCm 7.0 builds (reported independently by yvliu and hqguo).
The [[nodiscard]] attribute is present in both ROCm 7.0 and 7.2 HIP
headers — the fix is the same for both versions.
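Changes:
- Add (void) casts to suppress [[nodiscard]] warnings across comms/
  (tcp_devmem, ctran, rcclx), gloo/, and caffe2/ (nativert), 12 C++ files in total
- Fix BUCK dependency issues in comms/tcp_devmem/nccl (replace devmgr-client
  with common:common) and comms/tcp_devmem/unpack (explicit glog dep path)
  that surface when building these targets under ROCm constraints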
The (void) casts are no-ops on CUDA and older ROCm — safe to land
regardless of ROCm version.
Reviewed By: bbeckca
Differential Revision: D93759269