[cub] Fix compilation warnings with nvc++#9068
Conversation
📝 WalkthroughSummary by CodeRabbit
WalkthroughThis PR applies NVHPC compiler workarounds for atomicAdd_block and __umul64hi by using fallback implementations, migrates ten atomic dispatcher functions to the NV_IF_ELSE_TARGET pattern for consistent device/host selection, and adjusts test storage declarations and inline assembly constraints. ChangesNVHPC Compiler Workarounds
Atomic Dispatcher Migration to NV_IF_ELSE_TARGET
Test and Implementation Adjustments
Suggested reviewers
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
libcudacxx/include/cuda/std/__complex/literals.h (1)
49-53: ⚡ Quick winsuggestion: Use
_CCCL_HOST_APIon the NVHPC branch instead of leaving these overloads undecorated.
This workaround clearly wants host-only literals under NVHPC, but dropping the API macro entirely makes the availability of these public overloads depend on compiler defaults. Switching the NVHPC branch to_CCCL_HOST_APIkeeps the intent explicit and preserves the normal libcudacxx annotation contract. As per coding guidelines, "ensure functions are properly annotated with_CCCL_HOST_API,_CCCL_DEVICE_API, or_CCCL_API".Also applies to: 58-62, 67-71, 81-85
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6f58e9e1-4468-4f3c-9fd3-231eb8cb0750
📒 Files selected for processing (13)
cub/cub/agent/agent_histogram.cuhcub/cub/block/block_load_to_shared.cuhcub/cub/detail/fast_modulo_division.cuhcub/test/catch2_test_device_exclusive_scan_noncommutative.cucub/test/catch2_test_device_partition_flagged.cucub/test/catch2_test_device_segmented_scan_multi_segment.cucub/test/catch2_test_device_segmented_scan_noncommutative.cucub/test/catch2_test_device_transform.cucub/test/internal/catch2_test_integer_utils.culibcudacxx/include/cuda/__cmath/mul_hi.hlibcudacxx/include/cuda/std/__atomic/types/base.hlibcudacxx/include/cuda/std/__atomic/types/common.hlibcudacxx/include/cuda/std/__complex/literals.h
| NV_IS_HOST, | ||
| (return static_cast<unsigned_t>((static_cast<larger_t>(value) * multiplier) >> NumBits);), | ||
| ({return (sizeof(T) == 8) | ||
| // nvc++ doesn't implement __umul64hi and crashes, so we replace it with inline PTX | ||
| #if _CCCL_CUDA_COMPILER(NVHPC) | ||
| ? [value, multiplier](){ | ||
| unsigned long long result; | ||
| asm("mul.hi.u64 %0, %1, %2;" : "=l"(result) : "l"(value), "l"(multiplier)); | ||
| return static_cast<unsigned_t>(result); | ||
| }() | ||
| #else // ^^^ _CCCL_CUDA_COMPILER(NVHPC) ^^^ / vvv !_CCCL_CUDA_COMPILER(NVHPC) vvv | ||
| ? static_cast<unsigned_t>(__umul64hi(value, multiplier)) | ||
| #endif // ^^^ !_CCCL_CUDA_COMPILER(NVHPC) ^^^ | ||
| : static_cast<unsigned_t>((static_cast<larger_t>(value) * multiplier) >> NumBits);})); |
There was a problem hiding this comment.
Critical: We use conditional compilation inside an NV_IF_TARGET. you need to wrap outside of it
| if (threadIdx.x == 0) | ||
| { | ||
| ssmem = 0; | ||
| } |
There was a problem hiding this comment.
Suggestion: can you please add a comment why this is necessary? And that it's because of nvc++.
| { | ||
| [[maybe_unused]] const auto __lhs1 = static_cast<unsigned long long>(__lhs); | ||
| [[maybe_unused]] const auto __rhs1 = static_cast<unsigned long long>(__rhs); | ||
| // nvc++ doesn't implement __umul64hi and crashes, so we replace it with inline PTX |
There was a problem hiding this comment.
Suggestion: Consider filing a bug report
| // nvc++ has a bug that it does not treat atomicAdd_block as a builtin and ptxas fails due to unresolved | ||
| // symbol. | ||
| #if _CCCL_CUDA_COMPILER(NVHPC) | ||
| atomicAdd(privatized_histograms[ch] + bins[pixel], accumulator); |
There was a problem hiding this comment.
Q: Why hasn't this come up yet? Or: why hasn't this been reported by the nvc++ team yet? What change makes this a problem now? I want to understand where this is coming from.
There was a problem hiding this comment.
Probably because they don't use histogram. And we don't use it anywhere else
😬 CI Workflow Results🟥 Finished in 4h 38m: Pass: 78%/341 | Total: 10d 10h | Max: 1h 40m | Hits: 32%/1718464See results here. |
jrhemstad
left a comment
There was a problem hiding this comment.
I'd like to discuss the motivation for this before adding additional complexity to CUB's source code.
This PR fixes compilation warnings emitted during CUB compilation with
nvc++.