Skip to content

[cub] Fix compilation warnings with nvc++#9068

Open
davebayer wants to merge 1 commit into
NVIDIA:mainfrom
davebayer:fix_nvhpc_cub_warnings
Open

[cub] Fix compilation warnings with nvc++#9068
davebayer wants to merge 1 commit into
NVIDIA:mainfrom
davebayer:fix_nvhpc_cub_warnings

Conversation

@davebayer
Copy link
Copy Markdown
Contributor

This PR fixes compilation warnings emitted during CUB compilation with nvc++.

@davebayer davebayer requested review from a team as code owners May 19, 2026 08:02
@davebayer davebayer requested a review from ericniebler May 19, 2026 08:02
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 19, 2026
@davebayer davebayer requested a review from srinivasyadav18 May 19, 2026 08:02
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 19, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 19, 2026

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced compatibility with the NVHPC CUDA compiler for atomic operations and mathematical functions.
    • Improved device/host code dispatch for atomic operations and memory operations.
  • Refactor

    • Optimized compiler-specific code paths for better cross-compiler support and reliability.

Walkthrough

This PR applies NVHPC compiler workarounds for atomicAdd_block and __umul64hi by using fallback implementations, migrates ten atomic dispatcher functions to the NV_IF_ELSE_TARGET pattern for consistent device/host selection, and adjusts test storage declarations and inline assembly constraints.

Changes

NVHPC Compiler Workarounds

Layer / File(s) Summary
Histogram accumulation atomic workaround
cub/cub/agent/agent_histogram.cuh
Three accumulation paths (RLE per-run, RLE final-pixel, non-RLE per-pixel) now use plain atomicAdd under NVHPC, bypassing atomicAdd_block to work around a compiler issue.
64-bit multiply-high inline PTX
cub/cub/detail/fast_modulo_division.cuh, libcudacxx/include/cuda/__cmath/mul_hi.h
NVHPC builds use inline PTX (mul.hi.u64) instead of __umul64hi for 64-bit high-word multiplication; non-NVHPC continues using the intrinsic.
Complex literal API decoration for NVHPC
libcudacxx/include/cuda/std/__complex/literals.h
Long-double complex literal overloads conditionally omit _CCCL_API decoration when compiling with NVHPC to avoid hard errors in device code.

Atomic Dispatcher Migration to NV_IF_ELSE_TARGET

Layer / File(s) Summary
Fence and signal fence dispatchers
libcudacxx/include/cuda/std/__atomic/types/base.h
__atomic_thread_fence_dispatch and __atomic_signal_fence_dispatch migrate to NV_IF_ELSE_TARGET for device/host routing.
Store, load, and exchange dispatchers
libcudacxx/include/cuda/std/__atomic/types/base.h
__atomic_store/load/exchange_dispatch functions switch to NV_IF_ELSE_TARGET with updated return/flow control.
Compare-exchange dispatchers
libcudacxx/include/cuda/std/__atomic/types/base.h
Strong and weak compare-exchange dispatchers migrate to NV_IF_ELSE_TARGET with __result assigned via wrapped expressions.
Fetch-style operation dispatchers
libcudacxx/include/cuda/std/__atomic/types/base.h
Fetch-add, fetch-sub, fetch-and/or/xor, and min/max dispatchers convert to NV_IF_ELSE_TARGET with device branches returning through expressions.
Memory comparison dispatcher
libcudacxx/include/cuda/std/__atomic/types/common.h
__atomic_memcmp refactors to NV_IF_ELSE_TARGET, wrapping device-side bytewise comparison in a statement-expression.

Test and Implementation Adjustments

Layer / File(s) Summary
Storage duration adjustments
cub/test/catch2_test_device_exclusive_scan_noncommutative.cu, cub/test/catch2_test_device_segmented_scan_multi_segment.cu, cub/test/catch2_test_device_segmented_scan_noncommutative.cu
Lookup table declarations change from static constexpr to constexpr, removing static storage while preserving const-initialization.
Host/device qualification cleanup
cub/test/catch2_test_device_partition_flagged.cu
__host__ __device__ qualifiers removed from operator<< friend declaration in test helper type.
Inline assembly and control flow
cub/test/catch2_test_device_transform.cu, cub/cub/block/block_load_to_shared.cuh
Shared-memory references in kernel inline-assembly use input-only constraints; __try_wait() adds _CCCL_UNREACHABLE() after dispatcher.
Type casting in floating-point test
cub/test/internal/catch2_test_integer_utils.cu
Integer operation result explicitly cast to floating-point type T before comparison in assertion.

Suggested reviewers

  • pciolkosz
  • ericniebler
  • bernhardmgruber

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
libcudacxx/include/cuda/std/__complex/literals.h (1)

49-53: ⚡ Quick win

suggestion: Use _CCCL_HOST_API on the NVHPC branch instead of leaving these overloads undecorated.
This workaround clearly wants host-only literals under NVHPC, but dropping the API macro entirely makes the availability of these public overloads depend on compiler defaults. Switching the NVHPC branch to _CCCL_HOST_API keeps the intent explicit and preserves the normal libcudacxx annotation contract. As per coding guidelines, "ensure functions are properly annotated with _CCCL_HOST_API, _CCCL_DEVICE_API, or _CCCL_API".

Also applies to: 58-62, 67-71, 81-85


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6f58e9e1-4468-4f3c-9fd3-231eb8cb0750

📥 Commits

Reviewing files that changed from the base of the PR and between 96e580b and eb2cfbc.

📒 Files selected for processing (13)
  • cub/cub/agent/agent_histogram.cuh
  • cub/cub/block/block_load_to_shared.cuh
  • cub/cub/detail/fast_modulo_division.cuh
  • cub/test/catch2_test_device_exclusive_scan_noncommutative.cu
  • cub/test/catch2_test_device_partition_flagged.cu
  • cub/test/catch2_test_device_segmented_scan_multi_segment.cu
  • cub/test/catch2_test_device_segmented_scan_noncommutative.cu
  • cub/test/catch2_test_device_transform.cu
  • cub/test/internal/catch2_test_integer_utils.cu
  • libcudacxx/include/cuda/__cmath/mul_hi.h
  • libcudacxx/include/cuda/std/__atomic/types/base.h
  • libcudacxx/include/cuda/std/__atomic/types/common.h
  • libcudacxx/include/cuda/std/__complex/literals.h

Comment on lines 106 to 119
NV_IS_HOST,
(return static_cast<unsigned_t>((static_cast<larger_t>(value) * multiplier) >> NumBits);),
({return (sizeof(T) == 8)
// nvc++ doesn't implement __umul64hi and crashes, so we replace it with inline PTX
#if _CCCL_CUDA_COMPILER(NVHPC)
? [value, multiplier](){
unsigned long long result;
asm("mul.hi.u64 %0, %1, %2;" : "=l"(result) : "l"(value), "l"(multiplier));
return static_cast<unsigned_t>(result);
}()
#else // ^^^ _CCCL_CUDA_COMPILER(NVHPC) ^^^ / vvv !_CCCL_CUDA_COMPILER(NVHPC) vvv
? static_cast<unsigned_t>(__umul64hi(value, multiplier))
#endif // ^^^ !_CCCL_CUDA_COMPILER(NVHPC) ^^^
: static_cast<unsigned_t>((static_cast<larger_t>(value) * multiplier) >> NumBits);}));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical: We use conditional compilation inside an NV_IF_TARGET. you need to wrap outside of it

@github-project-automation github-project-automation Bot moved this from In Review to In Progress in CCCL May 19, 2026
Comment on lines +729 to +732
if (threadIdx.x == 0)
{
ssmem = 0;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: can you please add a comment why this is necessary? And that it's because of nvc++.

{
[[maybe_unused]] const auto __lhs1 = static_cast<unsigned long long>(__lhs);
[[maybe_unused]] const auto __rhs1 = static_cast<unsigned long long>(__rhs);
// nvc++ doesn't implement __umul64hi and crashes, so we replace it with inline PTX
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Consider filing a bug report

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've already told @dkolsen-pgi

Comment on lines +310 to +313
// nvc++ has a bug that it does not treat atomicAdd_block as a builtin and ptxas fails due to unresolved
// symbol.
#if _CCCL_CUDA_COMPILER(NVHPC)
atomicAdd(privatized_histograms[ch] + bins[pixel], accumulator);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Why hasn't this come up yet? Or: why hasn't this been reported by the nvc++ team yet? What change makes this a problem now? I want to understand where this is coming from.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably because they don't use histogram. And we don't use it anywhere else

@github-actions
Copy link
Copy Markdown
Contributor

😬 CI Workflow Results

🟥 Finished in 4h 38m: Pass: 78%/341 | Total: 10d 10h | Max: 1h 40m | Hits: 32%/1718464

See results here.

Copy link
Copy Markdown
Collaborator

@jrhemstad jrhemstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to discuss the motivation for this before adding additional complexity to CUB's source code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

4 participants