Skip to content

Recover warp_shuffle original behavior (revert #8210)#8254

Open
fbusato wants to merge 2 commits intoNVIDIA:mainfrom
fbusato:revert-warp-shuffle-constraints
Open

Recover warp_shuffle original behavior (revert #8210)#8254
fbusato wants to merge 2 commits intoNVIDIA:mainfrom
fbusato:revert-warp-shuffle-constraints

Conversation

@fbusato
Copy link
Copy Markdown
Contributor

@fbusato fbusato commented Apr 1, 2026

Description

#8210 enforces additional constraints to the type allowed in cuda::device::warp_shuffle_*, namely default contractible and trivially copyable.

While this is conceptually correct, the new constraints prevent using warp shuffle instructions with types that practically satisfy these property but where the type traits fail, e.g. __half, __nv_bfloat16, other reduced precision floating-point types, composition of them like array and structures. Enforcing such constraints prevent using them in many context, e.g. CUB.

This PR reverts the original behavior until we don't find a reliable way to prevent the problem.

@fbusato fbusato added this to CCCL Apr 1, 2026
@fbusato fbusato requested a review from a team as a code owner April 1, 2026 01:08
@fbusato fbusato added the libcu++ For all items related to libcu++ label Apr 1, 2026
@fbusato fbusato requested a review from davebayer April 1, 2026 01:08
@github-project-automation github-project-automation bot moved this to Todo in CCCL Apr 1, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Apr 1, 2026
[[nodiscard]] _CCCL_DEVICE_API warp_shuffle_result<_Up> warp_shuffle_idx(
const _Tp& __data, int __src_lane, uint32_t __lane_mask = 0xFFFFFFFF, ::cuda::std::integral_constant<int, _Width> = {})
{
static_assert(::cuda::std::is_default_constructible_v<_Tp>, "_Tp must be default constructible");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Instead of wholesale removing these checks, should we just add explicit exceptions for known types? With the ability to allow people to proclaim types as valid for use with these APIs?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @fbusato explained to me, the problem is that if there is a struct containing __half, it won't be trivially copyable.. However, we do the same think for cuda::std::bit_cast and noone has complained yet.

But I think we should keep at least the requirement on default constructibility.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would second default_constructability, because that is a much clearer error message than what a C++ compiler generates 5 lines below

Copy link
Copy Markdown
Collaborator

@jrhemstad jrhemstad Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend taking a look at what we did in cuCollections by offering a is_bitwise_comparable custom trait. By default, we use has_unique_object_representation<T>, but that is false for floating-point values due to NaNs. However, for the majority of use cases that doesn't matter, and so we allow an escape hatch of specializing is_bitwise_comparable to opt-in. We emit a helpful diagnostic when this situation arises pointing people towards specializing is_bitwise_comparable.

We could do something similar here.

https://github.com/NVIDIA/cuCollections/blob/6477be2182668015f9a91e3a0bb7e248eceecd09/include/cuco/utility/traits.hpp#L24-L60

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the idea is a bit invasive but nice. The problem affects other warp instructions as well, so this solution applies to all of them. We can specialize the new type traits for reduced precision floating points + array.

I opened an RFE for the compiler nvbug 5497120 a while ago. We can rely on the proposed solution until we don't get an official workaround.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the funny aspect is that has_unique_object_representation<T> recognizes __half, __nv_bfloat16 as unique object representation, while this is not the case

@github-actions

This comment has been minimized.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 2, 2026

🥳 CI Workflow Results

🟩 Finished in 1h 49m: Pass: 100%/99 | Total: 2d 03h | Max: 1h 30m | Hits: 94%/257973

See results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

libcu++ For all items related to libcu++

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

4 participants