Refactor DeviceReduce dispatch logic#9088

Open
bernhardmgruber wants to merge 3 commits into
NVIDIA:mainfrom
bernhardmgruber:ref_reduce
Conversation

@bernhardmgruber

No description provided.

@bernhardmgruber bernhardmgruber requested a review from a team as a code owner May 21, 2026 09:22
@bernhardmgruber bernhardmgruber requested a review from fbusato May 21, 2026 09:22
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 21, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 21, 2026
@coderabbitai

coderabbitai Bot commented May 21, 2026

📝 Walkthrough

Summary by CodeRabbit

  • Refactor
    • Reworked internal device reduction logic to centralize min/max handling, standardize determinism fallbacks for sum/min/max operations, and simplify dispatch paths.
    • No public API changes; reduction results are now more consistent, and reduction operations behave more reliably across different environment settings.

suggestion:

Walkthrough

This PR refactors determinism dispatch for DeviceReduce: reduce_impl now uses compile-time determinism branches that call detail::dispatch_with_env_and_tuning; determinism fallback computation was moved into __transform_reduce; min/max behavior was centralized into __minmax_reduce; Sum/Min/Max env overloads now delegate to those helpers.

Changes

Device reduce determinism dispatch refactoring

| Layer / File(s) | Summary |
|---|---|
| reduce_impl determinism dispatch (cub/cub/device/device_reduce.cuh) | reduce_impl dispatch switched from lambda-based tuning/query to explicit if constexpr branches on determinism tags (gpu_to_gpu, not_guaranteed, default), each calling the corresponding dispatch routine via detail::dispatch_with_env_and_tuning. |
| Transform-reduce determinism handling (cub/cub/device/device_reduce.cuh) | __transform_reduce now computes default_determinism_t from environment requirements, defines determinism-dependent fallback predicates for integral types and float/double plus/min/max operators, and refines the "4B or greater" condition to use sizeof(AccumT). |
| Min/max reduce helper introduction (cub/cub/device/device_reduce.cuh) | New private __minmax_reduce helper centralizes min/max env determinism handling: rejects gpu_to_gpu, forces run_to_run, computes output limits, and forwards to reduce_impl with identity transform. |
| Public API rewiring (cub/cub/device/device_reduce.cuh) | Sum env overload simplified to call __transform_reduce with plus/identity. Min and Max env overloads compute OutputT and limits, then delegate to __minmax_reduce, removing duplicated determinism bodies. |
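
To make the dispatch shape described above easier to picture, here is a minimal host-side C++17 sketch of branching on a determinism tag with if constexpr, plus a min/max helper that rejects gpu_to_gpu and forces run_to_run. All names here (the tag structs, `path`, `select_path`, `minmax_determinism`) are hypothetical stand-ins for illustration and are not the identifiers used in cub/cub/device/device_reduce.cuh.

```cpp
#include <type_traits>

// Hypothetical determinism tags; the real tag types live in CCCL and differ.
struct gpu_to_gpu_t {};
struct run_to_run_t {};
struct not_guaranteed_t {};

// Hypothetical label for the dispatch routine a branch would call.
enum class path { gpu_to_gpu, not_guaranteed, default_path };

// Compile-time branch on the tag type: only the selected branch is instantiated.
template <typename DeterminismT>
constexpr path select_path(DeterminismT)
{
  if constexpr (std::is_same_v<DeterminismT, gpu_to_gpu_t>)
  {
    return path::gpu_to_gpu;
  }
  else if constexpr (std::is_same_v<DeterminismT, not_guaranteed_t>)
  {
    return path::not_guaranteed;
  }
  else
  {
    return path::default_path;
  }
}

// Min/max helper in the spirit of the summary: reject gpu_to_gpu at compile
// time and always continue with run_to_run.
template <typename RequestedT>
constexpr run_to_run_t minmax_determinism(RequestedT)
{
  static_assert(!std::is_same_v<RequestedT, gpu_to_gpu_t>,
                "gpu_to_gpu determinism is rejected for min/max in this sketch");
  return run_to_run_t{};
}
```

In this sketch, `select_path(run_to_run_t{})` falls through to the default branch, and `minmax_determinism(gpu_to_gpu_t{})` fails to compile.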

Suggested reviewers

  • shwina
  • pauleonix


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 306f06d3-aefb-42db-b33e-5527015ec795

📥 Commits

Reviewing files that changed from the base of the PR and between 8b47911 and b295493.

📒 Files selected for processing (1)
  • cub/cub/device/device_reduce.cuh

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
cub/cub/device/device_reduce.cuh (1)

114-161: important: This changes dispatch and determinism selection on a Device* algorithm path. Please do the required before/after SASS comparison and run the CUB benchmarks before merge; otherwise codegen and throughput regressions in the new dispatch split stay unvalidated.

As per coding guidelines, **/cub/**/device*.{cu,cuh,h}: Verify no SASS code generation changes occur for Device* algorithms in CUB by comparing generated SASS output before and after changes; Run benchmark tests using the CUB Benchmarks framework when modifying Device* algorithms to verify no performance regressions occur.

Also applies to: 173-240, 244-270, 549-551


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6fde9ca1-11b8-4a40-aa81-d6719380fcf5

📥 Commits

Reviewing files that changed from the base of the PR and between b295493 and 96c47e0.

📒 Files selected for processing (1)
  • cub/cub/device/device_reduce.cuh

@github-actions

🥳 CI Workflow Results

🟩 Finished in 2h 33m: Pass: 100%/285 | Total: 11d 04h | Max: 2h 32m | Hits: 21%/914257

See results here.

@tpn tpn left a comment

Reviewed the DeviceReduce dispatch refactor. No blocking issues from me; the CI summary is green.

return reduce_impl(
d_in, d_out, num_items, ::cuda::std::plus<>{}, ::cuda::std::identity{}, InitT{}, determinism_t{}, env);
using accum_t = ::cuda::std::__accumulator_t<::cuda::std::plus<>, cub::detail::it_value_t<InputIteratorT>, OutputT>;
return __transform_reduce<accum_t>(

Critical: this breaks Sum(uint8_t*, uint8_t*, ...), which no longer compiles. Prior to this change, we fell back to run-to-run determinism for output types smaller than 4 bytes, but __transform_reduce checks AccumT instead of OutputT. For uint8_t, AccumT is int due to integer promotion, so the size check passes and the kernel attempts atomicAdd(unsigned char*, int), which doesn't compile.
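
The promotion issue can be reproduced on the host without any CUDA code. The sketch below uses two hypothetical helpers, `plus_result_t` (a simplified stand-in for the accumulator-type trait) and `is_4b_or_greater` (mirroring the size predicate mentioned in the review); neither is the actual CCCL definition.

```cpp
#include <cstdint>
#include <type_traits>
#include <utility>

// Hypothetical, simplified accumulator type: the result type of T + T.
// Integer promotion applies here, just as it does for plus<> on small integers.
template <typename T>
using plus_result_t = decltype(std::declval<T>() + std::declval<T>());

// Hypothetical size predicate standing in for the "4B or greater" check.
template <typename T>
inline constexpr bool is_4b_or_greater = sizeof(T) >= 4;

using output_t = std::uint8_t;
using accum_t  = plus_result_t<output_t>;

// uint8_t + uint8_t is computed as int.
static_assert(std::is_same_v<accum_t, int>);

// Checking the output type triggers the fallback...
static_assert(!is_4b_or_greater<output_t>);

// ...while checking the accumulator type does not (int is 4 bytes on the
// platforms CUDA targets), so the atomic path would be selected.
static_assert(is_4b_or_greater<accum_t>);
```

With the accumulator type checked, a 1-byte output no longer selects the fallback, which matches the failure mode described in the comment.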

This issue already existed for Reduce() before this PR, but Sum had its own detection logic.

Thinking about this some more, it is probably okay to merge this PR, and I'll create a fix for __transform_reduce to check OutputT instead of AccumT. You can also drive-by fix it if you want, but the fix is slightly involved and I would like to add tests to verify the behavior. We basically need to check two things:

  1. OutputT == AccumT
  2. atomicAdd() is supported for that type. This is mostly handled by is_4b_or_greater, but it occurs to me now that there is no overload for long long.

I'm going to approve. Let me know if you prefer fixing it here, or I can fix it after this gets merged.
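
The two conditions listed above could be combined into a single compile-time predicate along the following lines. This is a sketch, not the planned fix: the trait names are hypothetical, and the type list is an assumption based on the commonly documented atomicAdd overloads (int, unsigned int, unsigned long long, float, double), with half-precision and vector overloads left out and signed long long excluded as noted in the comment.

```cpp
#include <type_traits>

// True if T is exactly one of Ts.
template <typename T, typename... Ts>
inline constexpr bool is_one_of_v = (std::is_same_v<T, Ts> || ...);

// Assumption: scalar types with an atomicAdd overload. Signed long long is
// deliberately absent, as pointed out in the review.
template <typename T>
inline constexpr bool has_atomic_add_v =
  is_one_of_v<T, int, unsigned int, unsigned long long, float, double>;

// Hypothetical combined predicate:
//   1. the output and accumulator types match, and
//   2. atomicAdd is available for that type.
template <typename OutputT, typename AccumT>
inline constexpr bool can_accumulate_atomically_v =
  std::is_same_v<OutputT, AccumT> && has_atomic_add_v<OutputT>;
```

Under these assumptions, `can_accumulate_atomically_v<unsigned char, int>` and `can_accumulate_atomically_v<long long, long long>` are both false, so both cases would take the run-to-run fallback instead of the atomic path.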

Projects

Status: In Review

3 participants