Refactor DeviceReduce dispatch logic#9088
Conversation
📝 WalkthroughSummary by CodeRabbit
suggestion: WalkthroughThis PR refactors determinism dispatch for DeviceReduce: reduce_impl now uses compile-time determinism branches that call ChangesDevice reduce determinism dispatch refactoring
Suggested reviewers
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 306f06d3-aefb-42db-b33e-5527015ec795
📒 Files selected for processing (1)
cub/cub/device/device_reduce.cuh
This comment has been minimized.
This comment has been minimized.
b295493 to
96c47e0
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
cub/cub/device/device_reduce.cuh (1)
114-161: important: This changes dispatch and determinism selection on aDevice*algorithm path. Please do the required before/after SASS comparison and run the CUB benchmarks before merge; otherwise codegen and throughput regressions in the new dispatch split stay unvalidated.As per coding guidelines,
**/cub/**/device*.{cu,cuh,h}: Verify no SASS code generation changes occur for Device* algorithms in CUB by comparing generated SASS output before and after changes; Run benchmark tests using the CUB Benchmarks framework when modifying Device* algorithms to verify no performance regressions occur.Also applies to: 173-240, 244-270, 549-551
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6fde9ca1-11b8-4a40-aa81-d6719380fcf5
📒 Files selected for processing (1)
cub/cub/device/device_reduce.cuh
🥳 CI Workflow Results🟩 Finished in 2h 33m: Pass: 100%/285 | Total: 11d 04h | Max: 2h 32m | Hits: 21%/914257See results here. |
tpn
left a comment
There was a problem hiding this comment.
Reviewed the DeviceReduce dispatch refactor. No blocking issues from me; the CI summary is green.
| return reduce_impl( | ||
| d_in, d_out, num_items, ::cuda::std::plus<>{}, ::cuda::std::identity{}, InitT{}, determinism_t{}, env); | ||
| using accum_t = ::cuda::std::__accumulator_t<::cuda::std::plus<>, cub::detail::it_value_t<InputIteratorT>, OutputT>; | ||
| return __transform_reduce<accum_t>( |
There was a problem hiding this comment.
Critical: this breaks Sum(uint8_t*, uint8_t*, ...) so that it doesn't compile anymore. Prior to this change, we fell back to run to run determinism for output types < 4 bytes. But __transform_reduce checks AccumT instead of OutputT. For uint8_t, AccumT is int due to integer promotion, so the kernel will attempt to do atomicAdd(unsigned char*, int) which doesn't compile
There was a problem hiding this comment.
This issue already existed for Reduce() before this PR, but sum had its own detection logic
Thinking about this some more, it is probably okay to merge this PR and I'll create a fix for __transform_reduce to check OutputT instead of AccumT, or you can driveby fix it if you want, but the fix is slightly involved and I would like to add tests to verify the behavior. We basically need to check two things:
OutputT == AccumTatomicAdd()is supported for that type. This is mostly handled byis_4b_or_greaterbut it occurs to me now that there is no overload forlong long.
I'm going to approve, let me know if you prefer fixing it here or I can fix it after this gets merged
No description provided.