optimize blockReduceWarp by charan-003 · Pull Request #9101 · NVIDIA/cccl

charan-003 · 2026-05-21T20:05:32Z

Closes #5167

Replace sequential aggregation with cooperative warp reduction

copy-pr-bot · 2026-05-21T20:05:36Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-21T20:08:41Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

suggestion:

Walkthrough

Refactors BlockReduce warp-aggregate application so warp 0 deterministically cooperates to reduce per-warp aggregates (reduce_op_sync fast path, shuffle fallback), removes the atomic non-deterministic path, consolidates Sum/Reduce, and adds NVBench benchmark plus CI bench filter/GPU updates.

Changes

Warp-0 Cooperative Reduction

Layer / File(s)	Summary
Dependencies and documentation updates `cub/cub/block/specializations/block_reduce_warp_reductions.cuh`	Header includes reordered to add thread-operator and target utilities; comment updated to state warp 0 cooperatively reduces per-warp aggregates.
ApplyWarpAggregates warp-0 cooperative reduction `cub/cub/block/specializations/block_reduce_warp_reductions.cuh`	ApplyWarpAggregates reimplemented: compute effective num_warps, warp 0 loads each warp's aggregate per lane, then reduces via target-gated `reduce_op_sync` (fast path for eligible operator/type on supported SMs when FullTile) or `WarpReduceShfl` fallback.
Sum and Reduce entry point consolidation `cub/cub/block/specializations/block_reduce_warp_reductions.cuh`	Sum and Reduce now unconditionally invoke `ApplyWarpAggregates<FullTile>`, removing the IsDeterministic compile-time branch and the atomic-backed non-deterministic implementation.

Benchmarks and CI

Layer / File(s)	Summary
NVBench base benchmark `cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh`	Adds device functor `benchmark_op_t<BlockThreads>`, benchmark launcher `block_reduce_warp_reductions`, occupancy-based grid sizing with zero-grid guard, and NVBench registration.
Sum benchmark TU `cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu`	Adds `value_types` list and `op_t = ::cuda::std::plus<>`, includes the shared base implementation.
CI benchmark filters & GPU list `ci/bench.yaml`	Adds two reduce benchmark filter regexes for new benchmarks and enables `rtxa6000` in the CI GPU selection list.

Assessment against linked issues

Objective	Addressed	Explanation
Replace sequential per-warp aggregation with low-latency warp reduction (`#5167`)	✅
Use `__reduce_sync_op` or equivalent on supported SM targets (`#5167`)	✅
Reduce compile-time overhead from recursive `ApplyWarpAggregates` calls (`#5167`)	❓	The IsDeterministic branch and one non-deterministic function were removed, but the change retains templated FullTile/type/operator paths; net compile-time instantiation reduction is unclear from diffs provided.

Out-of-scope changes

Code Change	Explanation
Add NVBench benchmark base (`cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh`, lines 1-54)	Benchmark/test artifact; not required by the BlockReduce optimization objective.
Add benchmark TU for sum (`cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu`, lines 1-35)	Benchmark wiring and type-list instantiation are testing artifacts, not part of core optimization.
Update CI bench filters and GPU list (`ci/bench.yaml`, lines 35-36 and 49-53)	CI configuration updates are operational/test routing changes unrelated to BlockReduce implementation goals.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

fbusato · 2026-05-21T21:25:44Z

@charan-003, thanks a lot for the contribution.

Even before starting the review, I would like to see if there are actual performance benefits for this approach.
I would also encourage you to write a device-side benchmark for it, similarly to WarpReduce

charan-003 · 2026-05-21T21:28:07Z

@charan-003, thanks a lot for the contribution.

Even before starting the review, I would like to see if there are actual performance benefits for this approach. I would also encourage you to write a device-side benchmark for it, similarly to WarpReduce

sure, working on it

coderabbitai

🧹 Nitpick comments (2)

cub/cub/block/specializations/block_reduce_warp_reductions.cuh (2)
125-125: 💤 Low value

suggestion: Use static_cast<int>(warps) instead of C-style cast int(warps) for consistency with the static_cast used in the same expression.
-      const int num_warps = FullTile ? int(warps) : static_cast<int>(::cuda::ceil_div(num_valid, logical_warp_size));
+      const int num_warps = FullTile ? static_cast<int>(warps) : static_cast<int>(::cuda::ceil_div(num_valid, logical_warp_size));
148-148: 💤 Low value

suggestion: Per coding guidelines, variables not modified should be const. dummy_storage is only passed to the constructor.
-      NullType dummy_storage;
+      const NullType dummy_storage{};

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ffc63f63-2430-4ac6-8a11-ae99c579665d

📥 Commits

Reviewing files that changed from the base of the PR and between 0b48748 and cb6dd6c.

📒 Files selected for processing (1)

cub/cub/block/specializations/block_reduce_warp_reductions.cuh

fbusato · 2026-05-21T23:21:08Z

@charan-003 thanks for reporting them but I think a better approach would be:

Add the code for the device-side benchmark.
Modify cccl/ci/bench.yaml to select the GPUs where to run the benchmarks.
Add a commit with the message [bench-only] to run them.
Report only the comparison results, e.g. SLOW, FAST, etc.

charan-003 · 2026-05-22T00:56:02Z

@charan-003 thanks for reporting them but I think a better approach would be:

Add the code for the device-side benchmark.

Modify cccl/ci/bench.yaml to select the GPUs where to run the benchmarks.

Add a commit with the message [bench-only] to run them.

Report only the comparison results, e.g. SLOW, FAST, etc.

Thanks a lot for the guidance! I didn't know the exact process...
Let me add device-side benchmark code

coderabbitai

Actionable comments posted: 3

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 53e3ccfa-d3a5-484a-85fd-7cf18e522152

📥 Commits

Reviewing files that changed from the base of the PR and between cb6dd6c and 830c2b4.

📒 Files selected for processing (3)

ci/bench.yaml
cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu

✅ Files skipped from review due to trivial changes (1)

cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu

coderabbitai · 2026-05-22T01:16:22Z

+template <int BlockThreads>
+struct benchmark_op_t
+{
+  template <typename T>


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

suggestion: Line 17 and Line 35 use a single-letter template parameter (T). Use _Tp-style naming here to satisfy the template-parameter naming rule for this codebase.

As per coding guidelines "Avoid single letter names for template parameters; use _Tp instead of _T".

Also applies to: 35-35

coderabbitai · 2026-05-22T01:16:22Z

+  __device__ __forceinline__ T operator()(T thread_data) const
+  {


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Read-only verification: confirm missing CCCL API annotations / noexcept in this file rg -nP '^\s*(template\s*<[^>]+>\s*)?(?!.*_CCCL_(HOST_API|DEVICE_API|API)).*\b(operator|block_reduce_warp_reductions)\b' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh rg -nP '\b(operator|block_reduce_warp_reductions)\b(?!.*noexcept)' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 391

🏁 Script executed:

#!/bin/bash set -euo pipefail sed -n '1,120p' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh | nl -ba | sed -n '10,80p'

Repository: NVIDIA/cccl

Length of output: 98

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "operator|block_reduce_warp_reductions" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 289

🏁 Script executed:

#!/bin/bash set -euo pipefail # Capture the full function signatures/headers for the two targets (best-effort) perl -0777 -ne ' my $f=$_; if($f =~ /(operator\s*const[^\n]*\n\s*\{)/s){ print "FOUND operator():\n$1\n"; } if($f =~ /(block_reduce_warp_reductions[^\n]*\n\s*\{)/s){ print "FOUND block_reduce_warp_reductions():\n$1\n"; } ' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 170

🏁 Script executed:

#!/bin/bash set -euo pipefail # Show the relevant top portion with line numbers (no nl dependency) awk '{printf "%5d:%s\n", NR, $0}' cub/benchmarks/bench/reduce/block_reduce_warp_warp_reductions_base.cuh | sed -n '1,110p'

Repository: NVIDIA/cccl

Length of output: 196

🏁 Script executed:

#!/bin/bash set -euo pipefail # Check for noexcept / CCCL API tokens near the declarations rg -n "operator|block_reduce_warp_reductions" -n cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh rg -n --context=3 "operator|block_reduce_warp_reductions" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh rg -n --context=0 "_CCCL_(HOST_API|DEVICE_API|API)" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 1049

🏁 Script executed:

#!/bin/bash set -euo pipefail # Look for exception-throwing in the relevant functions/bodies rg -n "\bthrow\b" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 37

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh" # Print a small, line-numbered window around operator() and block_reduce_warp_reductions sed -n '10,60p' "$FILE" | awk '{printf "%5d:%s\n", NR+10-1, $0}'

Repository: NVIDIA/cccl

Length of output: 2048

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh" # Check for _CCCL_* usage and noexcept in/near the relevant symbols rg -n "_CCCL_(HOST_API|DEVICE_API|API)" "$FILE" rg -n "noexcept" "$FILE" rg -n --context=2 "operator" "$FILE" rg -n --context=2 "block_reduce_warp_reductions" "$FILE"

Repository: NVIDIA/cccl

Length of output: 37

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh" # Look for exception throwing inside this file rg -n "\bthrow\b" "$FILE"

Repository: NVIDIA/cccl

Length of output: 37

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh" echo "== _CCCL_* tokens ==" rg -n "_CCCL_(HOST_API|DEVICE_API|API)" "$FILE" || true echo "== noexcept tokens ==" rg -n "noexcept" "$FILE" || true echo "== throw tokens ==" rg -n "\bthrow\b" "$FILE" || true

Repository: NVIDIA/cccl

Length of output: 117

important: Add required _CCCL_* API annotations and noexcept to benchmark functions.

benchmark_op_t::operator() (line 18) lacks _CCCL_DEVICE_API and noexcept.

block_reduce_warp_reductions (line 36) lacks _CCCL_HOST_API and noexcept.

Suggested change

- __device__ __forceinline__ T operator()(T thread_data) const + _CCCL_DEVICE_API __forceinline__ auto operator()(_Tp thread_data) const noexcept -> _Tp

-template <typename T> -void block_reduce_warp_reductions(nvbench::state& state, nvbench::type_list<T>) +template <typename _Tp> +_CCCL_HOST_API void block_reduce_warp_reductions(nvbench::state& state, nvbench::type_list<_Tp>) noexcept

coderabbitai

Actionable comments posted: 1

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 519c38f3-5544-4f5a-9750-fc04a66589e0

📥 Commits

Reviewing files that changed from the base of the PR and between 830c2b4 and a949e20.

📒 Files selected for processing (3)

ci/bench.yaml
cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu

✅ Files skipped from review due to trivial changes (1)

ci/bench.yaml

fbusato · 2026-05-22T01:32:21Z

/ok to test 80b5ac2

github-actions · 2026-05-22T01:39:39Z

❌ Benchmark Results

Benchmark comparison had failures.

Results
Artifacts

charan-003 · 2026-05-22T02:17:13Z

@fbusato the benchmarks shows a slowdown. can we use bit_ceil(warps) for the logical warp size. Does this approach look reasonable, or would you suggest a different direction?

fbusato · 2026-05-22T17:12:00Z

The code looks good and I really like the simplifications. Related to performance, some high-level thoughts:

It is not worthwhile to use Redux/Warp-Shuffle Reduction without a full mask because it falls back to a slower implementation. You can use the identity element in some cases: https://nvidia.github.io/cccl/unstable/libcudacxx/extended_api/functional/operator_properties.html#identity-element
Warp-shuffle reduction could be useful only with large thread blocks. The mental model is that warp-shuffle is faster than shared memory but only marginally. bit_ceil(warps) for the logical warp size is a good idea
I would dispatch between the new and old implementation depending on the above conditions.

charan-003 requested a review from a team as a code owner May 21, 2026 20:05

charan-003 requested a review from NaderAlAwar May 21, 2026 20:05

github-project-automation Bot added this to CCCL May 21, 2026

github-project-automation Bot moved this to Todo in CCCL May 21, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 21, 2026

optimize blockReduceWarp

cb6dd6c

charan-003 force-pushed the optimize_blockreduce branch from 0b48748 to cb6dd6c Compare May 21, 2026 22:10

coderabbitai Bot reviewed May 21, 2026

View reviewed changes

charan-003 requested review from a team as code owners May 22, 2026 01:11

charan-003 requested a review from wmaxey May 22, 2026 01:11

[bench-only]

a949e20

charan-003 force-pushed the optimize_blockreduce branch from 830c2b4 to a949e20 Compare May 22, 2026 01:13

coderabbitai Bot reviewed May 22, 2026

View reviewed changes

Comment thread cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

[bench-only]

80b5ac2

[bench-only]

dae2e4a

		__device__ __forceinline__ T operator()(T thread_data) const
		{

Conversation

charan-003 commented May 21, 2026

Uh oh!

copy-pr-bot Bot commented May 21, 2026

Uh oh!

coderabbitai Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Assessment against linked issues

Out-of-scope changes

Uh oh!

fbusato commented May 21, 2026

Uh oh!

charan-003 commented May 21, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

fbusato commented May 21, 2026

Uh oh!

charan-003 commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

fbusato commented May 22, 2026

Uh oh!

github-actions Bot commented May 22, 2026

❌ Benchmark Results

Uh oh!

charan-003 commented May 22, 2026

Uh oh!

fbusato commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented May 21, 2026 •

edited

Loading

charan-003 commented May 22, 2026 •

edited

Loading