Skip to content

optimize blockReduceWarp#9101

Open
charan-003 wants to merge 4 commits into
NVIDIA:mainfrom
charan-003:optimize_blockreduce
Open

optimize blockReduceWarp#9101
charan-003 wants to merge 4 commits into
NVIDIA:mainfrom
charan-003:optimize_blockreduce

Conversation

@charan-003
Copy link
Copy Markdown
Contributor

Closes #5167

Replace sequential aggregation with cooperative warp reduction

@charan-003 charan-003 requested a review from a team as a code owner May 21, 2026 20:05
@charan-003 charan-003 requested a review from NaderAlAwar May 21, 2026 20:05
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 21, 2026
@copy-pr-bot
Copy link
Copy Markdown
Contributor

copy-pr-bot Bot commented May 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 21, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 21, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

suggestion:

Walkthrough

Refactors BlockReduce warp-aggregate application so warp 0 deterministically cooperates to reduce per-warp aggregates (reduce_op_sync fast path, shuffle fallback), removes the atomic non-deterministic path, consolidates Sum/Reduce, and adds NVBench benchmark plus CI bench filter/GPU updates.

Changes

Warp-0 Cooperative Reduction

Layer / File(s) Summary
Dependencies and documentation updates
cub/cub/block/specializations/block_reduce_warp_reductions.cuh
Header includes reordered to add thread-operator and target utilities; comment updated to state warp 0 cooperatively reduces per-warp aggregates.
ApplyWarpAggregates warp-0 cooperative reduction
cub/cub/block/specializations/block_reduce_warp_reductions.cuh
ApplyWarpAggregates reimplemented: compute effective num_warps, warp 0 loads each warp's aggregate per lane, then reduces via target-gated reduce_op_sync (fast path for eligible operator/type on supported SMs when FullTile) or WarpReduceShfl fallback.
Sum and Reduce entry point consolidation
cub/cub/block/specializations/block_reduce_warp_reductions.cuh
Sum and Reduce now unconditionally invoke ApplyWarpAggregates<FullTile>, removing the IsDeterministic compile-time branch and the atomic-backed non-deterministic implementation.

Benchmarks and CI

Layer / File(s) Summary
NVBench base benchmark
cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
Adds device functor benchmark_op_t<BlockThreads>, benchmark launcher block_reduce_warp_reductions, occupancy-based grid sizing with zero-grid guard, and NVBench registration.
Sum benchmark TU
cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu
Adds value_types list and op_t = ::cuda::std::plus<>, includes the shared base implementation.
CI benchmark filters & GPU list
ci/bench.yaml
Adds two reduce benchmark filter regexes for new benchmarks and enables rtxa6000 in the CI GPU selection list.

Assessment against linked issues

Objective Addressed Explanation
Replace sequential per-warp aggregation with low-latency warp reduction (#5167)
Use __reduce_sync_op or equivalent on supported SM targets (#5167)
Reduce compile-time overhead from recursive ApplyWarpAggregates calls (#5167) The IsDeterministic branch and one non-deterministic function were removed, but the change retains templated FullTile/type/operator paths; net compile-time instantiation reduction is unclear from diffs provided.

Out-of-scope changes

Code Change Explanation
Add NVBench benchmark base (cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh, lines 1-54) Benchmark/test artifact; not required by the BlockReduce optimization objective.
Add benchmark TU for sum (cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu, lines 1-35) Benchmark wiring and type-list instantiation are testing artifacts, not part of core optimization.
Update CI bench filters and GPU list (ci/bench.yaml, lines 35-36 and 49-53) CI configuration updates are operational/test routing changes unrelated to BlockReduce implementation goals.

Comment @coderabbitai help to get the list of available commands and usage tips.

@fbusato
Copy link
Copy Markdown
Contributor

fbusato commented May 21, 2026

@charan-003, thanks a lot for the contribution.

Even before starting the review, I would like to see if there are actual performance benefits for this approach.
I would also encourage you to write a device-side benchmark for it, similarly to WarpReduce

@charan-003
Copy link
Copy Markdown
Contributor Author

@charan-003, thanks a lot for the contribution.

Even before starting the review, I would like to see if there are actual performance benefits for this approach. I would also encourage you to write a device-side benchmark for it, similarly to WarpReduce

sure, working on it

@charan-003 charan-003 force-pushed the optimize_blockreduce branch from 0b48748 to cb6dd6c Compare May 21, 2026 22:10
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
cub/cub/block/specializations/block_reduce_warp_reductions.cuh (2)

125-125: 💤 Low value

suggestion: Use static_cast<int>(warps) instead of C-style cast int(warps) for consistency with the static_cast used in the same expression.

-      const int num_warps = FullTile ? int(warps) : static_cast<int>(::cuda::ceil_div(num_valid, logical_warp_size));
+      const int num_warps = FullTile ? static_cast<int>(warps) : static_cast<int>(::cuda::ceil_div(num_valid, logical_warp_size));

148-148: 💤 Low value

suggestion: Per coding guidelines, variables not modified should be const. dummy_storage is only passed to the constructor.

-      NullType dummy_storage;
+      const NullType dummy_storage{};

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ffc63f63-2430-4ac6-8a11-ae99c579665d

📥 Commits

Reviewing files that changed from the base of the PR and between 0b48748 and cb6dd6c.

📒 Files selected for processing (1)
  • cub/cub/block/specializations/block_reduce_warp_reductions.cuh

@fbusato
Copy link
Copy Markdown
Contributor

fbusato commented May 21, 2026

@charan-003 thanks for reporting them but I think a better approach would be:

  • Add the code for the device-side benchmark.
  • Modify cccl/ci/bench.yaml to select the GPUs where to run the benchmarks.
  • Add a commit with the message [bench-only] to run them.
  • Report only the comparison results, e.g. SLOW, FAST, etc.

@charan-003
Copy link
Copy Markdown
Contributor Author

charan-003 commented May 22, 2026

@charan-003 thanks for reporting them but I think a better approach would be:

  • Add the code for the device-side benchmark.
  • Modify cccl/ci/bench.yaml to select the GPUs where to run the benchmarks.
  • Add a commit with the message [bench-only] to run them.
  • Report only the comparison results, e.g. SLOW, FAST, etc.

Thanks a lot for the guidance! I didn't know the exact process...
Let me add device-side benchmark code

@charan-003 charan-003 requested review from a team as code owners May 22, 2026 01:11
@charan-003 charan-003 requested a review from wmaxey May 22, 2026 01:11
@charan-003 charan-003 force-pushed the optimize_blockreduce branch from 830c2b4 to a949e20 Compare May 22, 2026 01:13
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 53e3ccfa-d3a5-484a-85fd-7cf18e522152

📥 Commits

Reviewing files that changed from the base of the PR and between cb6dd6c and 830c2b4.

📒 Files selected for processing (3)
  • ci/bench.yaml
  • cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
  • cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu
✅ Files skipped from review due to trivial changes (1)
  • cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu

Comment thread ci/bench.yaml
template <int BlockThreads>
struct benchmark_op_t
{
template <typename T>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

suggestion: Line 17 and Line 35 use a single-letter template parameter (T). Use _Tp-style naming here to satisfy the template-parameter naming rule for this codebase.

As per coding guidelines "Avoid single letter names for template parameters; use _Tp instead of _T".

Also applies to: 35-35

Comment on lines +18 to +19
__device__ __forceinline__ T operator()(T thread_data) const
{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Read-only verification: confirm missing CCCL API annotations / noexcept in this file
rg -nP '^\s*(template\s*<[^>]+>\s*)?(?!.*_CCCL_(HOST_API|DEVICE_API|API)).*\b(operator\(\)|block_reduce_warp_reductions)\b' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
rg -nP '\b(operator\(\)|block_reduce_warp_reductions)\b(?!.*noexcept)' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 391


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '1,120p' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh | nl -ba | sed -n '10,80p'

Repository: NVIDIA/cccl

Length of output: 98


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "operator\(\)|block_reduce_warp_reductions" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 289


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Capture the full function signatures/headers for the two targets (best-effort)
perl -0777 -ne '
my $f=$_; 
if($f =~ /(operator\(\)\s*const[^\n]*\n\s*\{)/s){ print "FOUND operator():\n$1\n"; }
if($f =~ /(block_reduce_warp_reductions[^\n]*\n\s*\{)/s){ print "FOUND block_reduce_warp_reductions():\n$1\n"; }
' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 170


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show the relevant top portion with line numbers (no nl dependency)
awk '{printf "%5d:%s\n", NR, $0}' cub/benchmarks/bench/reduce/block_reduce_warp_warp_reductions_base.cuh | sed -n '1,110p'

Repository: NVIDIA/cccl

Length of output: 196


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check for noexcept / CCCL API tokens near the declarations
rg -n "operator\(\)|block_reduce_warp_reductions" -n cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
rg -n --context=3 "operator\(\)|block_reduce_warp_reductions" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
rg -n --context=0 "_CCCL_(HOST_API|DEVICE_API|API)" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 1049


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Look for exception-throwing in the relevant functions/bodies
rg -n "\bthrow\b" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh

Repository: NVIDIA/cccl

Length of output: 37


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh"
# Print a small, line-numbered window around operator() and block_reduce_warp_reductions
sed -n '10,60p' "$FILE" | awk '{printf "%5d:%s\n", NR+10-1, $0}'

Repository: NVIDIA/cccl

Length of output: 2048


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh"
# Check for _CCCL_* usage and noexcept in/near the relevant symbols
rg -n "_CCCL_(HOST_API|DEVICE_API|API)" "$FILE"
rg -n "noexcept" "$FILE"
rg -n --context=2 "operator\(\)" "$FILE"
rg -n --context=2 "block_reduce_warp_reductions" "$FILE"

Repository: NVIDIA/cccl

Length of output: 37


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh"
# Look for exception throwing inside this file
rg -n "\bthrow\b" "$FILE"

Repository: NVIDIA/cccl

Length of output: 37


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh"

echo "== _CCCL_* tokens =="
rg -n "_CCCL_(HOST_API|DEVICE_API|API)" "$FILE" || true

echo "== noexcept tokens =="
rg -n "noexcept" "$FILE" || true

echo "== throw tokens =="
rg -n "\bthrow\b" "$FILE" || true

Repository: NVIDIA/cccl

Length of output: 117


important: Add required _CCCL_* API annotations and noexcept to benchmark functions.

  • benchmark_op_t::operator() (line 18) lacks _CCCL_DEVICE_API and noexcept.
  • block_reduce_warp_reductions (line 36) lacks _CCCL_HOST_API and noexcept.
Suggested change
-  __device__ __forceinline__ T operator()(T thread_data) const
+  _CCCL_DEVICE_API __forceinline__ auto operator()(_Tp thread_data) const noexcept -> _Tp
-template <typename T>
-void block_reduce_warp_reductions(nvbench::state& state, nvbench::type_list<T>)
+template <typename _Tp>
+_CCCL_HOST_API void block_reduce_warp_reductions(nvbench::state& state, nvbench::type_list<_Tp>) noexcept

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 519c38f3-5544-4f5a-9750-fc04a66589e0

📥 Commits

Reviewing files that changed from the base of the PR and between 830c2b4 and a949e20.

📒 Files selected for processing (3)
  • ci/bench.yaml
  • cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
  • cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu
✅ Files skipped from review due to trivial changes (1)
  • ci/bench.yaml

Comment thread cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
@fbusato
Copy link
Copy Markdown
Contributor

fbusato commented May 22, 2026

/ok to test 80b5ac2

@github-actions
Copy link
Copy Markdown
Contributor

❌ Benchmark Results

Benchmark comparison had failures.

Results
Artifacts

@charan-003
Copy link
Copy Markdown
Contributor Author

@fbusato the benchmarks shows a slowdown. can we use bit_ceil(warps) for the logical warp size. Does this approach look reasonable, or would you suggest a different direction?

@fbusato
Copy link
Copy Markdown
Contributor

fbusato commented May 22, 2026

The code looks good and I really like the simplifications. Related to performance, some high-level thoughts:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[FEA]: Optimize BlockReduce

2 participants