optimize blockReduceWarp#9101
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 Walkthroughsuggestion: WalkthroughRefactors BlockReduce warp-aggregate application so warp 0 deterministically cooperates to reduce per-warp aggregates (reduce_op_sync fast path, shuffle fallback), removes the atomic non-deterministic path, consolidates Sum/Reduce, and adds NVBench benchmark plus CI bench filter/GPU updates. ChangesWarp-0 Cooperative Reduction
Benchmarks and CI
Assessment against linked issues
Out-of-scope changes
Comment |
|
@charan-003, thanks a lot for the contribution. Even before starting the review, I would like to see if there are actual performance benefits for this approach. |
sure, working on it |
0b48748 to
cb6dd6c
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (2)
cub/cub/block/specializations/block_reduce_warp_reductions.cuh (2)
125-125: 💤 Low valuesuggestion: Use
static_cast<int>(warps)instead of C-style castint(warps)for consistency with thestatic_castused in the same expression.- const int num_warps = FullTile ? int(warps) : static_cast<int>(::cuda::ceil_div(num_valid, logical_warp_size)); + const int num_warps = FullTile ? static_cast<int>(warps) : static_cast<int>(::cuda::ceil_div(num_valid, logical_warp_size));
148-148: 💤 Low valuesuggestion: Per coding guidelines, variables not modified should be
const.dummy_storageis only passed to the constructor.- NullType dummy_storage; + const NullType dummy_storage{};
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ffc63f63-2430-4ac6-8a11-ae99c579665d
📒 Files selected for processing (1)
cub/cub/block/specializations/block_reduce_warp_reductions.cuh
|
@charan-003 thanks for reporting them but I think a better approach would be:
|
Thanks a lot for the guidance! I didn't know the exact process... |
830c2b4 to
a949e20
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 53e3ccfa-d3a5-484a-85fd-7cf18e522152
📒 Files selected for processing (3)
ci/bench.yamlcub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuhcub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu
✅ Files skipped from review due to trivial changes (1)
- cub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu
| template <int BlockThreads> | ||
| struct benchmark_op_t | ||
| { | ||
| template <typename T> |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
suggestion: Line 17 and Line 35 use a single-letter template parameter (T). Use _Tp-style naming here to satisfy the template-parameter naming rule for this codebase.
As per coding guidelines "Avoid single letter names for template parameters; use _Tp instead of _T".
Also applies to: 35-35
| __device__ __forceinline__ T operator()(T thread_data) const | ||
| { |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Read-only verification: confirm missing CCCL API annotations / noexcept in this file
rg -nP '^\s*(template\s*<[^>]+>\s*)?(?!.*_CCCL_(HOST_API|DEVICE_API|API)).*\b(operator\(\)|block_reduce_warp_reductions)\b' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
rg -nP '\b(operator\(\)|block_reduce_warp_reductions)\b(?!.*noexcept)' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuhRepository: NVIDIA/cccl
Length of output: 391
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '1,120p' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh | nl -ba | sed -n '10,80p'Repository: NVIDIA/cccl
Length of output: 98
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "operator\(\)|block_reduce_warp_reductions" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuhRepository: NVIDIA/cccl
Length of output: 289
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Capture the full function signatures/headers for the two targets (best-effort)
perl -0777 -ne '
my $f=$_;
if($f =~ /(operator\(\)\s*const[^\n]*\n\s*\{)/s){ print "FOUND operator():\n$1\n"; }
if($f =~ /(block_reduce_warp_reductions[^\n]*\n\s*\{)/s){ print "FOUND block_reduce_warp_reductions():\n$1\n"; }
' cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuhRepository: NVIDIA/cccl
Length of output: 170
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show the relevant top portion with line numbers (no nl dependency)
awk '{printf "%5d:%s\n", NR, $0}' cub/benchmarks/bench/reduce/block_reduce_warp_warp_reductions_base.cuh | sed -n '1,110p'Repository: NVIDIA/cccl
Length of output: 196
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check for noexcept / CCCL API tokens near the declarations
rg -n "operator\(\)|block_reduce_warp_reductions" -n cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
rg -n --context=3 "operator\(\)|block_reduce_warp_reductions" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh
rg -n --context=0 "_CCCL_(HOST_API|DEVICE_API|API)" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuhRepository: NVIDIA/cccl
Length of output: 1049
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Look for exception-throwing in the relevant functions/bodies
rg -n "\bthrow\b" cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuhRepository: NVIDIA/cccl
Length of output: 37
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh"
# Print a small, line-numbered window around operator() and block_reduce_warp_reductions
sed -n '10,60p' "$FILE" | awk '{printf "%5d:%s\n", NR+10-1, $0}'Repository: NVIDIA/cccl
Length of output: 2048
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh"
# Check for _CCCL_* usage and noexcept in/near the relevant symbols
rg -n "_CCCL_(HOST_API|DEVICE_API|API)" "$FILE"
rg -n "noexcept" "$FILE"
rg -n --context=2 "operator\(\)" "$FILE"
rg -n --context=2 "block_reduce_warp_reductions" "$FILE"Repository: NVIDIA/cccl
Length of output: 37
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh"
# Look for exception throwing inside this file
rg -n "\bthrow\b" "$FILE"Repository: NVIDIA/cccl
Length of output: 37
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="cub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuh"
echo "== _CCCL_* tokens =="
rg -n "_CCCL_(HOST_API|DEVICE_API|API)" "$FILE" || true
echo "== noexcept tokens =="
rg -n "noexcept" "$FILE" || true
echo "== throw tokens =="
rg -n "\bthrow\b" "$FILE" || trueRepository: NVIDIA/cccl
Length of output: 117
important: Add required _CCCL_* API annotations and noexcept to benchmark functions.
benchmark_op_t::operator()(line 18) lacks_CCCL_DEVICE_APIandnoexcept.block_reduce_warp_reductions(line 36) lacks_CCCL_HOST_APIandnoexcept.
Suggested change
- __device__ __forceinline__ T operator()(T thread_data) const
+ _CCCL_DEVICE_API __forceinline__ auto operator()(_Tp thread_data) const noexcept -> _Tp-template <typename T>
-void block_reduce_warp_reductions(nvbench::state& state, nvbench::type_list<T>)
+template <typename _Tp>
+_CCCL_HOST_API void block_reduce_warp_reductions(nvbench::state& state, nvbench::type_list<_Tp>) noexceptThere was a problem hiding this comment.
Actionable comments posted: 1
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 519c38f3-5544-4f5a-9750-fc04a66589e0
📒 Files selected for processing (3)
ci/bench.yamlcub/benchmarks/bench/reduce/block_reduce_warp_reductions_base.cuhcub/benchmarks/bench/reduce/block_reduce_warp_reductions_sum.cu
✅ Files skipped from review due to trivial changes (1)
- ci/bench.yaml
|
/ok to test 80b5ac2 |
|
@fbusato the benchmarks shows a slowdown. can we use bit_ceil(warps) for the logical warp size. Does this approach look reasonable, or would you suggest a different direction? |
|
The code looks good and I really like the simplifications. Related to performance, some high-level thoughts:
|
Closes #5167
Replace sequential aggregation with cooperative warp reduction