
Expose max_segment_size guarantee in cuda.compute #8284

Open
shwina wants to merge 1 commit into NVIDIA:main from shwina:expose-segmented-reduce-max-segment-size

Conversation

@shwina
Contributor

@shwina shwina commented Apr 2, 2026

Description

Closes #8277.

Adds an optional max_segment_size parameter to segmented_reduce, which the implementation can use to dispatch to the optimal kernel for the given segment size.

An informal benchmark shows the speedup from specifying max_segment_size (code). As expected, it only makes a real difference for small and medium segment sizes; large segments already benefit from the default (block per segment):

  seg_size      num_segs    no hint (ms)     hint (ms)   speedup
--------------------------------------------------------------------
         1    67,108,864           73.03          1.85    39.44x
         4    16,777,216           18.16          0.70    25.97x
        16     4,194,304            4.57          0.41    11.05x
        64     1,048,576            1.16          0.33     3.47x
       256       262,144            0.35          0.32     1.09x
     1,024        65,536            0.31          0.31     1.00x
     4,096        16,384            0.31          0.31     1.00x
    16,384         4,096            0.31          0.31     1.00x
    65,536         1,024            0.31          0.31     1.00x

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-project-automation github-project-automation bot moved this to Todo in CCCL Apr 2, 2026
@shwina shwina requested review from a team as code owners April 2, 2026 21:26
@shwina shwina requested a review from oleksandr-pavlyk April 2, 2026 21:26
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Apr 2, 2026
@github-actions
Contributor

github-actions bot commented Apr 2, 2026

😬 CI Workflow Results

🟥 Finished in 1h 52m: Pass: 82%/58 | Total: 17h 40m | Max: 59m 16s | Hits: 98%/316

See results here.

op: Operator,
h_init: np.ndarray | GpuStruct,
num_segments: int,
max_segment_size: int | None = None,
Contributor
Question: Did you consider making this a kwarg, like we did for determinism in reduce? My worry with this approach is that the user would have to explicitly set max_segment_size to None if they want to pass a stream.
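The concern can be illustrated with two toy signatures. Both functions below are invented for the example (parameter order and names are hypothetical, not the actual cuda.compute API); they only demonstrate positional vs. keyword-only placement of the hint.

```python
# Hypothetical signatures illustrating the reviewer's concern; the
# parameter order is invented for this example and is not the actual
# cuda.compute segmented_reduce API.

def reduce_positional(data, op, init, max_segment_size=None, stream=None):
    # With a plain positional parameter, a caller who only wants to pass
    # a stream positionally must spell out the None placeholder:
    #   reduce_positional(d, op, 0, None, my_stream)
    return (max_segment_size, stream)

def reduce_kwonly(data, op, init, *, max_segment_size=None, stream=None):
    # With keyword-only parameters, each option is supplied independently:
    #   reduce_kwonly(d, op, 0, stream=my_stream)
    return (max_segment_size, stream)
```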

cccl_iterator_t end_offset_in,
cccl_op_t op,
cccl_value_t init,
size_t max_segment_size,
Contributor
Important: please add cccl.c tests that exercise the new max_segment_size paths.


Notes:
- Implements three sub-benchmarks: small, medium, large (by SegmentSize)
- The C++ equivalent uses DispatchFixedSizeSegmentedReduce; the Python API uses
Contributor
Important: I would remove this benchmark because it is meant to showcase the fixed-size segmented reduce, which we do not expose. We want to show parity with C++ performance, and at present this benchmark will likely not do that.

Contributor
Important: we should add this benchmark to run_benchmarks.py and quick_configs.yaml

import cupy as cp
import numpy as np
from utils import (
FUNDAMENTAL_TYPES as TYPE_MAP,
Contributor
Important: the C++ benchmark only uses int32/64 and float32/64, so we should match the exact types used there instead of all of them. You can see how we do this in nondeterministic.py:

TYPE_MAP = {k: _ALL_TYPES[k] for k in ("I32", "I64", "F32", "F64")}


Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[cuda.compute]: Update segmented_reduce to accept max_segment_size argument

2 participants