
Expose max_segment_size guarantee in cuda.compute #8284

Open
shwina wants to merge 1 commit into NVIDIA:main from shwina:expose-segmented-reduce-max-segment-size

Conversation

@shwina
Contributor

@shwina shwina commented Apr 2, 2026

Description

Closes #8277.

Adds an optional max_segment_size parameter to segmented_reduce, which the implementation can use to dispatch to the optimal kernel for the given segment size.

An informal benchmark shows the speedup from specifying max_segment_size (code). As expected, it only makes a real difference for small and medium segment sizes; large segments already benefit from the default (block per segment):

  seg_size      num_segs    no hint (ms)     hint (ms)   speedup
--------------------------------------------------------------------
         1    67,108,864           73.03          1.85    39.44x
         4    16,777,216           18.16          0.70    25.97x
        16     4,194,304            4.57          0.41    11.05x
        64     1,048,576            1.16          0.33     3.47x
       256       262,144            0.35          0.32     1.09x
     1,024        65,536            0.31          0.31     1.00x
     4,096        16,384            0.31          0.31     1.00x
    16,384         4,096            0.31          0.31     1.00x
    65,536         1,024            0.31          0.31     1.00x

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-project-automation github-project-automation bot moved this to Todo in CCCL Apr 2, 2026
@shwina shwina requested review from a team as code owners April 2, 2026 21:26
@shwina shwina requested a review from oleksandr-pavlyk April 2, 2026 21:26
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Apr 2, 2026
@github-actions
Contributor

github-actions bot commented Apr 2, 2026

😬 CI Workflow Results

🟥 Finished in 1h 52m: Pass: 82%/58 | Total: 17h 40m | Max: 59m 16s | Hits: 98%/316

See results here.

op: Operator,
h_init: np.ndarray | GpuStruct,
num_segments: int,
max_segment_size: int | None = None,
Contributor
Question: Did you consider making this a kwarg, like we did for determinism in reduce? My worry with this approach is that the user would have to explicitly set max_segment_size to None if they want to pass a stream.
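The concern can be illustrated with two toy signatures. Both functions below are invented for the example (parameter order and names are hypothetical, not the actual cuda.compute API); they only demonstrate positional vs. keyword-only placement of the hint.

```python
# Hypothetical signatures illustrating the reviewer's concern; the
# parameter order is invented for this example and is not the actual
# cuda.compute segmented_reduce API.

def reduce_positional(data, op, init, max_segment_size=None, stream=None):
    # With a plain positional parameter, a caller who only wants to pass
    # a stream positionally must spell out the None placeholder:
    #   reduce_positional(d, op, 0, None, my_stream)
    return (max_segment_size, stream)

def reduce_kwonly(data, op, init, *, max_segment_size=None, stream=None):
    # With keyword-only parameters, each option is supplied independently:
    #   reduce_kwonly(d, op, 0, stream=my_stream)
    return (max_segment_size, stream)
```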

cccl_iterator_t end_offset_in,
cccl_op_t op,
cccl_value_t init,
size_t max_segment_size,
Contributor
Important: please add cccl.c tests that exercise the new max_segment_size paths.


Notes:
- Implements three sub-benchmarks: small, medium, large (by SegmentSize)
- The C++ equivalent uses DispatchFixedSizeSegmentedReduce; the Python API uses
Contributor
Important: I would remove this benchmark because it is meant to showcase the fixed-size segmented reduce, which we do not expose. We want to show parity with C++ performance, and at present this benchmark will likely not do that.

Contributor
Important: we should add this benchmark to run_benchmarks.py and quick_configs.yaml

import cupy as cp
import numpy as np
from utils import (
FUNDAMENTAL_TYPES as TYPE_MAP,
Contributor
Important: the C++ benchmark only uses int32/64 and float32/64, so we should match the exact types used there instead of all of them. You can see how we do this in nondeterministic.py:

TYPE_MAP = {k: _ALL_TYPES[k] for k in ("I32", "I64", "F32", "F64")}


Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[cuda.compute]: Update segmented_reduce to accept max_segment_size argument

2 participants