Expose max_segment_size guarantee in cuda.compute #8284
shwina wants to merge 1 commit into NVIDIA:main
CI Workflow Results: Finished in 1h 52m | Pass: 82%/58 | Total: 17h 40m | Max: 59m 16s | Hits: 98%/316
    op: Operator,
    h_init: np.ndarray | GpuStruct,
    num_segments: int,
    max_segment_size: int | None = None,
Question: Did you consider making this a kwarg, like we did for determinism in reduce? My worry with this approach is that the user would have to set max_segment_size to None explicitly if they want to pass a stream
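The trade-off raised in this comment can be sketched in plain Python. Both functions below are hypothetical stand-ins, not the actual cuda.compute signatures; they only illustrate how an optional parameter placed positionally before stream differs from a keyword-only one.

```python
# Hypothetical stand-ins for illustration; NOT the real cuda.compute signatures.

def reduce_positional(num_segments, max_segment_size=None, stream=None):
    """Optional parameter inserted positionally before `stream`."""
    return {
        "num_segments": num_segments,
        "max_segment_size": max_segment_size,
        "stream": stream,
    }


def reduce_keyword_only(num_segments, stream=None, *, max_segment_size=None):
    """Optional parameter made keyword-only, so `stream` keeps its position."""
    return {
        "num_segments": num_segments,
        "max_segment_size": max_segment_size,
        "stream": stream,
    }


my_stream = object()  # placeholder for a real stream object

# Positional variant: passing the stream positionally forces an explicit None.
explicit_none = reduce_positional(4, None, my_stream)

# Forgetting the None silently binds the stream to max_segment_size.
misbound = reduce_positional(4, my_stream)

# Keyword-only variant: existing positional calls keep working unchanged,
# and the new guarantee can only be supplied by name.
unchanged_call = reduce_keyword_only(4, my_stream)
with_guarantee = reduce_keyword_only(4, my_stream, max_segment_size=32)
```

With the keyword-only form, a call such as reduce_keyword_only(4, my_stream, 32) raises a TypeError instead of silently misbinding an argument.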
    cccl_iterator_t end_offset_in,
    cccl_op_t op,
    cccl_value_t init,
    size_t max_segment_size,
Important: please add cccl.c tests that exercise the new max_segment_size paths
Notes:
- Implements three sub-benchmarks: small, medium, large (by SegmentSize)
- The C++ equivalent uses DispatchFixedSizeSegmentedReduce; the Python API uses
Important: I would remove this benchmark because it is meant to showcase the fixed-size segmented reduce, which we do not expose. We want to show parity with C++ performance, and at present this benchmark will likely not do that.
Important: we should add this benchmark to run_benchmarks.py and quick_configs.yaml
import cupy as cp
import numpy as np
from utils import (
    FUNDAMENTAL_TYPES as TYPE_MAP,
Important: the C++ benchmark only uses int32/64 and float32/64, so we should match the exact types used there instead of all of them. You can see how we do this in nondeterministic.py:
TYPE_MAP = {k: _ALL_TYPES[k] for k in ("I32", "I64", "F32", "F64")}
Description
Closes #8277.
Adds an optional max_segment_size parameter to segmented_reduce, which the implementation can use to dispatch to the optimal kernel for the given segment size.

An informal benchmark to show the speedup from specifying max_segment_size (code). As expected, it only really makes a difference for small and medium segment sizes, as large segments benefit from the default (block per segment):

Checklist