
CUB DeviceRotate #9061

Draft
mfranzrebsal wants to merge 1 commit into NVIDIA:main from mfranzrebsal:cub-device-rotate

@mfranzrebsal mfranzrebsal commented May 18, 2026

Description

This PR adds a device-wide implementation of the in-place rotate algorithm for CUB. The changes contain three implementations (short, long, and naive), since the first two (which are truly in-place) can only be used when the rotate distance satisfies some specific criteria. For more information about the algorithms, please take a look at these (NVIDIA internal) slides.
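For readers without access to the slides, the following is a minimal host-side sketch of what a rotate computes. It is not taken from this PR: the name `rotate_reference` is hypothetical, and the left-rotate direction (matching `std::rotate`) is an assumption. It is out-of-place and therefore needs a second buffer of N elements, which is what the in-place variants avoid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not the PR's kernel.
// Left rotate by k (assumed direction): out[i] = in[(i + k) % n].
template <typename T>
std::vector<T> rotate_reference(const std::vector<T>& in, std::size_t k)
{
  const std::size_t n = in.size();
  std::vector<T> out(n);
  if (n == 0)
  {
    return out; // nothing to rotate
  }
  k %= n; // a rotate by n is the identity
  for (std::size_t i = 0; i < n; ++i)
  {
    out[i] = in[(i + k) % n]; // one read and one write per element
  }
  return out;
}
```

With this convention, rotating `{1, 2, 3, 4, 5}` by 2 gives `{3, 4, 5, 1, 2}`.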

For the short algorithm, I have added two implementations: one that launches as many CTAs per SM as possible, and another that instead uses pipelining combined with a different version of the algorithm. The motivation behind the pipelined version was to try to achieve SOL on H100, but when comparing profiles (link) on H100 PCIe (attached), the pipelined version is 15% slower, while the (to me) relevant metrics compare as follows:

  1. 13% lower memory throughput.
  2. 5% fewer shared loads, 10% fewer wavefronts.
  3. No shared stores.
  4. 4KB global loads instead of 5MB.
  5. No global atomics.
  6. 70% fewer stalls:
    a. ⅓ long scoreboard.
    b. 18x less barrier.
  7. 13% more instructions.
  8. 12% more branch instructions.

I would like to understand how/why the pipelined version performs worse, since theoretically the algorithm seems superior, and the metrics mostly look better. And what optimizations would be necessary to reach SOL on H100 and even B100? I'd be very interested to discuss this! The version to use is selected by the variable constexpr bool USE_SHORT_PIPELINE.

The way the long algorithm works is also a bit unusual: when querying the scratch space, a RotateState_t struct also gets filled, and it must be passed unchanged to the following function call in order to avoid repeating the dependency graph creation. Would this be acceptable to have in CUB? Alternatively, if we allow copying the first K (= rotate distance) elements to a scratch buffer, we would no longer need that dependency graph. The downside is that the memory movement is then 2N + 2K instead of 2N, which would substantially reduce the theoretically achievable performance when K ~ N/2.
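The 2N + 2K figure can be checked with a small sequential host-side sketch of the scratch-buffer alternative. This is an illustration only, not the PR's device code: the names `Traffic` and `rotate_with_scratch` are hypothetical, a left rotate is assumed, and a parallel device version would need additional care for the overlapping shift in step 2.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical element-granularity traffic counter.
struct Traffic
{
  std::size_t reads  = 0;
  std::size_t writes = 0;
};

// Left rotate by k using a scratch buffer of k elements (sequential sketch).
template <typename T>
Traffic rotate_with_scratch(std::vector<T>& data, std::size_t k)
{
  const std::size_t n = data.size();
  assert(k <= n);
  Traffic t;
  if (k == 0 || k == n)
  {
    return t; // identity rotate, nothing moves
  }
  const auto kd = static_cast<std::ptrdiff_t>(k);

  // 1) Save the first k elements: k reads + k writes.
  std::vector<T> scratch(data.begin(), data.begin() + kd);
  t.reads += k;
  t.writes += k;

  // 2) Shift the remaining n - k elements to the front: (n - k) reads + (n - k) writes.
  std::copy(data.begin() + kd, data.end(), data.begin());
  t.reads += n - k;
  t.writes += n - k;

  // 3) Copy the saved elements to the tail: k reads + k writes.
  std::copy(scratch.begin(), scratch.end(), data.end() - kd);
  t.reads += k;
  t.writes += k;

  return t; // reads + writes == 2n + 2k
}
```

For K = N/2 the total is 3N element accesses instead of 2N, so a purely bandwidth-bound implementation of this variant could reach at most about two thirds of the throughput of a 2N variant.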

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Benchmark Results

RotatePercentage = 0 means a rotate distance of 1. Interesting here is how the performance does not really change between 1B and 8B data types, indicating that the extra work needed for the 1B type is "free".

rotate_benchmark

[0] NVIDIA H100 80GB HBM3

| T{ct} |    Bytes{io}     | RotatePercentage | NumUnalignedElems | Algorithm |  Samples |  CPU Time  |  Noise  |  GPU Time  | Noise |  Elem/s  | GlobalMem BW | BWUtil |
|-------|------------------|------------------|-------------------|-----------|---------|------------|---------|------------|-------|----------|--------------|--------|
|    U8 |     2^16 = 65536 |                0 |                 1 |     short |  46144x |  33.894 us | 111.06% |  10.837 us | 4.76% |   6.048G |  12.095 GB/s | 0.36% |
|    U8 |   2^20 = 1048576 |                0 |                 1 |     short |  41824x |  35.414 us |   4.60% |  11.958 us | 4.89% |  87.686G | 175.371 GB/s | 5.23% |
|    U8 |  2^24 = 16777216 |                0 |                 1 |     short |  20480x |  47.657 us |   4.18% |  24.419 us | 3.58% | 687.056G |   1.374 TB/s | 40.99% |
|    U8 | 2^28 = 268435456 |                0 |                 1 |     short |  2128x | 259.734 us |   0.92% | 236.669 us | 0.88% |   1.134T |   2.268 TB/s | 67.67% |
|    U8 |     2^16 = 65536 |             0.01 |                 1 |     short |  41776x |  35.682 us |   4.45% |  11.970 us | 3.58% |   5.475G |  10.950 GB/s | 0.33% |
|    U8 |   2^20 = 1048576 |             0.01 |                 1 |     short |  18272x |  51.031 us |   3.57% |  27.371 us | 2.40% |  38.310G |  76.620 GB/s | 2.29% |
|    U8 |  2^24 = 16777216 |             0.01 |                 1 |      long |  18384x |  53.162 us |   3.66% |  27.200 us | 4.25% | 616.801G |   1.234 TB/s | 36.80% |
|    U8 | 2^28 = 268435456 |             0.01 |                 1 |      long |  2000x | 279.249 us |   1.36% | 251.571 us | 0.81% |   1.067T |   2.134 TB/s | 63.66% |
|    U8 |     2^16 = 65536 |              0.3 |                 1 |     short |  12592x |  62.828 us |   2.74% |  39.736 us | 2.18% |   1.649G |   3.299 GB/s | 0.10% |
|    U8 |   2^20 = 1048576 |              0.3 |                 1 |      long |  37472x |  39.239 us |   3.98% |  13.343 us | 2.46% |  78.583G | 157.167 GB/s | 4.69% |
|    U8 |  2^24 = 16777216 |              0.3 |                 1 |      long |  18688x |  52.944 us |   3.46% |  26.766 us | 3.14% | 626.800G |   1.254 TB/s | 37.39% |
|    U8 | 2^28 = 268435456 |              0.3 |                 1 |      long |  2048x | 275.229 us |   0.88% | 247.331 us | 0.78% |   1.085T |   2.171 TB/s | 64.75% |
|    U8 |     2^16 = 65536 |              0.6 |                 1 |      long |  39472x |  38.926 us |   4.96% |  12.672 us | 1.56% |   5.172G |  10.343 GB/s | 0.31% |
|    U8 |   2^20 = 1048576 |              0.6 |                 1 |      long |  37360x |  39.747 us |   4.57% |  13.387 us | 2.52% |  78.330G | 156.659 GB/s | 4.67% |
|    U8 |  2^24 = 16777216 |              0.6 |                 1 |      long |  18368x |  53.530 us |   3.86% |  27.241 us | 4.10% | 615.890G |   1.232 TB/s | 36.74% |
|    U8 | 2^28 = 268435456 |              0.6 |                 1 |      long |  2000x | 279.710 us |   0.96% | 251.934 us | 0.81% |   1.066T |   2.131 TB/s | 63.57% |
|    U8 |     2^16 = 65536 |              0.9 |                 1 |      long |  40128x |  38.666 us |   4.83% |  12.460 us | 1.72% |   5.260G |  10.519 GB/s | 0.31% |
|    U8 |   2^20 = 1048576 |              0.9 |                 1 |      long |  37520x |  39.373 us |   4.41% |  13.329 us | 2.60% |  78.669G | 157.338 GB/s | 4.69% |
|    U8 |  2^24 = 16777216 |              0.9 |                 1 |      long |  19136x |  52.303 us |   3.54% |  26.132 us | 2.24% | 642.026G |   1.284 TB/s | 38.30% |
|    U8 | 2^28 = 268435456 |              0.9 |                 1 |      long |  2032x | 274.788 us |   0.93% | 246.708 us | 0.82% |   1.088T |   2.176 TB/s | 64.91% |
|   U16 |     2^16 = 65536 |                0 |                 1 |     short |  45040x |  34.610 us |   4.59% |  11.103 us | 4.86% |   2.951G |  11.805 GB/s | 0.35% |
|   U16 |   2^20 = 1048576 |                0 |                 1 |     short |  42528x |  35.665 us |   4.62% |  11.759 us | 4.38% |  44.584G | 178.337 GB/s | 5.32% |
|   U16 |  2^24 = 16777216 |                0 |                 1 |     short |  20672x |  48.036 us |   3.72% |  24.206 us | 3.00% | 346.557G |   1.386 TB/s | 41.35% |
|   U16 | 2^28 = 268435456 |                0 |                 1 |     short |  2080x | 265.741 us |   0.79% | 241.706 us | 0.61% | 555.293G |   2.221 TB/s | 66.26% |
|   U16 |     2^16 = 65536 |             0.01 |                 1 |     short |  42752x |  35.099 us |   4.64% |  11.698 us | 4.42% |   2.801G |  11.204 GB/s | 0.33% |
|   U16 |   2^20 = 1048576 |             0.01 |                 1 |     short |  24736x |  44.359 us |   3.69% |  20.221 us | 2.55% |  25.928G | 103.711 GB/s | 3.09% |
|   U16 |  2^24 = 16777216 |             0.01 |                 1 |      long |  17808x |  54.677 us |   3.95% |  28.102 us | 5.42% | 298.508G |   1.194 TB/s | 35.62% |
|   U16 | 2^28 = 268435456 |             0.01 |                 1 |      long |  2000x | 278.872 us |   0.97% | 250.929 us | 0.56% | 534.882G |   2.140 TB/s | 63.82% |
|   U16 |     2^16 = 65536 |              0.3 |                 1 |     short |  19136x |  49.925 us |   2.96% |  26.145 us | 2.12% |   1.253G |   5.013 GB/s | 0.15% |
|   U16 |   2^20 = 1048576 |              0.3 |                 1 |      long |  36640x |  40.262 us |   4.28% |  13.650 us | 2.64% |  38.408G | 153.632 GB/s | 4.58% |
|   U16 |  2^24 = 16777216 |              0.3 |                 1 |      long |  18256x |  53.619 us |   4.82% |  27.402 us | 4.39% | 306.135G |   1.225 TB/s | 36.53% |
|   U16 | 2^28 = 268435456 |              0.3 |                 1 |      long |  2022x | 275.156 us |   0.64% | 247.289 us | 0.48% | 542.757G |   2.171 TB/s | 64.76% |
|   U16 |     2^16 = 65536 |              0.6 |                 1 |      long |  39120x |  39.397 us |   4.61% |  12.782 us | 1.70% |   2.564G |  10.255 GB/s | 0.31% |
|   U16 |   2^20 = 1048576 |              0.6 |                 1 |      long |  36800x |  39.899 us |   4.38% |  13.589 us | 2.41% |  38.581G | 154.322 GB/s | 4.60% |
|   U16 |  2^24 = 16777216 |              0.6 |                 1 |      long |  18368x |  53.623 us |   3.58% |  27.239 us | 4.70% | 307.960G |   1.232 TB/s | 36.75% |
|   U16 | 2^28 = 268435456 |              0.6 |                 1 |      long |  1984x | 279.861 us |   0.67% | 252.231 us | 0.55% | 532.123G |   2.128 TB/s | 63.49% |
|   U16 |     2^16 = 65536 |              0.9 |                 1 |      long |  39472x |  39.065 us |   3.93% |  12.668 us | 1.82% |   2.587G |  10.347 GB/s | 0.31% |
|   U16 |   2^20 = 1048576 |              0.9 |                 1 |      long |  36752x |  39.642 us |   3.39% |  13.609 us | 2.70% |  38.525G | 154.099 GB/s | 4.60% |
|   U16 |  2^24 = 16777216 |              0.9 |                 1 |      long |  18992x |  52.456 us |   2.97% |  26.339 us | 2.57% | 318.487G |   1.274 TB/s | 38.00% |
|   U16 | 2^28 = 268435456 |              0.9 |                 1 |      long |  2032x | 274.679 us |   0.65% | 247.064 us | 0.56% | 543.252G |   2.173 TB/s | 64.82% |
|   U32 |     2^16 = 65536 |                0 |                 1 |     short |  46976x |  34.282 us |   4.74% |  10.645 us | 3.98% |   1.539G |  12.312 GB/s | 0.37% |
|   U32 |   2^20 = 1048576 |                0 |                 1 |     short |  41792x |  35.660 us |   4.27% |  11.968 us | 3.43% |  21.903G | 175.224 GB/s | 5.23% |
|   U32 |  2^24 = 16777216 |                0 |                 1 |     short |  21680x |  47.134 us |   3.67% |  23.070 us | 2.96% | 181.811G |   1.454 TB/s | 43.39% |
|   U32 | 2^28 = 268435456 |                0 |                 1 |     short |  2272x | 243.951 us |   0.81% | 220.410 us | 0.67% | 304.473G |   2.436 TB/s | 72.66% |
|   U32 |     2^16 = 65536 |             0.01 |                 1 |     short |  45856x |  34.771 us |   4.38% |  10.907 us | 3.03% |   1.502G |  12.017 GB/s | 0.36% |
|   U32 |   2^20 = 1048576 |             0.01 |                 1 |     short |  32080x |  39.290 us |   4.39% |  15.591 us | 3.40% |  16.814G | 134.513 GB/s | 4.01% |
|   U32 |  2^24 = 16777216 |             0.01 |                 1 |      long |  19760x |  51.343 us |   9.95% |  25.304 us | 3.75% | 165.754G |   1.326 TB/s | 39.56% |
|   U32 | 2^28 = 268435456 |             0.01 |                 1 |      long |  2304x | 245.343 us |   0.81% | 217.462 us | 0.57% | 308.601G |   2.469 TB/s | 73.64% |
|   U32 |     2^16 = 65536 |              0.3 |                 1 |     short |  27712x |  41.685 us |   4.08% |  18.045 us | 2.35% | 907.931M |   7.263 GB/s | 0.22% |
|   U32 |   2^20 = 1048576 |              0.3 |                 1 |      long |  39024x |  39.199 us |   4.80% |  12.813 us | 2.57% |  20.459G | 163.675 GB/s | 4.88% |
|   U32 |  2^24 = 16777216 |              0.3 |                 1 |      long |  20272x |  51.157 us |   3.03% |  24.680 us | 1.74% | 169.949G |   1.360 TB/s | 40.56% |
|   U32 | 2^28 = 268435456 |              0.3 |                 1 |      long |  2258x | 249.763 us |   0.90% | 221.527 us | 0.49% | 302.937G |   2.423 TB/s | 72.29% |
|   U32 |     2^16 = 65536 |              0.6 |                 1 |      long |  40800x |  38.655 us |  14.24% |  12.258 us | 1.62% |   1.337G |  10.692 GB/s | 0.32% |
|   U32 |   2^20 = 1048576 |              0.6 |                 1 |      long |  39168x |  39.400 us |   5.05% |  12.766 us | 2.21% |  20.535G | 164.277 GB/s | 4.90% |
|   U32 |  2^24 = 16777216 |              0.6 |                 1 |      long |  20240x |  50.858 us |   3.02% |  24.712 us | 1.70% | 169.726G |   1.358 TB/s | 40.50% |
|   U32 | 2^28 = 268435456 |              0.6 |                 1 |      long |  2208x | 254.819 us |   0.70% | 226.636 us | 0.60% | 296.108G |   2.369 TB/s | 70.66% |
|   U32 |     2^16 = 65536 |              0.9 |                 1 |      long |  43136x |  38.048 us |   4.46% |  11.594 us | 2.58% |   1.413G |  11.305 GB/s | 0.34% |
|   U32 |   2^20 = 1048576 |              0.9 |                 1 |      long |  39760x |  38.981 us |   4.31% |  12.579 us | 2.08% |  20.840G | 166.723 GB/s | 4.97% |
|   U32 |  2^24 = 16777216 |              0.9 |                 1 |      long |  20560x |  50.189 us |   2.40% |  24.324 us | 1.22% | 172.438G |   1.380 TB/s | 41.15% |
|   U32 | 2^28 = 268435456 |              0.9 |                 1 |      long |  2272x | 248.510 us |   0.60% | 220.833 us | 0.54% | 303.890G |   2.431 TB/s | 72.52% |
|   U64 |     2^16 = 65536 |                0 |                 1 |     short |  47648x |  34.180 us |   4.90% |  10.494 us | 5.95% | 780.643M |  12.490 GB/s | 0.37% |
|   U64 |   2^20 = 1048576 |                0 |                 1 |     short |  41760x |  35.918 us |   4.55% |  11.975 us | 4.92% |  10.946G | 175.128 GB/s | 5.22% |
|   U64 |  2^24 = 16777216 |                0 |                 1 |     short |  22064x |  46.016 us |   2.96% |  22.674 us | 2.30% |  92.493G |   1.480 TB/s | 44.15% |
|   U64 | 2^28 = 268435456 |                0 |                 1 |     short |  2288x | 244.486 us |   0.68% | 219.435 us | 0.62% | 152.913G |   2.447 TB/s | 72.98% |
|   U64 |     2^16 = 65536 |             0.01 |                 1 |     short |  46928x |  34.014 us |   4.72% |  10.656 us | 3.89% | 768.753M |  12.300 GB/s | 0.37% |
|   U64 |   2^20 = 1048576 |             0.01 |                 1 |     short |  38496x |  36.333 us |  11.93% |  12.989 us | 5.22% |  10.091G | 161.454 GB/s | 4.82% |
|   U64 |  2^24 = 16777216 |             0.01 |                 1 |      long |  19952x |  51.008 us |   3.20% |  25.065 us | 3.34% |  83.669G |   1.339 TB/s | 39.93% |
|   U64 | 2^28 = 268435456 |             0.01 |                 1 |      long |  2304x | 245.746 us |   1.24% | 217.854 us | 0.59% | 154.022G |   2.464 TB/s | 73.51% |
|   U64 |     2^16 = 65536 |              0.3 |                 1 |     short |  35184x |  37.422 us |   4.01% |  14.211 us | 2.44% | 576.437M |   9.223 GB/s | 0.28% |
|   U64 |   2^20 = 1048576 |              0.3 |                 1 |      long |  39632x |  38.600 us |   3.96% |  12.619 us | 2.57% |  10.387G | 166.184 GB/s | 4.96% |
|   U64 |  2^24 = 16777216 |              0.3 |                 1 |      long |  20480x |  50.823 us |   3.38% |  24.418 us | 1.65% |  85.887G |   1.374 TB/s | 40.99% |
|   U64 | 2^28 = 268435456 |              0.3 |                 1 |      long |  2258x | 249.375 us |   0.81% | 221.495 us | 0.49% | 151.491G |   2.424 TB/s | 72.30% |
|   U64 |     2^16 = 65536 |              0.6 |                 1 |      long |  41136x |  38.147 us |   4.69% |  12.158 us | 1.45% | 673.799M |  10.781 GB/s | 0.32% |
|   U64 |   2^20 = 1048576 |              0.6 |                 1 |      long |  39904x |  38.746 us |   4.62% |  12.530 us | 2.63% |  10.461G | 167.369 GB/s | 4.99% |
|   U64 |  2^24 = 16777216 |              0.6 |                 1 |      long |  20352x |  50.128 us |   2.68% |  24.574 us | 2.06% |  85.339G |   1.365 TB/s | 40.73% |
|   U64 | 2^28 = 268435456 |              0.6 |                 1 |      long |  2208x | 253.839 us |   0.77% | 226.465 us | 0.61% | 148.166G |   2.371 TB/s | 70.72% |
|   U64 |     2^16 = 65536 |              0.9 |                 1 |      long |  44288x |  37.328 us |   4.24% |  11.292 us | 1.76% | 725.465M |  11.607 GB/s | 0.35% |
|   U64 |   2^20 = 1048576 |              0.9 |                 1 |      long |  41792x |  38.166 us |   4.32% |  11.968 us | 2.64% |  10.952G | 175.235 GB/s | 5.23% |
|   U64 |  2^24 = 16777216 |              0.9 |                 1 |      long |  20912x |  49.680 us |   3.29% |  23.926 us | 1.71% |  87.651G |   1.402 TB/s | 41.83% |
|   U64 | 2^28 = 268435456 |              0.9 |                 1 |      long |  2304x | 245.565 us |   0.88% | 217.764 us | 0.53% | 154.086G |   2.465 TB/s | 73.54% |
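The derived columns appear to follow from Bytes{io} and GPU Time as sketched below. This is inferred from the numbers in the table, not from the benchmark source: the names `Derived` and `derive_columns` are hypothetical, and the peak bandwidth of roughly 3.35 TB/s is an assumption obtained by dividing GlobalMem BW by BWUtil.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical container for the derived benchmark columns.
struct Derived
{
  double elem_per_s;  // Elem/s
  double bytes_per_s; // GlobalMem BW
  double bw_util;     // BWUtil as a fraction of the assumed peak
};

// io_bytes:         the Bytes{io} column (size of the rotated buffer in bytes)
// elem_size:        sizeof(T), e.g. 1 for U8 and 8 for U64
// gpu_time_s:       the GPU Time column in seconds
// peak_bytes_per_s: assumed peak memory bandwidth of the device
inline Derived derive_columns(std::size_t io_bytes, std::size_t elem_size, double gpu_time_s, double peak_bytes_per_s)
{
  assert(elem_size > 0 && gpu_time_s > 0.0 && peak_bytes_per_s > 0.0);
  const double bytes       = static_cast<double>(io_bytes);
  const double elems       = bytes / static_cast<double>(elem_size);
  const double bytes_moved = 2.0 * bytes; // every byte is read once and written once
  const double bandwidth   = bytes_moved / gpu_time_s;
  return {elems / gpu_time_s, bandwidth, bandwidth / peak_bytes_per_s};
}
```

For example, the U8 row with 2^28 bytes and 236.669 us gives about 1.134T elements/s and 2.268 TB/s, which matches the table; with an assumed peak of 3.352 TB/s the utilization comes out at about 67.7%.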

Contributor

copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL May 18, 2026
@bernhardmgruber
Contributor

Thank you for the pull request! I looked briefly through the implementation and there is still a lot of work left to align it to how CUB works. I don't have time for this over the next two weeks, maybe somebody else can help. I think we should address:

  • change the unit tests to use launch wrappers, so they test device-side use and graph capture uniformly
  • properly separate all tuning parameters into tuning policies and use a policy selector
  • replace inline PTX with intrinsics
  • get rid of CG (it does not work well with nvc++). The invoke_one should not be necessary; cuda::memcpy_async does that internally if you pass the right group (I would need to look up the implementation)
  • get rid of RotateState_t in the public API. If it is needed, it must be created on the fly during the CUB algorithm's invocation
  • benchmark on B200

@mfranzrebsal
Author

IMO it would make sense to first focus on finalizing the implementation and addressing my questions in the description before adapting it to CUB, since a lot of code will probably change from that.

@bernhardmgruber
Contributor

> IMO it would make sense to first focus on finalizing the implementation and addressing my questions in the description before adapting it to CUB, since a lot of code will probably change from that.

Right.

> I would like to understand how/why the pipelined version performs worse, since theoretically the algorithm seems superior, and the metrics mostly look better.

Maybe @ahendriksen can help you. If you have ncu profiles, you can ask him on Slack whether he would be willing to review them and provide feedback.
