CUB DeviceRotate #9061
Thank you for the pull request! I looked briefly through the implementation and there is still a lot of work left to align it to how CUB works. I don't have time for this over the next two weeks, maybe somebody else can help. I think we should address:
IMO it would make sense to first focus on finalizing the implementation and addressing my questions in the description before adapting it to CUB, since a lot of code will probably change from that.
Right.
Maybe @ahendriksen can help you. If you have ncu profiles, you can ask him on Slack if we would be willing to review them and provide feedback.
Description
This MR adds a device-wide implementation of the in-place rotate algorithm for CUB. The changes contain three implementations (`short`, `long`, and `naive`), since the first two (which are truly in-place) can only be used when the rotate distance satisfies some specific criteria. For more information about the algorithms, please take a look at these (NVIDIA internal) slides.
For the `short` algorithm, I have added two implementations: one that launches as many CTAs per SM as possible, and another that instead uses pipelining combined with a different version of the algorithm. The motivation behind the pipelined version was to try to achieve SOL on H100, but when comparing profiles (link) on H100 PCIe (attached), the pipelined version is 15% slower, while the metrics that look relevant to me compare as follows:
a. ⅓ long scoreboard.
b. 18x less barrier.
I would like to understand how/why the pipelined version performs worse, since theoretically the algorithm seems superior, and the metrics mostly look better. And what optimizations would be necessary to reach SOL on H100 and even B100? I'd be very interested to discuss this! The version in use can be selected by changing the variable `constexpr bool USE_SHORT_PIPELINE`.
The way the `long` algorithm works is also a bit unusual: when querying the scratch space, a `RotateState_t` struct also gets filled, and it must be passed unchanged to the following function call in order to avoid repeating the dependency graph creation. Would this be acceptable to have in CUB? Alternatively, if we allow copying the first K (= rotate distance) elements to a scratch buffer, we would no longer need that dependency graph. The downside is that the memory movement is then `2N + 2K` instead of `2N`, which would substantially reduce the theoretically achievable performance when `K ~ N/2`.
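To make the `2N + 2K` figure concrete, here is a minimal host-side sketch of the scratch-buffer alternative. It is not the code in this MR: `rotate_with_scratch_sketch` and `TrafficCounter` are hypothetical names, the loops are sequential stand-ins for device kernels, and a left rotate (the first K elements move to the end) is assumed. The only CUB convention it borrows is the usual two-phase call, where passing a null temp-storage pointer just reports the required size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of CUB): tallies element reads and writes.
struct TrafficCounter {
  std::size_t reads  = 0;
  std::size_t writes = 0;
};

// Hypothetical sketch: left-rotates data[0, n) by k via a k-element scratch buffer.
// Mirrors the CUB two-phase convention: if temp_storage is nullptr, only
// temp_storage_bytes is written and no work is done.
// Assumes 0 < k <= n and that temp_storage is suitably aligned for T.
template <typename T>
void rotate_with_scratch_sketch(void* temp_storage,
                                std::size_t& temp_storage_bytes,
                                T* data,
                                std::size_t n,
                                std::size_t k,
                                TrafficCounter& traffic) {
  assert(k <= n);
  if (temp_storage == nullptr) {
    temp_storage_bytes = k * sizeof(T);  // size query only
    return;
  }
  T* scratch = static_cast<T*>(temp_storage);

  // Step 1: save the first k elements (k reads, k writes).
  for (std::size_t i = 0; i < k; ++i) {
    scratch[i] = data[i];
  }
  traffic.reads += k;
  traffic.writes += k;

  // Step 2: shift the remaining n - k elements left by k (n - k reads, n - k writes).
  // Iterating forward is safe sequentially because each destination precedes its source.
  for (std::size_t i = k; i < n; ++i) {
    data[i - k] = data[i];
  }
  traffic.reads += n - k;
  traffic.writes += n - k;

  // Step 3: append the saved elements at the end (k reads, k writes).
  for (std::size_t i = 0; i < k; ++i) {
    data[n - k + i] = scratch[i];
  }
  traffic.reads += k;
  traffic.writes += k;
}
```

Summing the three steps gives `2K + 2(N - K) + 2K = 2N + 2K` element moves. At `K = N/2` that is `3N` versus `2N` for a truly in-place rotate, so a bandwidth-bound implementation of this variant could reach at most about two thirds of the in-place throughput. No state struct is needed here, because the scratch copy removes the ordering constraints that the dependency graph encodes.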
Benchmark Results
`RotatePercentage = 0` means a rotate distance of 1. Interesting here is how the performance does not really change between 1B and 8B data types, indicating that the extra work needed for the 1B type is "free".
rotate_benchmark
[0] NVIDIA H100 80GB HBM3