run_to_run deterministic device scan #9098
bernhardmgruber left a comment
This is some impressive work! Because it touches many of the extremely performance-sensitive pieces of warpspeed scan and lookback, we should split this work into small pieces and review them very carefully (including benchmarks and SASS diffs).
Here are some suggestions:
- Allowing warpspeed scan to run on Hopper should be its own PR. That should just include the extension of the kernel by the atomic block index counter and the alternate path for the masked bulk copy store. That's a holistic changeset which can be merged easily with a benchmark on Hopper and a SASS diff for Blackwell.
- Then the determinism extensions for warpspeed scan. The kernel extensions are less critical since they only affect scan, so the accompanying host code changes are easier to accept.
- Then the determinism extensions to the lookback.
| warpspeed, | ||
| lookback_deterministic, | ||
| warpspeed_deterministic, | ||
| warpspeed_deterministic_atomic |
Q: Do we have a reason to believe that the atomic version of the kernel could be faster than cluster launch control on Blackwell?
Remark: Furthermore, I don't think determinism should be part of any tuning information. It's a functional requirement and should be passed through the kernel's template arguments.
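A minimal sketch of that separation, in host-only C++. All names here (`scan_tuning`, `determinism`, `uses_fixed_reduction_order`) are hypothetical and are not CUB's actual types; the point is only that the tuning struct carries performance knobs while the determinism requirement travels as a template argument.

```cpp
#include <cassert>

// Hypothetical tuning struct: only performance knobs live here.
struct scan_tuning
{
  int threads_per_block;
  int items_per_thread;
};

// Hypothetical functional requirement, kept out of the tuning struct.
enum class determinism
{
  not_guaranteed,
  run_to_run
};

// The requirement is a template argument, so swapping the tuning values
// can never change whether the result is run-to-run deterministic.
template <determinism Determinism>
constexpr bool uses_fixed_reduction_order(scan_tuning /*tuning*/)
{
  return Determinism == determinism::run_to_run;
}
```

With this shape, two tunings with different block sizes instantiate the same deterministic code path as long as the template argument is `determinism::run_to_run`.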
| NV_IF_ELSE_TARGET( | ||
| NV_PROVIDES_SM_100, | ||
| ({ | ||
| if (::cuda::ptx::elect_sync(~0)) | ||
| { | ||
| ::cuda::ptx::cp_async_bulk_cp_mask( | ||
| ::cuda::ptx::space_global, | ||
| ::cuda::ptx::space_shared, | ||
| cpAsyncOobInfo.ptrGmemStartAlignDown, | ||
| srcSmem, | ||
| /*size*/ 16, | ||
| byteMaskStart); | ||
| } | ||
| }), | ||
| ({ | ||
| const int rank = squad.threadRank(); | ||
| if (rank < 16 && ((byteMaskStart >> rank) & 1u)) | ||
| { | ||
| reinterpret_cast<::cuda::std::byte*>(cpAsyncOobInfo.ptrGmemStartAlignDown)[rank] = srcSmem[rank]; | ||
| } | ||
| })); |
Important: This seems to repeat a few times; please create a dedicated function for it (can also be a lambda).
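One possible shape for such a helper, sketched for the fallback branch only and written as host C++ so it can be run without a GPU. The names `maskedStoreByte` and `maskedStore16` are hypothetical, `unsigned char` stands in for `::cuda::std::byte`, and the loop over `rank` stands in for every thread of the squad calling the helper once.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the thread with the given rank copies byte `rank` of a
// 16-byte chunk, but only if bit `rank` of byteMask is set.
inline void maskedStoreByte(unsigned char* dst, const unsigned char* src, std::uint32_t byteMask, int rank)
{
  if (rank >= 0 && rank < 16 && ((byteMask >> rank) & 1u))
  {
    dst[rank] = src[rank];
  }
}

// Host-side stand-in for "each thread of the squad calls the helper once".
inline void maskedStore16(unsigned char* dst, const unsigned char* src, std::uint32_t byteMask, int numThreads = 32)
{
  for (int rank = 0; rank < numThreads; ++rank)
  {
    maskedStoreByte(dst, src, byteMask, rank);
  }
}
```

Each call site would then reduce to a single call with its own destination pointer and mask, with the `NV_IF_ELSE_TARGET` selection living inside the helper.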
| @@ -73,10 +72,10 @@ _CCCL_DEVICE_API void | |||
| storeTileAggregate(tile_state_t<AccumT>* ptrTileStates, scan_state scanState, AccumT sum, int index) | |||
| { | |||
| _CCCL_ASSERT(::cuda::is_aligned(ptrTileStates, alignof(tile_state_t<AccumT>)), ""); | |||
| - _CCCL_ASSERT(index >= 0 && index < gridDim.x, "Reading out of bounds tile state"); | |||
| + _CCCL_ASSERT(index >= 0, "Negative tile state index"); | |||
Important: Why do we need to weaken the check for all architectures? Can't we keep it for the non-deterministic version on Blackwell?
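A sketch of how the stricter bound could be kept where it still holds, assuming the weaker bound is only needed on the deterministic path (the thread does not confirm this). `tileIndexIsValid` is a hypothetical predicate and `numTileStates` stands in for `gridDim.x`.

```cpp
#include <cassert>

// Hypothetical predicate combining the two assertions from the diff.
template <bool RunToRunDeterministic>
constexpr bool tileIndexIsValid(int index, int numTileStates)
{
  // Deterministic path: only reject negative indices (the weakened check).
  // Non-deterministic path: keep the original upper bound as well.
  return RunToRunDeterministic ? (index >= 0) : (index >= 0 && index < numTileStates);
}
```

The assertion in `storeTileAggregate` could then test this predicate, so the non-deterministic instantiation keeps catching out-of-bounds reads.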
| template <bool RunToRunDeterministic, int numTileStatesPerThread, typename AccumT, typename ScanOpT> | ||
| [[nodiscard]] _CCCL_DEVICE_API _CCCL_FORCEINLINE AccumT warpIncrementalLookback( | ||
| SpecialRegisters specialRegisters, | ||
| tile_state_t<AccumT>* ptrTileStates, | ||
| - const int idxTilePrev, | ||
| - const AccumT sumExclusiveCtaPrev, | ||
| + int& idxTilePrev, | ||
| + AccumT& sumExclusiveCtaPrev, | ||
| const int idxTileNext, | ||
| ScanOpT& scan_op) | ||
| { |
Suggestion: Given the large branch on RunToRunDeterministic below and the different mechanics, I think we should try to just separate the implementation into two warpIncrementalLookback functions. The old one, and the deterministic one. I think this will reduce confusion.
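The split could look roughly like the following host-only sketch. The function names are hypothetical and the bodies are placeholders (each tile simply contributes 1), not the real lookback; only the signatures and the single dispatch point are meant to illustrate the idea.

```cpp
#include <cassert>
#include <type_traits>

// Old variant: inputs by value, nothing is written back to the caller.
inline int lookbackClassic(const int idxTilePrev, const int sumExclusiveCtaPrev, const int idxTileNext)
{
  // Placeholder body: every tile in [idxTilePrev, idxTileNext) contributes 1.
  return sumExclusiveCtaPrev + (idxTileNext - idxTilePrev);
}

// Deterministic variant: carries its state across calls through references.
inline int lookbackDeterministic(int& idxTilePrev, int& sumExclusiveCtaPrev, const int idxTileNext)
{
  sumExclusiveCtaPrev += idxTileNext - idxTilePrev; // placeholder body
  idxTilePrev = idxTileNext;
  return sumExclusiveCtaPrev;
}

// Tag dispatch: the two overloads pick one of the two implementations.
inline int lookbackDispatch(std::false_type, int& idxTilePrev, int& sumPrev, int idxTileNext)
{
  return lookbackClassic(idxTilePrev, sumPrev, idxTileNext);
}

inline int lookbackDispatch(std::true_type, int& idxTilePrev, int& sumPrev, int idxTileNext)
{
  return lookbackDeterministic(idxTilePrev, sumPrev, idxTileNext);
}

// Single entry point; the template argument selects the variant at compile time.
template <bool RunToRunDeterministic>
int lookback(int& idxTilePrev, int& sumPrev, int idxTileNext)
{
  return lookbackDispatch(std::integral_constant<bool, RunToRunDeterministic>{}, idxTilePrev, sumPrev, idxTileNext);
}
```

The old function keeps its by-value signature untouched, which should also make the SASS diff for the non-deterministic path easier to read.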
CI Workflow Results: 🟥 Finished in 2h 00m | Pass: 24%/269 | Total: 3d 12h | Max: 1h 42m | Hits: 70%/59525
Description
closes #7556
Implements a run-to-run deterministic device scan by building a fixed reduction tree: tile aggregates are reduced in chunks of 32.
With this approach, both the warpspeed look-ahead algorithm and decoupled lookback are run-to-run deterministic.
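To illustrate why a fixed tree gives run-to-run determinism, here is a serial host-side sketch. It is not the PR's kernel code: `reduceFixedChunks` is a hypothetical name, and the in-chunk reduction is a plain left-to-right sum rather than a warp-level reduction. The grouping depends only on element indices, never on which tile finished first, so the floating-point rounding is the same on every run.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reduce `values` in fixed chunks of `chunkSize`, then reduce the chunk
// partials the same way until a single value remains.
inline float reduceFixedChunks(std::vector<float> values, std::size_t chunkSize = 32)
{
  assert(chunkSize >= 2); // a chunk size of 1 would never shrink the input
  if (values.empty())
  {
    return 0.0f;
  }
  while (values.size() > 1)
  {
    std::vector<float> partials;
    for (std::size_t begin = 0; begin < values.size(); begin += chunkSize)
    {
      const std::size_t end = std::min(begin + chunkSize, values.size());
      float partial = values[begin];
      for (std::size_t i = begin + 1; i < end; ++i)
      {
        partial += values[i]; // fixed left-to-right order inside the chunk
      }
      partials.push_back(partial);
    }
    values.swap(partials);
  }
  return values[0];
}
```

For 70 inputs this produces partials for elements 0..31, 32..63 and 64..69, which are then combined in one final chunk.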
This PR also experiments with warpspeed on Hopper (SM90, H100) by using a global atomic flag in place of the work stealing used on SM100; it performs better than decoupled lookback for large problem sizes.
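A host-side analogue of claiming tiles through a global atomic counter, as a rough sketch only. `claimNextTile` is a hypothetical name and `std::atomic<int>` stands in for a counter in global memory that a kernel would bump with an atomic add; the PR's actual mechanism may differ.

```cpp
#include <atomic>
#include <cassert>

// Each caller (a CTA in the real kernel) takes the next unclaimed tile index.
// Returns -1 once all tiles have been handed out.
inline int claimNextTile(std::atomic<int>& nextTile, int numTiles)
{
  const int tile = nextTile.fetch_add(1, std::memory_order_relaxed);
  return tile < numTiles ? tile : -1;
}
```

Because tile indices are handed out in increasing order, a CTA that holds tile `i` knows that tiles `0..i-1` were already claimed by a running or finished CTA, which is the property a lookback relies on.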
Some initial benchmarks

Checklist