run_to_run deterministic device scan by srinivasyadav18 · Pull Request #9098 · NVIDIA/cccl

srinivasyadav18 · 2026-05-21T17:40:25Z

Description

Implements a run-to-run deterministic device scan by building a fixed reduction tree: tile aggregates are always reduced in chunks of 32.
With this approach, both the warpspeed look-ahead algorithm and the decoupled lookback are run-to-run deterministic.
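To illustrate why a fixed reduction tree gives run-to-run determinism, here is a minimal host-side C++ sketch. It is not the PR's device code: `chunk_size`, `reduce_level`, and `fixed_tree_reduce` are hypothetical names, and the multi-level tree is only one plausible reading of "chunks of 32". The point is that the order in which aggregates are combined depends only on their count, never on which tile finished first, so a non-associative operator such as floating-point addition yields the same bits on every run.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical chunk width, taken from the "chunks of 32" in the description.
constexpr std::size_t chunk_size = 32;

// Reduce one tree level: combine each run of up to 32 consecutive values,
// always left to right, into a single partial result.
template <typename T, typename Op>
std::vector<T> reduce_level(const std::vector<T>& in, Op op)
{
  std::vector<T> out;
  for (std::size_t begin = 0; begin < in.size(); begin += chunk_size)
  {
    const std::size_t end = (begin + chunk_size < in.size()) ? begin + chunk_size : in.size();
    T acc = in[begin];
    for (std::size_t i = begin + 1; i < end; ++i)
    {
      acc = op(acc, in[i]);
    }
    out.push_back(acc);
  }
  return out;
}

// Reduce all tile aggregates with a tree whose shape depends only on the
// number of aggregates, not on any timing.
template <typename T, typename Op>
T fixed_tree_reduce(std::vector<T> aggregates, Op op)
{
  assert(!aggregates.empty());
  while (aggregates.size() > 1)
  {
    aggregates = reduce_level(aggregates, op);
  }
  return aggregates[0];
}
```

Calling `fixed_tree_reduce` twice on the same `float` aggregates returns bitwise-identical results, whereas a lookback that combines whatever partial sums happen to be ready can associate the additions differently from run to run.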

This PR also experiments with warpspeed on Hopper (SM90, H100), using a global atomic flag in place of the work stealing used on SM100. On Hopper it performs better than decoupled lookback for large problem sizes.

Some initial benchmarks

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

bernhardmgruber

This is some impressive work! Because it touches a lot of the extremely performance sensitive pieces of warpspeed scan and lookback, we should split this work into small pieces and review them very carefully (including benchmarks and SASS diffs).

Here are some suggestions:

Allowing warpspeed scan to run on Hopper should be its own PR. That should just include the extension of the kernel by the atomic block index counter and the alternate path for the masked bulk copy store. That's a holistic changeset which can be merged easily with a benchmark on Hopper and a SASS diff for Blackwell.
Then the determinism extensions for warpspeed scan. The kernel extensions are less critical since they only affect scan, so we can bear better with the host code changes.
Then the determinism extensions to the lookback.

bernhardmgruber · 2026-05-21T18:44:40Z

+  warpspeed,
+  lookback_deterministic,
+  warpspeed_deterministic,
+  warpspeed_deterministic_atomic


Q: Do we have a reason to believe that the atomic version of the kernel could be faster than cluster launch control on Blackwell?

Remark: Furthermore, I don't think determinism should be part of any tuning information. It's a functional requirement and should be passed through the kernel's template arguments.
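As a rough illustration of this remark, the sketch below keeps tuning knobs and the determinism requirement in separate places. All names (`tuning_policy`, `determinism`, `uses_fixed_reduction_order`) are hypothetical and are not CCCL's actual types; the function is a host-side stand-in for a kernel entry point.

```cpp
#include <cassert>

// Hypothetical tuning information: only knobs that affect speed, never results.
struct tuning_policy
{
  int threads_per_block;
  int items_per_thread;
};

// Hypothetical functional requirement, kept separate from tuning.
enum class determinism
{
  not_guaranteed,
  run_to_run
};

// Stand-in for a kernel entry point: the requirement is a template argument,
// so swapping the tuning policy cannot silently change the guarantee.
template <determinism Requirement>
constexpr bool uses_fixed_reduction_order(const tuning_policy&)
{
  return Requirement == determinism::run_to_run;
}
```

With this split, an autotuner is free to pick any `tuning_policy` while the caller's choice of `determinism` stays fixed at compile time.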

bernhardmgruber · 2026-05-21T18:51:20Z

+        NV_IF_ELSE_TARGET(
+          NV_PROVIDES_SM_100,
+          ({
+            if (::cuda::ptx::elect_sync(~0))
+            {
+              ::cuda::ptx::cp_async_bulk_cp_mask(
+                ::cuda::ptx::space_global,
+                ::cuda::ptx::space_shared,
+                cpAsyncOobInfo.ptrGmemStartAlignDown,
+                srcSmem,
+                /*size*/ 16,
+                byteMaskStart);
+            }
+          }),
+          ({
+            const int rank = squad.threadRank();
+            if (rank < 16 && ((byteMaskStart >> rank) & 1u))
+            {
+              reinterpret_cast<::cuda::std::byte*>(cpAsyncOobInfo.ptrGmemStartAlignDown)[rank] = srcSmem[rank];
+            }
+          }));


Important: This seems to repeat a few times, please create a dedicated function for it (can also be a lambda).
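One possible shape for such a helper is sketched below, as a host-side emulation of the non-SM100 fallback branch only. The name `masked_store_16` and the 16-bit mask type are assumptions; in the kernel, each `rank` would be one thread of the squad rather than a loop iteration, and the helper would also wrap the `NV_IF_ELSE_TARGET` dispatch to the bulk-copy path.

```cpp
#include <cassert>
#include <cstdint>

// Host-side emulation of the fallback in the diff: byte `rank` of a 16-byte
// window is copied only when bit `rank` of the mask is set.
// Hypothetical helper; the loop stands in for 16 cooperating threads.
inline void masked_store_16(unsigned char* dst, const unsigned char* src, std::uint16_t byte_mask)
{
  for (int rank = 0; rank < 16; ++rank)
  {
    if ((byte_mask >> rank) & 1u)
    {
      dst[rank] = src[rank];
    }
  }
}
```

Each call site in the kernel would then shrink to a single call, which keeps the two architecture paths in one place for review and SASS comparison.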

bernhardmgruber · 2026-05-21T18:52:18Z

@@ -73,10 +72,10 @@ _CCCL_DEVICE_API void
 storeTileAggregate(tile_state_t<AccumT>* ptrTileStates, scan_state scanState, AccumT sum, int index)
 {
  _CCCL_ASSERT(::cuda::is_aligned(ptrTileStates, alignof(tile_state_t<AccumT>)), "");
-  _CCCL_ASSERT(index >= 0 && index < gridDim.x, "Reading out of bounds tile state");
+  _CCCL_ASSERT(index >= 0, "Negative tile state index");


Important: Why do we need to weaken the check for all architectures? Can't we keep it for the non-deterministic version on Blackwell?
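A minimal sketch of how the stricter check could be kept where it still applies, assuming the weaker bound is only needed on the deterministic path (the thread does not state why the bound was relaxed, so that condition is a guess). `tile_index_is_valid` and `num_tile_states` are hypothetical names; the latter stands in for `gridDim.x`.

```cpp
#include <cassert>

// Hypothetical helper: keep the strict upper bound unless the deterministic
// path is selected. Which configurations need the weaker check is an assumption.
template <bool RunToRunDeterministic>
constexpr bool tile_index_is_valid(int index, int num_tile_states)
{
  return index >= 0 && (RunToRunDeterministic || index < num_tile_states);
}
```

The assertion in `storeTileAggregate` could then test this predicate, so the non-deterministic build keeps its original out-of-bounds protection.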

bernhardmgruber · 2026-05-21T18:53:37Z

+template <bool RunToRunDeterministic, int numTileStatesPerThread, typename AccumT, typename ScanOpT>
 [[nodiscard]] _CCCL_DEVICE_API _CCCL_FORCEINLINE AccumT warpIncrementalLookback(
  SpecialRegisters specialRegisters,
  tile_state_t<AccumT>* ptrTileStates,
-  const int idxTilePrev,
-  const AccumT sumExclusiveCtaPrev,
+  int& idxTilePrev,
+  AccumT& sumExclusiveCtaPrev,
  const int idxTileNext,
  ScanOpT& scan_op)
 {


Suggestion: Given the large branch on RunToRunDeterministic below and the different mechanics, I think we should try to just separate the implementation into two warpIncrementalLookback functions. The old one, and the deterministic one. I think this will reduce confusion.

github-actions · 2026-05-21T19:42:42Z

😬 CI Workflow Results

🟥 Finished in 2h 00m: Pass: 24%/269 | Total: 3d 12h | Max: 1h 42m | Hits: 70%/59525

See results here.

run_to_run deterministic device scan

3c33cdf

srinivasyadav18 requested review from a team as code owners May 21, 2026 17:40

github-project-automation Bot added this to CCCL May 21, 2026

srinivasyadav18 requested a review from oleksandr-pavlyk May 21, 2026 17:40

github-project-automation Bot moved this to Todo in CCCL May 21, 2026

srinivasyadav18 requested a review from davebayer May 21, 2026 17:40

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 21, 2026

bernhardmgruber reviewed May 21, 2026

srinivasyadav18 wants to merge 1 commit into NVIDIA:main from srinivasyadav18:run_to_run_scan_pr
