run_to_run deterministic device scan#9098

Open
srinivasyadav18 wants to merge 1 commit into
NVIDIA:mainfrom
srinivasyadav18:run_to_run_scan_pr


@srinivasyadav18 srinivasyadav18 commented May 21, 2026

Description

closes #7556

Implements a run-to-run deterministic device scan by building a fixed reduction tree: tile aggregates are reduced in chunks of 32.
With this approach, both the warpspeed look-ahead algorithm and the decoupled look-back are run-to-run deterministic.
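
To illustrate why a fixed grouping gives reproducible results, here is a minimal host-side sketch, not the PR's kernel code. It assumes `float` aggregates and combines them sequentially in chunks of 32. The names `reduce_chunk` and `fixed_tree_exclusive_prefix` are hypothetical, and the actual tree shape used in the PR is not shown on this page.

```cpp
#include <cstddef>
#include <vector>

// Chunk width taken from the PR description.
constexpr std::size_t chunk_size = 32;

// Hypothetical helper: reduce aggregates [begin, end) strictly left to right.
inline float reduce_chunk(const std::vector<float>& aggs, std::size_t begin, std::size_t end)
{
  float sum = 0.0f;
  for (std::size_t i = begin; i < end; ++i)
  {
    sum += aggs[i];
  }
  return sum;
}

// Hypothetical helper: exclusive prefix of the tile aggregates for tile `tile`.
// Precondition: tile <= aggs.size().
// Full chunks are reduced first and combined left to right, then the trailing
// partial chunk is added. The grouping depends only on `tile`, never on which
// predecessor happened to publish its result first.
inline float fixed_tree_exclusive_prefix(const std::vector<float>& aggs, std::size_t tile)
{
  float prefix      = 0.0f;
  std::size_t begin = 0;
  for (; begin + chunk_size <= tile; begin += chunk_size)
  {
    prefix += reduce_chunk(aggs, begin, begin + chunk_size);
  }
  prefix += reduce_chunk(aggs, begin, tile);
  return prefix;
}
```

Floating-point addition is not associative, so a look-back that stops at whichever predecessor is ready first can group the same addends differently from run to run. Fixing the grouping as above removes that source of variation.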

This PR also experiments with running warpspeed on Hopper (SM90, H100) by using a global atomic flag in place of the work stealing used on SM100. There it performs better than the decoupled look-back for large problem sizes.

Some initial benchmarks:
[Figure: scan_f32_228]

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@srinivasyadav18 srinivasyadav18 requested review from a team as code owners May 21, 2026 17:40
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 21, 2026
@srinivasyadav18 srinivasyadav18 requested a review from davebayer May 21, 2026 17:40
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 21, 2026

@bernhardmgruber bernhardmgruber left a comment

This is some impressive work! Because it touches a lot of the extremely performance sensitive pieces of warpspeed scan and lookback, we should split this work into small pieces and review them very carefully (including benchmarks and SASS diffs).

Here are some suggestions:

  • Allowing warpspeed scan to run on Hopper should be its own PR. That should just include the extension of the kernel by the atomic block index counter and the alternate path for the masked bulk copy store. That's a holistic changeset which can be merged easily with a benchmark on Hopper and a SASS diff for Blackwell.
  • Then the determinism extensions for warpspeed scan. The kernel extensions are less critical since they only affect scan, so we can bear better with the host code changes.
  • Then the determinism extensions to the lookback.

Comment on lines +623 to +626
warpspeed,
lookback_deterministic,
warpspeed_deterministic,
warpspeed_deterministic_atomic

Q: Do we have a reason to believe that the atomic version of the kernel could be faster than cluster launch control on Blackwell?

Remark: Furthermore, I don't think determinism should be part of any tuning information. It's a functional requirement and should be passed through the kernel's template arguments.
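
One way to picture this separation is the sketch below. All names in it (`scan_algorithm`, `determinism`, `tuning_policy`, `uses_fixed_reduction_order`) are hypothetical and are not CCCL identifiers; it only shows the requirement travelling as a template argument while the tuning policy stays purely about performance.

```cpp
// All names below are hypothetical and exist only for this illustration.

// Tuning choice: which implementation is expected to be fastest.
enum class scan_algorithm
{
  lookback,
  warpspeed
};

// Functional requirement requested by the caller.
enum class determinism
{
  not_guaranteed,
  run_to_run
};

// Tuning information: may differ per architecture without changing results.
struct tuning_policy
{
  scan_algorithm algorithm;
  int items_per_thread;
};

// The requirement is a template argument and does not depend on the policy.
template <determinism Requirement>
constexpr bool uses_fixed_reduction_order(const tuning_policy&)
{
  return Requirement == determinism::run_to_run;
}
```

With this shape, changing a tuning policy cannot silently switch determinism on or off, because the two are selected through different channels.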

Comment on lines +274 to +294
NV_IF_ELSE_TARGET(
NV_PROVIDES_SM_100,
({
if (::cuda::ptx::elect_sync(~0))
{
::cuda::ptx::cp_async_bulk_cp_mask(
::cuda::ptx::space_global,
::cuda::ptx::space_shared,
cpAsyncOobInfo.ptrGmemStartAlignDown,
srcSmem,
/*size*/ 16,
byteMaskStart);
}
}),
({
const int rank = squad.threadRank();
if (rank < 16 && ((byteMaskStart >> rank) & 1u))
{
reinterpret_cast<::cuda::std::byte*>(cpAsyncOobInfo.ptrGmemStartAlignDown)[rank] = srcSmem[rank];
}
}));

Important: This seems to repeat a few times, please create a dedicated function for it (can also be a lambda).
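
As a rough illustration of what such a helper could look like, here is a host-side emulation of the fallback branch only. The name `masked_store_16` and the `std::uint32_t` mask type are assumptions; on the device each loop iteration would correspond to one thread rank, and the SM100 path would call the bulk-copy instruction instead.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper name; host-side emulation of the per-thread fallback.
// Copies byte i from src to dst only when bit i of byte_mask is set, for i in [0, 16).
inline void masked_store_16(std::byte* dst, const std::byte* src, std::uint32_t byte_mask)
{
  for (int rank = 0; rank < 16; ++rank) // on the device, one thread per rank
  {
    if ((byte_mask >> rank) & 1u)
    {
      dst[rank] = src[rank];
    }
  }
}
```

Each call site would then reduce to a single call, with the architecture dispatch kept in one place.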

@@ -73,10 +72,10 @@ _CCCL_DEVICE_API void
storeTileAggregate(tile_state_t<AccumT>* ptrTileStates, scan_state scanState, AccumT sum, int index)
{
_CCCL_ASSERT(::cuda::is_aligned(ptrTileStates, alignof(tile_state_t<AccumT>)), "");
- _CCCL_ASSERT(index >= 0 && index < gridDim.x, "Reading out of bounds tile state");
+ _CCCL_ASSERT(index >= 0, "Negative tile state index");

Important: Why do we need to weaken the check for all architectures? Can't we keep it for the non-deterministic version on Blackwell?

Comment on lines +179 to 187
template <bool RunToRunDeterministic, int numTileStatesPerThread, typename AccumT, typename ScanOpT>
[[nodiscard]] _CCCL_DEVICE_API _CCCL_FORCEINLINE AccumT warpIncrementalLookback(
SpecialRegisters specialRegisters,
tile_state_t<AccumT>* ptrTileStates,
const int idxTilePrev,
const AccumT sumExclusiveCtaPrev,
int& idxTilePrev,
AccumT& sumExclusiveCtaPrev,
const int idxTileNext,
ScanOpT& scan_op)
{

Suggestion: Given the large branch on RunToRunDeterministic below and the different mechanics, I think we should try to just separate the implementation into two warpIncrementalLookback functions. The old one, and the deterministic one. I think this will reduce confusion.

@github-actions

😬 CI Workflow Results

🟥 Finished in 2h 00m: Pass: 24%/269 | Total: 3d 12h | Max: 1h 42m | Hits: 70%/59525

See results here.

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Productize run-to-run DeviceScan

2 participants