Implement block TopK sieve and ranking API for handling multi-key workloads by pauleonix · Pull Request #9066 · NVIDIA/cccl

pauleonix · 2026-05-19T04:25:59Z

Description

During testing, I found and fixed a bug in the previous block TopK AIR implementation related to bit-twiddling inversion and 0.0/-0.0 handling.
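For context, radix-based selection on floating-point keys usually relies on an order-preserving bit transformation ("bit twiddling") and its inverse. The host-side sketch below illustrates that general technique only; it is not the code from this PR, and the helper names `twiddle_in`, `twiddle_out`, and `twiddle_in_zero_safe` are hypothetical. It shows why 0.0 and -0.0, which compare equal as floats, map to different unsigned keys unless they are canonicalized first.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: map a float to an unsigned key whose integer order
// matches the float order (for non-NaN inputs).
inline std::uint32_t twiddle_in(float f)
{
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  // Negative values: flip all bits. Non-negative values: flip only the sign bit.
  const std::uint32_t mask = (bits & 0x80000000u) ? 0xFFFFFFFFu : 0x80000000u;
  return bits ^ mask;
}

// Hypothetical helper: exact inverse of twiddle_in.
inline float twiddle_out(std::uint32_t key)
{
  // After twiddle_in, a set top bit marks an originally non-negative value.
  const std::uint32_t mask = (key & 0x80000000u) ? 0x80000000u : 0xFFFFFFFFu;
  const std::uint32_t bits = key ^ mask;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Hypothetical helper: canonicalize -0.0 to +0.0 before twiddling so that both
// zeros land in the same radix bucket. This is one possible policy, assumed
// here for illustration.
inline std::uint32_t twiddle_in_zero_safe(float f)
{
  return twiddle_in(f == 0.0f ? 0.0f : f);
}
```

With these helpers, `twiddle_in(-0.0f)` and `twiddle_in(0.0f)` differ even though `-0.0f == 0.0f`, so a selection that compares raw twiddled bits can treat the two zeros as distinct keys.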

TODO:

- Compare performance to old implementation (no pessimization of segmented device TopK)
- ~~Add simple optimization of integer scan+adjacent difference~~ (No perf improvement with current segmented topk tuning)

closes #8368

Checklist

- I am familiar with the Contributing Guidelines.
- New or existing tests cover these changes.
- The documentation is up to date with these changes. (non-public API)

Includes testing

copy-pr-bot · 2026-05-19T04:26:03Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

coderabbitai · 2026-05-19T04:34:59Z

📝 Walkthrough

Summary by CodeRabbit

Release Notes

Refactor
- Refactored block-level top-k selection algorithm into modular, reusable components for improved code maintainability and extensibility
Tests
- Added comprehensive test coverage for block-level top-k operations, including validation of edge cases, floating-point key handling, tie-breaking correctness across varied input distributions and memory layouts, and multi-key selection scenarios

suggestion:

Walkthrough

Adds per-item top-k state, a multi-pass radix sieve for iterative refinement, and an atomic rank finalizer; refactors block_topk_air to compose these components and updates TempStorage binding. New host/test utilities and parameterized Catch2 device tests validate correctness across modes and FP edge cases.

Changes

Block-level multi-key top-k via iterative refinement

| Layer / File(s) | Summary |
| --- | --- |
| Radix sieve implementation `cub/cub/block/specializations/block_topk_sieve_air.cuh` | Multi-pass radix histogram sieve with prefix widening, bucket selection, and FP -0.0 handling; exposes TempStorage and `refine_keys` API. |
| Per-item state and wrappers `cub/cub/block/block_topk_rank.cuh` | Adds `block_topk_key_states<ItemsPerThread>` and high-level wrappers: `block_topk_sieve` (select/refine) and `block_topk_rank` (rank wrapper). |
| Atomic rank finalization `cub/cub/block/specializations/block_topk_rank_atomic.cuh` | `block_topk_rank_atomic<BlockDimX>` assigns scatter ranks using shared atomic counters and updates per-item states. |
| block_topk_air integration `cub/cub/block/specializations/block_topk_air.cuh` | Refactors `select_topk` to use sieve/rank components, removes inline radix machinery, and updates TempStorage/template parameters. |
| block_topk TempStorage binding `cub/cub/block/block_topk.cuh` | Refactors `TempStorage` to `Uninitialized` wrapper and binds internal storage via `Alias()`; rewires max/min APIs to use the bound storage. |
| Test utilities and key generation `cub/test/catch2_test_block_topk_common.cuh` | Deterministic RNG, boundary-key multiset generator with controlled overhang and FP edge cases, Catch2 generators, `sorted_top_k`, `to_span`, and `bit_repr`. |
| Comprehensive device kernel tests `cub/test/catch2_test_block_topk_rank.cu` | Parameterized Catch2 tests covering randomized smoke, FP edge cases, full/partial tiles, multi-key tie-breaking, and bit-window split equivalence. |
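To make the "multi-pass radix histogram sieve" row more concrete, here is a sequential host-side sketch of the general radix-select idea. It is an illustration under stated assumptions, not the block-cooperative implementation in `block_topk_sieve_air.cuh`: the 8-bit digit width, the `item_state` enum, and the functions `sieve_pass_max` and `radix_select_max` are all hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-item state: undecided, in the top-k, or out of the top-k.
enum class item_state { candidate, selected, rejected };

// One sieve pass over the 8-bit digit starting at begin_bit (assumed digit width).
// Returns how many of the k slots are still open after this pass.
inline int sieve_pass_max(const std::vector<std::uint32_t>& keys,
                          std::vector<item_state>& states,
                          int k_remaining,
                          int begin_bit)
{
  // Histogram the current digit of all remaining candidates.
  std::array<int, 256> histogram{};
  for (std::size_t i = 0; i < keys.size(); ++i)
  {
    if (states[i] == item_state::candidate)
    {
      ++histogram[(keys[i] >> begin_bit) & 0xFFu];
    }
  }

  // Walk from the largest digit down; stop at the bucket holding the k-th largest key.
  int pivot          = 255;
  int selected_above = 0;
  while (pivot > 0 && selected_above + histogram[static_cast<std::size_t>(pivot)] < k_remaining)
  {
    selected_above += histogram[static_cast<std::size_t>(pivot)];
    --pivot;
  }

  // Buckets above the pivot are selected, buckets below are rejected,
  // and the pivot bucket stays undecided for the next pass.
  for (std::size_t i = 0; i < keys.size(); ++i)
  {
    if (states[i] != item_state::candidate)
    {
      continue;
    }
    const int digit = static_cast<int>((keys[i] >> begin_bit) & 0xFFu);
    if (digit > pivot)
    {
      states[i] = item_state::selected;
    }
    else if (digit < pivot)
    {
      states[i] = item_state::rejected;
    }
  }
  return k_remaining - selected_above;
}

// Run passes from the most significant digit to the least significant one.
inline std::vector<item_state> radix_select_max(const std::vector<std::uint32_t>& keys, int k)
{
  std::vector<item_state> states(keys.size(), item_state::candidate);
  int k_remaining = k;
  for (int begin_bit = 24; begin_bit >= 0 && k_remaining > 0; begin_bit -= 8)
  {
    k_remaining = sieve_pass_max(keys, states, k_remaining, begin_bit);
  }
  // Items that are still candidates tie on the full key: fill open slots in index order.
  for (item_state& s : states)
  {
    if (s == item_state::candidate)
    {
      s = (k_remaining > 0) ? item_state::selected : item_state::rejected;
      k_remaining -= (k_remaining > 0) ? 1 : 0;
    }
  }
  return states;
}
```

Breaking ties by index in the final loop is a simplification chosen for this sketch; a parallel implementation would need some other deterministic or atomic scheme for the tied items.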

Assessment against linked issues

| Objective | Addressed | Explanation |
| --- | --- | --- |
| Support multi-key top-k through consecutive refinement passes without keeping all key fragments in on-chip memory [`#8368`] | ✅ | |
| Enable interface accepting and returning per-item status object (selected vs candidate/ties) for flexible user-controlled refinement [`#8368`] | ✅ | |
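The two objectives above can be illustrated with a small host-side sketch of multi-key refinement. It is written under loud assumptions: `key_state`, `refine_max`, and `finalize_ranks` are hypothetical names, `std::nth_element` stands in for the radix sieve, and a `std::atomic<int>` stands in for the shared-memory counter. The point is only the data flow: a pass on the primary key leaves ties as candidates, a second pass on the secondary key refines just those candidates, and a final step hands out output slots.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical per-item status carried between refinement passes.
enum class key_state { candidate, selected, rejected };

// One refinement pass on a single key array (std::nth_element replaces the radix sieve).
// Precondition: 0 < k_remaining <= number of candidates.
// Returns the number of top-k slots still open after this pass.
inline int refine_max(const std::vector<int>& keys, std::vector<key_state>& states, int k_remaining)
{
  std::vector<int> candidates;
  for (std::size_t i = 0; i < keys.size(); ++i)
  {
    if (states[i] == key_state::candidate)
    {
      candidates.push_back(keys[i]);
    }
  }
  assert(k_remaining > 0 && k_remaining <= static_cast<int>(candidates.size()));

  // Find the k-th largest candidate key; it acts as the selection threshold.
  std::nth_element(
    candidates.begin(), candidates.begin() + (k_remaining - 1), candidates.end(), std::greater<int>{});
  const int threshold = candidates[static_cast<std::size_t>(k_remaining - 1)];

  int newly_selected = 0;
  for (std::size_t i = 0; i < keys.size(); ++i)
  {
    if (states[i] != key_state::candidate)
    {
      continue;
    }
    if (keys[i] > threshold)
    {
      states[i] = key_state::selected;
      ++newly_selected;
    }
    else if (keys[i] < threshold)
    {
      states[i] = key_state::rejected;
    }
    // keys[i] == threshold: stays a candidate (tie) for the next key.
  }
  return k_remaining - newly_selected;
}

// Hand out output slots with an atomic counter. Selected items go first;
// leftover candidates (ties on all keys) fill slots until k is reached.
// Returns the slot per item, or -1 for items outside the top-k.
inline std::vector<int> finalize_ranks(std::vector<key_state>& states, int k)
{
  std::atomic<int> next_slot{0};
  std::vector<int> ranks(states.size(), -1);
  for (std::size_t i = 0; i < states.size(); ++i)
  {
    if (states[i] == key_state::selected)
    {
      ranks[i] = next_slot.fetch_add(1);
    }
  }
  for (std::size_t i = 0; i < states.size(); ++i)
  {
    if (states[i] == key_state::candidate)
    {
      const int slot = next_slot.fetch_add(1);
      states[i]      = (slot < k) ? key_state::selected : key_state::rejected;
      ranks[i]       = (slot < k) ? slot : -1;
    }
  }
  return ranks;
}
```

A caller would start with every item as a candidate, call `refine_max` once per key array while passing along the returned remaining count, and then call `finalize_ranks`. Only the status vector and the current key array are needed in each pass, which is the property the first objective describes.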

Suggested labels

libcu++

Suggested reviewers

elstehle

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.

github-actions · 2026-05-21T02:52:34Z

🥳 CI Workflow Results

🟩 Finished in 1h 11m: Pass: 100%/283 | Total: 2d 20h | Max: 45m 41s | Hits: 88%/219289

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

cub/test/catch2_test_block_topk_rank.cu (2)

75-84: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

critical: device kernel contains lambda expression

The coding guidelines explicitly forbid lambda expressions in device-only or host-device code. Refactor using if constexpr directly:

- auto states = [&] {
-   if constexpr (SelectMax)
-   {
-     return sieve.template select_max<IsFullTile, BlockedInput>(keys, k, static_cast<int>(g_in.size()));
-   }
-   else
-   {
-     return sieve.template select_min<IsFullTile, BlockedInput>(keys, k, static_cast<int>(g_in.size()));
-   }
- }();
+ decltype(sieve.template select_max<IsFullTile, BlockedInput>(keys, k, static_cast<int>(g_in.size()))) states;
+ if constexpr (SelectMax)
+ {
+   states = sieve.template select_max<IsFullTile, BlockedInput>(keys, k, static_cast<int>(g_in.size()));
+ }
+ else
+ {
+   states = sieve.template select_min<IsFullTile, BlockedInput>(keys, k, static_cast<int>(g_in.size()));
+ }

As per coding guidelines: "Never allow lambda expressions in device-only or host-device code".

162-173: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

critical: device kernel contains lambda expression

Same violation as in single_key_kernel. Apply the identical refactoring:

- auto states = [&] {
-   if constexpr (SelectMax)
-   {
-     return primary_sieve.template select_max<IsFullTile, blocked_input>(
-       primary_keys, k, static_cast<int>(g_primary.size()));
-   }
-   else
-   {
-     return primary_sieve.template select_min<IsFullTile, blocked_input>(
-       primary_keys, k, static_cast<int>(g_primary.size()));
-   }
- }();
+ decltype(primary_sieve.template select_max<IsFullTile, blocked_input>(primary_keys, k, static_cast<int>(g_primary.size()))) states;
+ if constexpr (SelectMax)
+ {
+   states = primary_sieve.template select_max<IsFullTile, blocked_input>(primary_keys, k, static_cast<int>(g_primary.size()));
+ }
+ else
+ {
+   states = primary_sieve.template select_min<IsFullTile, blocked_input>(primary_keys, k, static_cast<int>(g_primary.size()));
+ }

As per coding guidelines: "Never allow lambda expressions in device-only or host-device code".

🧹 Nitpick comments (3)

cub/test/catch2_test_block_topk_rank.cu (3)

224-224: 💤 Low value

suggestion: use static_cast for consistency

The int{sizeof(KeyT) * 8} cast is valid but static_cast<int>(sizeof(KeyT) * 8) is more common in the codebase and equally clear:

- static_assert(int{sizeof(KeyT) * 8} % WindowBits == 0, "test currently requires window-aligned key width");
+ static_assert(static_cast<int>(sizeof(KeyT) * 8) % WindowBits == 0, "test currently requires window-aligned key width");

241-241: 💤 Low value

suggestion: use static_cast for consistency

Same as line 224, prefer static_cast<int>(sizeof(KeyT) * 8) for consistency:

- int hi = int{sizeof(KeyT) * 8};
+ int hi = static_cast<int>(sizeof(KeyT) * 8);

309-309: ⚖️ Poor tradeoff

suggestion: consider testing k=0 edge case

The REQUIRE explicitly excludes k=0. If the API should support k=0 (returning empty output), add a test case verifying no writes occur. If k=0 is invalid input, document or assert that constraint in the implementation.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3c086ff4-c3df-4a37-8169-2645a8f0c345

📥 Commits

Reviewing files that changed from the base of the PR and between de41417 and 09bd804.

📒 Files selected for processing (2)

cub/cub/block/specializations/block_topk_air.cuh
cub/test/catch2_test_block_topk_rank.cu

pauleonix added 2 commits May 19, 2026 06:20

Implement block topk sieve and ranking

171b1cd

Includes testing

Use sieve/ranking in high level block topk air

f2f2b6d

pauleonix self-assigned this May 19, 2026

github-project-automation Bot added this to CCCL May 19, 2026

github-project-automation Bot moved this to Todo in CCCL May 19, 2026

cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL May 19, 2026

pauleonix requested a review from elstehle May 19, 2026 04:26

Fix GCC7

f590eed

Clean up

cfe022a

- Implement CodeRabbit Feedback
- Make sure tests do not trigger assertions
- Use non-specialized sieve/rank API in block_topk_air but check chosen specialization via TempStorage.

pauleonix marked this pull request as ready for review May 20, 2026 02:50

pauleonix requested a review from a team as a code owner May 20, 2026 02:50

cccl-authenticator-app Bot moved this from In Progress to In Review in CCCL May 20, 2026

Extend testing

fd60471

As suggested by CodeRabbit nitpick.

Implement feedback

de41417

Fixing more nits

e3165a1

Add back early exit for k > valid_items

09bd804

coderabbitai Bot reviewed May 22, 2026
