Implement block TopK sieve and ranking API for handling multi-key workloads #9066

Open
pauleonix wants to merge 8 commits into NVIDIA:main from pauleonix:blockTopKMultiKey

Conversation

Contributor

@pauleonix pauleonix commented May 19, 2026

Description

During testing, I found and fixed a bug in the previous block TopK AIR implementation regarding the bit-twiddling inversion and 0.0/-0.0 handling.
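The description does not show the fix itself. As background, the following is a minimal host-side sketch of the usual radix "bit-twiddling" for float keys, assuming 32-bit IEEE-754 floats. The helper names are hypothetical and this is not the code from the PR; it only illustrates why 0.0 and -0.0 need care, since they compare equal but have different bit patterns.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit floats");

// Hypothetical helper: map a float to an unsigned key whose integer order
// matches the float order for all non-NaN values.
inline std::uint32_t to_ordered_bits(float value)
{
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  // Canonicalize -0.0 (0x80000000) to +0.0 so both zeros get the same key.
  if (bits == 0x80000000u)
  {
    bits = 0u;
  }
  // Negative values: flip all bits. Non-negative values: set the sign bit.
  const std::uint32_t mask = (bits & 0x80000000u) ? 0xFFFFFFFFu : 0x80000000u;
  return bits ^ mask;
}

// Hypothetical helper: invert the ordered key so that picking the smallest
// keys picks the largest float values.
inline std::uint32_t to_descending_bits(float value)
{
  return ~to_ordered_bits(value);
}

// Small usage check of the properties the sketch relies on.
inline void check_ordering()
{
  assert(to_ordered_bits(-0.0f) == to_ordered_bits(0.0f));
  assert(to_ordered_bits(-1.0f) < to_ordered_bits(0.0f));
  assert(to_descending_bits(2.0f) < to_descending_bits(1.0f));
}
```

Without the canonicalization step, -0.0 and 0.0 would land in adjacent but distinct keys, so a bit-wise selection could split values that compare equal.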

TODO:

  • Compare performance to old implementation (no pessimization of segmented device TopK)
  • Add simple optimization of integer scan+adjacent difference (No perf improvement with current segmented topk tuning)

closes #8368

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes. (non-public API)

@pauleonix pauleonix self-assigned this May 19, 2026
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 19, 2026
Contributor

copy-pr-bot Bot commented May 19, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL May 19, 2026
@pauleonix pauleonix requested a review from elstehle May 19, 2026 04:26
Contributor

coderabbitai Bot commented May 19, 2026

📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • Refactor

    • Refactored block-level top-k selection algorithm into modular, reusable components for improved code maintainability and extensibility
  • Tests

    • Added comprehensive test coverage for block-level top-k operations, including validation of edge cases, floating-point key handling, tie-breaking correctness across varied input distributions and memory layouts, and multi-key selection scenarios

Walkthrough

Adds per-item top-k state, a multi-pass radix sieve for iterative refinement, and an atomic rank finalizer; refactors block_topk_air to compose these components and updates TempStorage binding. New host/test utilities and parameterized Catch2 device tests validate correctness across modes and FP edge cases.
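To make the "multi-pass radix sieve" idea concrete, here is a sequential host-side sketch of the general technique: each pass builds a histogram over one digit of the keys that still match the prefix fixed so far, then picks the bucket containing the k-th largest key. The struct and function names are hypothetical, the 8-bit digit width is an assumption, and none of this is taken from block_topk_sieve_air.cuh.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical result type for this sketch.
struct sieve_result
{
  std::uint32_t threshold; // key value of the k-th largest element
  int num_above;           // number of keys strictly greater than threshold
};

// Find the k-th largest key by fixing 8 bits per pass, most significant first.
inline sieve_result radix_sieve_max(const std::vector<std::uint32_t>& keys, int k)
{
  assert(k > 0 && k <= static_cast<int>(keys.size()));
  constexpr int bits_per_pass = 8;
  std::uint32_t prefix      = 0; // digits fixed by earlier passes
  std::uint32_t prefix_mask = 0; // which bits of the key are already fixed
  int remaining             = k; // how many keys are still needed from the current bucket
  int num_above             = 0;

  for (int shift = 32 - bits_per_pass; shift >= 0; shift -= bits_per_pass)
  {
    // Histogram of the current digit over keys that match the prefix.
    std::array<int, 256> histogram{};
    for (std::uint32_t key : keys)
    {
      if ((key & prefix_mask) == prefix)
      {
        ++histogram[(key >> shift) & 0xFFu];
      }
    }
    // Walk buckets from the largest digit down until the k-th key is inside.
    int digit = 255;
    while (histogram[digit] < remaining)
    {
      remaining -= histogram[digit];
      num_above += histogram[digit];
      --digit;
    }
    prefix |= static_cast<std::uint32_t>(digit) << shift;
    prefix_mask |= 0xFFu << shift;
  }
  return {prefix, num_above};
}
```

Keys above `threshold` are definitely in the top k, while keys equal to it are the ties that a later ranking or refinement step has to resolve.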

Changes

Block-level multi-key top-k via iterative refinement

| Layer / File(s) | Summary |
|---|---|
| Radix sieve implementation (cub/cub/block/specializations/block_topk_sieve_air.cuh) | Multi-pass radix histogram sieve with prefix widening, bucket selection, and FP -0.0 handling; exposes TempStorage and refine_keys API. |
| Per-item state and wrappers (cub/cub/block/block_topk_rank.cuh) | Adds block_topk_key_states<ItemsPerThread> and high-level wrappers: block_topk_sieve (select/refine) and block_topk_rank (rank wrapper). |
| Atomic rank finalization (cub/cub/block/specializations/block_topk_rank_atomic.cuh) | block_topk_rank_atomic<BlockDimX> assigns scatter ranks using shared atomic counters and updates per-item states. |
| block_topk_air integration (cub/cub/block/specializations/block_topk_air.cuh) | Refactors select_topk to use sieve/rank components, removes inline radix machinery, and updates TempStorage/template parameters. |
| block_topk TempStorage binding (cub/cub/block/block_topk.cuh) | Refactors TempStorage to Uninitialized wrapper and binds internal storage via Alias(); rewires max/min APIs to use the bound storage. |
| Test utilities and key generation (cub/test/catch2_test_block_topk_common.cuh) | Deterministic RNG, boundary-key multiset generator with controlled overhang and FP edge cases, Catch2 generators, sorted_top_k, to_span, and bit_repr. |
| Comprehensive device kernel tests (cub/test/catch2_test_block_topk_rank.cu) | Parameterized Catch2 tests covering randomized smoke, FP edge cases, full/partial tiles, multi-key tie-breaking, and bit-window split equivalence. |
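The "atomic rank finalization" row can be pictured with a small host-side sketch. It assumes three per-item states and two counters: already-selected items take the first output slots, and tied candidates compete for the remaining slots below k. The enum, struct, and function names are hypothetical, and `std::atomic` stands in for shared-memory atomics; the actual block_topk_rank_atomic implementation may differ.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical per-item state for this sketch.
enum class item_state { rejected, candidate, selected };

struct rank_result
{
  std::vector<int> ranks;         // output slot per item, -1 if not in the top k
  std::vector<item_state> states; // per-item states after finalization
};

inline rank_result assign_ranks(std::vector<item_state> states, int k)
{
  int num_selected = 0;
  for (item_state s : states)
  {
    num_selected += (s == item_state::selected) ? 1 : 0;
  }
  assert(num_selected <= k);

  // Selected items fill slots [0, num_selected); candidates start after them.
  std::atomic<int> selected_counter{0};
  std::atomic<int> candidate_counter{num_selected};

  std::vector<int> ranks(states.size(), -1);
  for (std::size_t i = 0; i < states.size(); ++i)
  {
    if (states[i] == item_state::selected)
    {
      ranks[i] = selected_counter.fetch_add(1);
    }
    else if (states[i] == item_state::candidate)
    {
      // A candidate wins a slot only while slots below k remain.
      const int slot = candidate_counter.fetch_add(1);
      if (slot < k)
      {
        ranks[i]  = slot;
        states[i] = item_state::selected;
      }
      else
      {
        states[i] = item_state::rejected;
      }
    }
  }
  return {std::move(ranks), std::move(states)};
}
```

In this sequential loop the first candidates in index order win. With real concurrent atomics, which tied items win a slot depends on execution order.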

Assessment against linked issues

Objective Addressed Explanation
Support multi-key top-k through consecutive refinement passes without keeping all key fragments in on-chip memory [#8368]
Enable interface accepting and returning per-item status object (selected vs candidate/ties) for flexible user-controlled refinement [#8368]
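The two objectives above can be illustrated with a host-side sketch of multi-key selection by consecutive refinement: a per-item state carries the result of the primary-key pass, and only the items still tied are refined with the secondary key, so the two key arrays never need to be combined. All names are hypothetical, and `std::nth_element` is used here in place of a radix sieve to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical per-item status: decided in, decided out, or still tied.
enum class key_state { rejected, candidate, selected };

// One refinement pass: resolve current candidates using `keys` (larger is better).
inline void refine_max(std::vector<key_state>& states, const std::vector<std::uint32_t>& keys, int k)
{
  assert(states.size() == keys.size());
  int num_selected = 0;
  std::vector<std::uint32_t> candidates;
  for (std::size_t i = 0; i < states.size(); ++i)
  {
    if (states[i] == key_state::selected) ++num_selected;
    if (states[i] == key_state::candidate) candidates.push_back(keys[i]);
  }
  const int needed = k - num_selected;
  if (needed <= 0 || needed >= static_cast<int>(candidates.size()))
  {
    // Nothing more is needed, or every candidate fits: resolve all at once.
    const key_state resolved = (needed <= 0) ? key_state::rejected : key_state::selected;
    for (key_state& s : states)
    {
      if (s == key_state::candidate) s = resolved;
    }
    return;
  }
  // Threshold is the needed-th largest candidate key.
  std::nth_element(candidates.begin(), candidates.begin() + (needed - 1), candidates.end(),
                   std::greater<std::uint32_t>{});
  const std::uint32_t threshold = candidates[needed - 1];
  for (std::size_t i = 0; i < states.size(); ++i)
  {
    if (states[i] != key_state::candidate) continue;
    if (keys[i] > threshold) states[i] = key_state::selected;
    else if (keys[i] < threshold) states[i] = key_state::rejected;
    // keys[i] == threshold: stays a candidate for the next pass or the ranking step.
  }
}

// Two-key selection: the secondary key only breaks ties left by the primary key.
inline std::vector<key_state> select_max_two_keys(
  const std::vector<std::uint32_t>& primary, const std::vector<std::uint32_t>& secondary, int k)
{
  std::vector<key_state> states(primary.size(), key_state::candidate);
  refine_max(states, primary, k);
  refine_max(states, secondary, k);
  return states;
}
```

Items that remain `candidate` after the last pass are those at the final threshold; in this sketch a separate ranking step would decide how many of them fill the remaining output slots.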

Suggested labels

libcu++

Suggested reviewers

  • elstehle

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.


- Implement CodeRabbit Feedback
- Make sure tests do not trigger assertions
- Use non-specialized sieve/rank API in block_topk_air but check chosen
  specialization via TempStorage.
@pauleonix pauleonix marked this pull request as ready for review May 20, 2026 02:50
@pauleonix pauleonix requested a review from a team as a code owner May 20, 2026 02:50
@cccl-authenticator-app cccl-authenticator-app Bot moved this from In Progress to In Review in CCCL May 20, 2026
As suggested by CodeRabbit nitpick.
@github-actions
Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 11m: Pass: 100%/283 | Total: 2d 20h | Max: 45m 41s | Hits: 88%/219289

See results here.

Contributor

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cub/test/catch2_test_block_topk_rank.cu (2)

75-84: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

critical: device kernel contains a lambda expression

The coding guidelines explicitly forbid lambda expressions in device-only or host-device code. Refactor to use if constexpr directly:

- auto states = [&] {
-   if constexpr (SelectMax)
-   {
-     return sieve.template select_max<IsFullTile, BlockedInput>(keys, k, static_cast<int>(g_in.size()));
-   }
-   else
-   {
-     return sieve.template select_min<IsFullTile, BlockedInput>(keys, k, static_cast<int>(g_in.size()));
-   }
- }();
+ decltype(sieve.template select_max<IsFullTile, BlockedInput>(keys, k, static_cast<int>(g_in.size()))) states;
+ if constexpr (SelectMax)
+ {
+   states = sieve.template select_max<IsFullTile, BlockedInput>(keys, k, static_cast<int>(g_in.size()));
+ }
+ else
+ {
+   states = sieve.template select_min<IsFullTile, BlockedInput>(keys, k, static_cast<int>(g_in.size()));
+ }

As per coding guidelines: "Never allow lambda expressions in device-only or host-device code".


162-173: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

critical: device kernel contains lambda expression

Same violation as in single_key_kernel. Apply the same refactoring:

- auto states = [&] {
-   if constexpr (SelectMax)
-   {
-     return primary_sieve.template select_max<IsFullTile, blocked_input>(
-       primary_keys, k, static_cast<int>(g_primary.size()));
-   }
-   else
-   {
-     return primary_sieve.template select_min<IsFullTile, blocked_input>(
-       primary_keys, k, static_cast<int>(g_primary.size()));
-   }
- }();
+ decltype(primary_sieve.template select_max<IsFullTile, blocked_input>(primary_keys, k, static_cast<int>(g_primary.size()))) states;
+ if constexpr (SelectMax)
+ {
+   states = primary_sieve.template select_max<IsFullTile, blocked_input>(primary_keys, k, static_cast<int>(g_primary.size()));
+ }
+ else
+ {
+   states = primary_sieve.template select_min<IsFullTile, blocked_input>(primary_keys, k, static_cast<int>(g_primary.size()));
+ }

As per coding guidelines: "Never allow lambda expressions in device-only or host-device code".
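The decltype-based suggestion above works only if the state type is default-constructible and assignable. One alternative that avoids both the lambda and that requirement is a named helper function template that dispatches with if constexpr. The sketch below is host-only and uses a toy stand-in type; `toy_sieve` and `dispatch_select` are hypothetical names, and real device code would additionally need the appropriate `__device__` annotations.

```cpp
#include <cassert>

// Hypothetical stand-in for the sieve type used in the test kernels; only the
// member names select_max / select_min mirror the snippets above.
struct toy_sieve
{
  template <bool IsFullTile>
  int select_max(int a, int b) const
  {
    return a > b ? a : b;
  }

  template <bool IsFullTile>
  int select_min(int a, int b) const
  {
    return a < b ? a : b;
  }
};

// Lambda-free dispatch: the return type is deduced from the taken branch, so
// the caller can initialize its result directly instead of assigning later.
template <bool SelectMax, bool IsFullTile, class SieveT>
auto dispatch_select(const SieveT& sieve, int a, int b)
{
  if constexpr (SelectMax)
  {
    return sieve.template select_max<IsFullTile>(a, b);
  }
  else
  {
    return sieve.template select_min<IsFullTile>(a, b);
  }
}

// Usage: `const auto states = dispatch_select<...>(...)` replaces the lambda.
inline void example()
{
  const auto hi = dispatch_select<true, true>(toy_sieve{}, 3, 7);
  const auto lo = dispatch_select<false, true>(toy_sieve{}, 3, 7);
  assert(hi == 7 && lo == 3);
}
```

This keeps the result `const` at the call site, which the assign-after-declare form cannot do.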

🧹 Nitpick comments (3)
cub/test/catch2_test_block_topk_rank.cu (3)

224-224: 💤 Low value

suggestion: use static_cast for consistency

The int{sizeof(KeyT) * 8} conversion is valid, but static_cast<int>(sizeof(KeyT) * 8) is more common in the codebase and equally clear:

- static_assert(int{sizeof(KeyT) * 8} % WindowBits == 0, "test currently requires window-aligned key width");
+ static_assert(static_cast<int>(sizeof(KeyT) * 8) % WindowBits == 0, "test currently requires window-aligned key width");

241-241: 💤 Low value

suggestion: use static_cast for consistency

Same as line 224, prefer static_cast<int>(sizeof(KeyT) * 8) for consistency:

- int hi = int{sizeof(KeyT) * 8};
+ int hi = static_cast<int>(sizeof(KeyT) * 8);

309-309: ⚖️ Poor tradeoff

suggestion: consider testing k=0 edge case

The REQUIRE explicitly excludes k=0. If the API should support k=0 (returning empty output), add a test case verifying no writes occur. If k=0 is invalid input, document or assert that constraint in the implementation.


📥 Commits

Reviewing files that changed from the base of the PR and between de41417 and 09bd804.

📒 Files selected for processing (2)
  • cub/cub/block/specializations/block_topk_air.cuh
  • cub/test/catch2_test_block_topk_rank.cu

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Add support for multi-key Top-K in BlockTopK

1 participant