cuda::simd Integer dot product#9064
Conversation
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (20)
📝 Walkthroughsuggestion: WalkthroughThis PR integrates three new SIMD integer operations (idot, saturating_add, abs_diff) into libcudacxx with compile-time feature detection, low-level device intrinsics behind SM capability gates, high-level operation wrappers with fallback scalar paths, storage alignment updates, a fixed-size small-integral specialization, comprehensive functional tests, and expanded codegen validation. ChangesSIMD Intrinsics for Integer Operations
Suggested labels: Suggested reviewers:
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
libcudacxx/include/cuda/std/__simd/type_traits.h (1)
24-30:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd direct include for
<cuda/std/__algorithm/max.h>to this header. Line 44 uses::cuda::std::max, but this header is not included. This creates a transitive-include dependency, violating include self-sufficiency and risking breakage when include order changes.
🧹 Nitpick comments (3)
docs/libcudacxx/extended_api/simd/saturating_add.rst (1)
43-43: ⚡ Quick winsuggestion: Line 43 lists "VIADD.16x2" but the pattern suggests it should be "VIADD.U16x2" (unsigned 16-bit) to match the signed/unsigned pairs for 8-bit (VIADD.S8x4/VIADD.U8x4) and the signed 16-bit variant (VIADD.S16x2).
libcudacxx/test/simd_codegen/floating_point/plus_f32x2.cu (1)
25-26: ⚡ Quick winsuggestion: Add 1xx-family op checks in addition to SM100.
This currently validatesFADD2only forSM100; addingSM1XX(or explicitSM120f) keeps coverage aligned with the new 1xx arch support in the codegen infra.libcudacxx/test/simd_codegen/floating_point/unary_minus_f32x2.cu (1)
25-26: ⚡ Quick winsuggestion: Mirror 1xx-family checks here as well.
SM100-only op assertions leave newer 1xx targets under-checked; addingSM1XX(or explicitSM120f) patterns would keep this test robust across the enabled arch set.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1b5405d0-67de-4b60-9f88-207fad08e963
📒 Files selected for processing (51)
docs/libcudacxx/extended_api.rstdocs/libcudacxx/extended_api/simd.rstdocs/libcudacxx/extended_api/simd/abs_diff.rstdocs/libcudacxx/extended_api/simd/idot.rstdocs/libcudacxx/extended_api/simd/saturating_add.rstlibcudacxx/include/cuda/__simd/idot.hlibcudacxx/include/cuda/__simd/saturating_add.hlibcudacxx/include/cuda/__simd/simd_intrinsics.hlibcudacxx/include/cuda/__simd/simd_intrinsics_array.hlibcudacxx/include/cuda/__simd/vabsdiff.hlibcudacxx/include/cuda/simdlibcudacxx/include/cuda/std/__fwd/simd.hlibcudacxx/include/cuda/std/__internal/features.hlibcudacxx/include/cuda/std/__internal/namespaces.hlibcudacxx/include/cuda/std/__simd/basic_vec.hlibcudacxx/include/cuda/std/__simd/specializations/fixed_size_integral_vec.hlibcudacxx/include/cuda/std/__simd/specializations/fixed_size_storage.hlibcudacxx/include/cuda/std/__simd/specializations/simd_intrinsics.hlibcudacxx/include/cuda/std/__simd/specializations/simd_intrinsics_array.hlibcudacxx/include/cuda/std/__simd/type_traits.hlibcudacxx/test/libcudacxx/std/numerics/simd/simd.non_std/idot.pass.cpplibcudacxx/test/libcudacxx/std/numerics/simd/simd.non_std/saturation_add.pass.cpplibcudacxx/test/libcudacxx/std/numerics/simd/simd.non_std/vabsdiff.pass.cpplibcudacxx/test/libcudacxx/std/numerics/simd/simd.traits/alignment.pass.cpplibcudacxx/test/simd_codegen/CMakeLists.txtlibcudacxx/test/simd_codegen/floating_point/decrement_f32x2.culibcudacxx/test/simd_codegen/floating_point/fma_bf16.culibcudacxx/test/simd_codegen/floating_point/fma_f16.culibcudacxx/test/simd_codegen/floating_point/increment_f32x2.culibcudacxx/test/simd_codegen/floating_point/less_bf16.culibcudacxx/test/simd_codegen/floating_point/less_f16.culibcudacxx/test/simd_codegen/floating_point/minus_f32x2.culibcudacxx/test/simd_codegen/floating_point/multiplies_bf16.culibcudacxx/test/simd_codegen/floating_point/multiplies_f16.culibcudacxx/test/simd_codegen/floating_point/plus_bf16.culibcudacxx/test/simd_codegen/floating_point/plus_f16.culibcudacxx/test/simd_codegen/floating_point/plus_f32x2.culibcudacxx/test/simd_codegen/floating_point/unary_minus_f32x2.culibcudacxx/test/simd_codegen/fma_bf16.culibcudacxx/test/simd_codegen/fma_f16.culibcudacxx/test/simd_codegen/idot/idp2.culibcudacxx/test/simd_codegen/idot/idp4.culibcudacxx/test/simd_codegen/integer/arithmetic_u16x2.culibcudacxx/test/simd_codegen/integer/arithmetic_u8x4.culibcudacxx/test/simd_codegen/integer/bitwise_u16x2_u8x4.culibcudacxx/test/simd_codegen/minus_f32x2.culibcudacxx/test/simd_codegen/multiplies_bf16.culibcudacxx/test/simd_codegen/plus_bf16.culibcudacxx/test/simd_codegen/plus_f32x2.culibcudacxx/test/simd_codegen/saturation_add/saturating_add.culibcudacxx/test/simd_codegen/vabsdiff/vabsdiff.cu
💤 Files with no reviewable changes (6)
- libcudacxx/test/simd_codegen/multiplies_bf16.cu
- libcudacxx/test/simd_codegen/plus_bf16.cu
- libcudacxx/test/simd_codegen/minus_f32x2.cu
- libcudacxx/test/simd_codegen/fma_bf16.cu
- libcudacxx/test/simd_codegen/fma_f16.cu
- libcudacxx/test/simd_codegen/plus_f32x2.cu
There was a problem hiding this comment.
Actionable comments posted: 2
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e1bf2e48-37e6-4ae3-9c10-bf0d22c8b76e
📒 Files selected for processing (4)
libcudacxx/include/cuda/__simd/simd_intrinsics.hlibcudacxx/include/cuda/__simd/simd_intrinsics_array.hlibcudacxx/test/libcudacxx/std/numerics/simd/simd.non_std/idot.pass.cpplibcudacxx/test/simd_codegen/idot/idp2.cu
| template <typename _Tp, typename _Up, typename _AccumT, ::cuda::std::size_t _Np> | ||
| [[nodiscard]] _CCCL_DEVICE_API _AccumT __dp4a_8bit_x4( | ||
| const ::cuda::std::simd::__array_u32_t<_Np>& __lhs_u, | ||
| const ::cuda::std::simd::__array_u32_t<_Np>& __rhs_u, | ||
| const _AccumT __acc) noexcept | ||
| { | ||
| _AccumT __result = __acc; | ||
| _CCCL_PRAGMA_UNROLL_FULL() | ||
| for (::cuda::std::size_t __i = 0; __i < _Np; ++__i) | ||
| { | ||
| if constexpr (::cuda::std::is_unsigned_v<_Tp> && ::cuda::std::is_unsigned_v<_Up>) | ||
| { | ||
| __result = ::cuda::simd::__dp4a_u8x4_u8x4(__lhs_u[__i], __rhs_u[__i], __result); | ||
| } | ||
| else if constexpr (::cuda::std::is_unsigned_v<_Tp>) | ||
| { | ||
| __result = ::cuda::simd::__dp4a_u8x4_s8x4(__lhs_u[__i], __rhs_u[__i], __result); | ||
| } | ||
| else if constexpr (::cuda::std::is_unsigned_v<_Up>) | ||
| { | ||
| __result = ::cuda::simd::__dp4a_s8x4_u8x4(__lhs_u[__i], __rhs_u[__i], __result); | ||
| } | ||
| else | ||
| { | ||
| __result = ::cuda::simd::__dp4a_s8x4_s8x4(__lhs_u[__i], __rhs_u[__i], __result); | ||
| } | ||
| } | ||
| return __result; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -n 'long long|unsigned long long|__int128_t' libcudacxx/test/libcudacxx/std/numerics/simd/simd.non_std/idot.pass.cpp
python - <<'PY'
MASK = (1 << 32) - 1
acc = 1 << 40
chunk = 12345
current = ((acc & MASK) + chunk) & MASK
exact = acc + chunk
print(f"truncated helper pattern: {current}")
print(f"exact AccT accumulation: {exact}")
PYRepository: NVIDIA/cccl
Length of output: 474
🏁 Script executed:
# Find the simd_intrinsics.h file to check intrinsic signatures
find . -name "simd_intrinsics.h" -type fRepository: NVIDIA/cccl
Length of output: 177
🏁 Script executed:
# Check the intrinsic function signatures in simd_intrinsics.h
fd simd_intrinsics.h | head -1 | xargs cat | grep -A 10 "__dp4a_"Repository: NVIDIA/cccl
Length of output: 3096
🏁 Script executed:
# Also check the dp2a intrinsics
fd simd_intrinsics.h | head -1 | xargs cat | grep -A 10 "__dp2a_"Repository: NVIDIA/cccl
Length of output: 6501
🏁 Script executed:
cat -n libcudacxx/include/cuda/__simd/simd_intrinsics_array.h | sed -n '111,140p'Repository: NVIDIA/cccl
Length of output: 1291
🏁 Script executed:
cat -n libcudacxx/include/cuda/__simd/simd_intrinsics_array.h | sed -n '141,200p'Repository: NVIDIA/cccl
Length of output: 2279
important: wide accumulators are truncated to 32 bits every iteration.
At lines 123, 127, 131, 135 (and similarly at lines 156, 160, 167, 171, 178, 182, 189, 193), __dp4a_* and __dp2a_* intrinsics accept and return only 32-bit types, but _AccumT __result is passed directly without intermediate conversion. When _AccumT is wider than 32 bits (long long, unsigned long long, __int128_t), the implicit conversion to 32-bit truncates the upper bits before each intrinsic call, then the 32-bit result expands back with zero-filled upper bits. This breaks the idot contract: the running sum loses precision on every iteration once it exceeds 32 bits.
Use a 32-bit temporary for each intrinsic call, then add the widened result back into _AccumT __result instead of threading _AccumT through the instruction. Current test cases instantiate wide accumulators but use small values that fit within 32 bits, so the bug is not caught.
There was a problem hiding this comment.
That is actually a goood comment, I believe we should static assert that _AccumT is integral and has size 4
There was a problem hiding this comment.
suggestion: agreed. The static_assert belongs inside __dp4a_8bit_x4 and __dp2a_16bit_x2_8bit_x4 — not on the public idot signature — so the scalar fallback in idot.h can still legally accept wider accumulators (e.g., __int128). Only the intrinsic helpers need the constraint since the hardware only speaks 32-bit.
Concrete addition at the top of each helper:
static_assert(::cuda::std::is_integral_v<_AccumT> && sizeof(_AccumT) == 4,
"idot intrinsic path requires a 32-bit integral accumulator");Same assert in __dp2a_16bit_x2_8bit_x4. The public idot.h dispatch already checks sizes/signs before routing into these helpers, so in practice the assert should never fire — but it makes the contract explicit and catches future callers that bypass the routing logic.
🧠 Learnings used
Learnt from: sleeepyjack
Repo: NVIDIA/cccl PR: 8857
File: cudax/include/cuda/experimental/__cuco/__hyperloglog/default_policy.cuh:92-95
Timestamp: 2026-05-06T23:47:10.296Z
Learning: In this CCCL (CUDA C++ Core Libraries) codebase, do NOT treat `_CCCL_API` as host-only. `_CCCL_API` is defined in `libcudacxx/include/cuda/std/__cccl/visibility.h` and always expands to include `_CCCL_HOST_DEVICE` (possibly alongside `_CCCL_TILE`, `_CCCL_VISIBILITY_HIDDEN`, and `_CCCL_EXCLUDE_FROM_EXPLICIT_INSTANTIATION`). So any function annotated with `_CCCL_API` is already valid for both host and device code; reviewers should not flag it as host-only.
Learnt from: sleeepyjack
Repo: NVIDIA/cccl PR: 8857
File: cudax/include/cuda/experimental/__cuco/__hyperloglog/hyperloglog_impl.cuh:426-428
Timestamp: 2026-05-06T23:47:17.597Z
Learning: In the CCCL/CUDA C++ Core Libraries codebase (NVIDIA/cccl), treat `_CCCL_API` as host-and-device callable. `_CCCL_API` expands to `_CCCL_HOST_DEVICE` via `libcudacxx/include/cuda/std/__cccl/visibility.h`, so functions marked with `_CCCL_API` may be invoked from both host and device code. When reviewing device-code contexts, do not flag `_CCCL_API`-annotated functions as if they were host-only.
| /* | ||
|
|
||
| ; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_s16_s8.*}} | ||
| ; SMXX-DAG: {{.*IDP\.2A.*LO.*}} | ||
| ; SMXX-DAG: {{.*IDP\.2A.*HI.*}} | ||
|
|
||
| ; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_s8_s16.*}} | ||
| ; SMXX-DAG: {{.*IDP\.2A.*LO.*}} | ||
| ; SMXX-DAG: {{.*IDP\.2A.*HI.*}} | ||
|
|
||
| ; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_s16_u8.*}} | ||
| ; SMXX-DAG: {{.*IDP\.2A.*LO.*}} | ||
| ; SMXX-DAG: {{.*IDP\.2A.*HI.*}} | ||
|
|
||
| ; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_u8_s16.*}} | ||
| ; SMXX-DAG: {{.*IDP\.2A.*LO.*}} | ||
| ; SMXX-DAG: {{.*IDP\.2A.*HI.*}} | ||
|
|
||
| ; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_u16_s8.*}} | ||
| ; SMXX-DAG: {{.*IDP\.2A.*LO.*}} | ||
| ; SMXX-DAG: {{.*IDP\.2A.*HI.*}} | ||
|
|
||
| ; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_s8_u16.*}} | ||
| ; SMXX-DAG: {{.*IDP\.2A.*LO.*}} | ||
| ; SMXX-DAG: {{.*IDP\.2A.*HI.*}} | ||
|
|
||
| ; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_u16_u8.*}} | ||
| ; SMXX-DAG: {{.*IDP\.2A.*LO.*}} | ||
| ; SMXX-DAG: {{.*IDP\.2A.*HI.*}} | ||
|
|
||
| ; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_u8_u16.*}} | ||
| ; SMXX-DAG: {{.*IDP\.2A.*LO.*}} | ||
| ; SMXX-DAG: {{.*IDP\.2A.*HI.*}} | ||
|
|
||
| */ |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -n 'SMXX-(LABEL|DAG)' libcudacxx/test/simd_codegen/idot/idp2.cuRepository: NVIDIA/cccl
Length of output: 1151
important: Anchor codegen assertions to specific functions. Lines 63-93 use only SMXX-DAG matches, allowing function names and instruction patterns to float globally across the file. A broken mixed-type wrapper can pass if any other wrapper emits the required instruction. Use SMXX-LABEL to anchor each function name, with the instruction checks (IDP.2A.*LO and IDP.2A.*HI) immediately following under that anchor.
This comment has been minimized.
This comment has been minimized.
😬 CI Workflow Results🟥 Finished in 6h 56m: Pass: 92%/116 | Total: 5d 13h | Max: 5h 34m | Hits: 56%/2884979See results here. |
Description
Description
Introduce SIMD
idotto compute the dot product of two integer vectors.The operation maps to
IDP4/IDP2HW instructions.The PR includes the implementation, unit test, documentation, and codegen checks.
Use cases
Requires:
cuda::simdAddabs_diff#8994