Skip to content

cuda::simd Integer dot product#9064

Open
fbusato wants to merge 4 commits into
NVIDIA:mainfrom
fbusato:simd-cuda-idot
Open

cuda::simd Integer dot product#9064
fbusato wants to merge 4 commits into
NVIDIA:mainfrom
fbusato:simd-cuda-idot

Conversation

@fbusato
Copy link
Copy Markdown
Contributor

@fbusato fbusato commented May 18, 2026

Description

Description

Introduce SIMD idot to compute the dot product of two integer vectors.
The operation maps to IDP4/IDP2 HW instructions.
The PR includes the implementation, unit test, documentation, and codegen checks.

Use cases

  • INT8 quantized inference
  • Mixed-precision INT16 × INT8 GEMM/conv
  • Image processing (convolution, sum of squared differences, histogram).
  • Checksum algorithms (e.g. CRC)

Requires:

@fbusato fbusato self-assigned this May 18, 2026
@fbusato fbusato requested review from a team as code owners May 18, 2026 23:32
@fbusato fbusato added the libcu++ For all items related to libcu++ label May 18, 2026
@fbusato fbusato added this to CCCL May 18, 2026
@fbusato fbusato requested review from alliepiper and gonidelis May 18, 2026 23:32
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 18, 2026
@fbusato fbusato requested a review from ericniebler May 18, 2026 23:32
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 18, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 18, 2026

Review Change Stack

Warning

Rate limit exceeded

@fbusato has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 5 minutes and 45 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bd55219c-1d37-42ff-8555-a55538715033

📥 Commits

Reviewing files that changed from the base of the PR and between f08a379 and 4e548e6.

📒 Files selected for processing (20)
  • docs/libcudacxx/extended_api.rst
  • docs/libcudacxx/extended_api/simd.rst
  • docs/libcudacxx/extended_api/simd/abs_diff.rst
  • docs/libcudacxx/extended_api/simd/idot.rst
  • docs/libcudacxx/extended_api/simd/saturating_add.rst
  • libcudacxx/include/cuda/__simd/idot.h
  • libcudacxx/include/cuda/__simd/saturating_add.h
  • libcudacxx/include/cuda/__simd/simd_intrinsics.h
  • libcudacxx/include/cuda/__simd/simd_intrinsics_array.h
  • libcudacxx/include/cuda/__simd/vabsdiff.h
  • libcudacxx/include/cuda/simd
  • libcudacxx/include/cuda/std/__fwd/simd.h
  • libcudacxx/include/cuda/std/__internal/features.h
  • libcudacxx/include/cuda/std/__internal/namespaces.h
  • libcudacxx/include/cuda/std/__simd/basic_vec.h
  • libcudacxx/include/cuda/std/__simd/specializations/fixed_size_integral_vec.h
  • libcudacxx/include/cuda/std/__simd/specializations/fixed_size_storage.h
  • libcudacxx/include/cuda/std/__simd/specializations/simd_intrinsics.h
  • libcudacxx/include/cuda/std/__simd/specializations/simd_intrinsics_array.h
  • libcudacxx/include/cuda/std/__simd/type_traits.h
📝 Walkthrough

suggestion:

Walkthrough

This PR integrates three new SIMD integer operations (idot, saturating_add, abs_diff) into libcudacxx with compile-time feature detection, low-level device intrinsics behind SM capability gates, high-level operation wrappers with fallback scalar paths, storage alignment updates, a fixed-size small-integral specialization, comprehensive functional tests, and expanded codegen validation.

Changes

SIMD Intrinsics for Integer Operations

Layer / File(s) Summary
Feature detection, namespaces, and forward declarations
libcudacxx/include/cuda/std/__internal/features.h, libcudacxx/include/cuda/std/__internal/namespaces.h, libcudacxx/include/cuda/std/__fwd/simd.h
Adds feature macros _CCCL_HAS_SIMD_8BIT, _CCCL_HAS_SIMD_SAT, _CCCL_HAS_SIMD_VABSDIFF, _CCCL_HAS_SIMD_IDOT with PTX ISA thresholds and tile-compilation exclusions; introduces _CCCL_BEGIN_NAMESPACE_CUDA_SIMD macros; adds __fixed_size_integral to __simd_operations_kind enum.
Low-level hardware intrinsics
libcudacxx/include/cuda/__simd/simd_intrinsics.h, libcudacxx/include/cuda/simd
Defines device intrinsics __vadd_sat_*, __vabsdiff_*, and __dp4a_*/__dp2a_* with NV_IF_TARGET SM capability gating and inline PTX/builtin dispatch; includes header wiring for public SIMD namespace.
Array-based intrinsic wrappers and storage conversion
libcudacxx/include/cuda/__simd/simd_intrinsics_array.h, libcudacxx/include/cuda/std/__simd/specializations/simd_intrinsics_array.h
Device-only template helpers converting SIMD storage to uint32_t-backed arrays, plus __vadd_sat_*bit_*, __vabsdiff_*bit_*, __dp4a_*bit_*, __dp2a_*bit_* wrappers with if constexpr signedness dispatch and unrolled per-element loops.
Public SIMD operations (idot, saturating_add, abs_diff)
libcudacxx/include/cuda/__simd/idot.h, libcudacxx/include/cuda/__simd/saturating_add.h, libcudacxx/include/cuda/__simd/vabsdiff.h
Templates conditionally selecting intrinsic paths (when _CCCL_HAS_SIMD_*()) with fallback scalar/per-element implementations; constexpr-compatible, noexcept, nodiscard; integer type constraints on operands/accumulators.
Storage alignment and type traits
libcudacxx/include/cuda/std/__simd/specializations/fixed_size_storage.h, libcudacxx/include/cuda/std/__simd/type_traits.h, libcudacxx/include/cuda/std/__simd/basic_vec.h
Updates fixed-size storage alignment to max(alignof(T), 8); replaces conditional pointer alignment logic with __optimal_cuda_alignment (32 vs 16 based on CTK version); exposes storage declarations in basic_vec by removing private:/redundant public: markers; includes fixed_size_integral_vec.h.
Fixed-size small-integral SIMD specialization
libcudacxx/include/cuda/std/__simd/specializations/fixed_size_integral_vec.h
Specializes __simd_operations for sizeof(T) < sizeof(uint32_t) with uint32_t-based storage; implements bitwise ops (AND/OR/XOR/NOT) via unrolled per-lane loops and arithmetic ops (add/subtract/inc/dec/unary-minus) with optional SM90/SM120f intrinsic dispatch.
Functional tests (idot, saturating_add, abs_diff)
libcudacxx/test/libcudacxx/std/numerics/simd/simd.non_std/*.pass.cpp
SFINAE-based compile-time enablement checks, scalar reference implementations for correctness validation, array-based test vectors with edge cases (min/max), signed/unsigned/int128 coverage, static_assert compatibility.
Alignment test suite updates
libcudacxx/test/libcudacxx/std/numerics/simd/simd.traits/alignment.pass.cpp
Refactors alignment validation to use platform-dependent optimal_cuda_alignment constant instead of size-based formula.
CMake codegen infrastructure
libcudacxx/test/simd_codegen/CMakeLists.txt
Detects Clang compiler and skips codegen tests; switches file discovery to targeted subdirectories; adds SM120f arch support with conditional compiler versions; introduces simd_codegen_get_check_prefixes and simd_codegen_set_cuda_arch helpers for arch-specific FileCheck and compilation settings.
Floating-point codegen tests (refactor)
libcudacxx/test/simd_codegen/floating_point/*.cu
Consolidates global kernels into device-only functions (F16/BF16 FMA, inc/dec, unary-minus, comparisons, multiply); updates SMXX labels and FileCheck directives; removes old global-kernel test files.
Integer and integer-dot-product codegen tests
libcudacxx/test/simd_codegen/{idot,saturation_add,vabsdiff,integer}/*.cu
Adds codegen validation for idot (DP4A/DP2A), saturating_add (VIADD variants), abs_diff (VABSDIFF/VIMNMX), and small-integer arithmetic/bitwise operations with per-function SMXX pattern matching.
Public API documentation
docs/libcudacxx/extended_api/simd.rst, docs/libcudacxx/extended_api/simd/{abs_diff,idot,saturating_add}.rst
Documents signatures, behavioral semantics, constraints, hardware performance notes (SM capability targeting), and runnable examples for all three new operations; adds toctree entry to extended API index.

Suggested labels: cudax

Suggested reviewers:

  • alliepiper
  • ericniebler
  • gonidelis

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
libcudacxx/include/cuda/std/__simd/type_traits.h (1)

24-30: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add direct include for <cuda/std/__algorithm/max.h> to this header. Line 44 uses ::cuda::std::max, but this header is not included. This creates a transitive-include dependency, violating include self-sufficiency and risking breakage when include order changes.

🧹 Nitpick comments (3)
docs/libcudacxx/extended_api/simd/saturating_add.rst (1)

43-43: ⚡ Quick win

suggestion: Line 43 lists "VIADD.16x2" but the pattern suggests it should be "VIADD.U16x2" (unsigned 16-bit) to match the signed/unsigned pairs for 8-bit (VIADD.S8x4/VIADD.U8x4) and the signed 16-bit variant (VIADD.S16x2).

libcudacxx/test/simd_codegen/floating_point/plus_f32x2.cu (1)

25-26: ⚡ Quick win

suggestion: Add 1xx-family op checks in addition to SM100.
This currently validates FADD2 only for SM100; adding SM1XX (or explicit SM120f) keeps coverage aligned with the new 1xx arch support in the codegen infra.

libcudacxx/test/simd_codegen/floating_point/unary_minus_f32x2.cu (1)

25-26: ⚡ Quick win

suggestion: Mirror 1xx-family checks here as well.
SM100-only op assertions leave newer 1xx targets under-checked; adding SM1XX (or explicit SM120f) patterns would keep this test robust across the enabled arch set.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1b5405d0-67de-4b60-9f88-207fad08e963

📥 Commits

Reviewing files that changed from the base of the PR and between 9378ff3 and 984f9fd.

📒 Files selected for processing (51)
  • docs/libcudacxx/extended_api.rst
  • docs/libcudacxx/extended_api/simd.rst
  • docs/libcudacxx/extended_api/simd/abs_diff.rst
  • docs/libcudacxx/extended_api/simd/idot.rst
  • docs/libcudacxx/extended_api/simd/saturating_add.rst
  • libcudacxx/include/cuda/__simd/idot.h
  • libcudacxx/include/cuda/__simd/saturating_add.h
  • libcudacxx/include/cuda/__simd/simd_intrinsics.h
  • libcudacxx/include/cuda/__simd/simd_intrinsics_array.h
  • libcudacxx/include/cuda/__simd/vabsdiff.h
  • libcudacxx/include/cuda/simd
  • libcudacxx/include/cuda/std/__fwd/simd.h
  • libcudacxx/include/cuda/std/__internal/features.h
  • libcudacxx/include/cuda/std/__internal/namespaces.h
  • libcudacxx/include/cuda/std/__simd/basic_vec.h
  • libcudacxx/include/cuda/std/__simd/specializations/fixed_size_integral_vec.h
  • libcudacxx/include/cuda/std/__simd/specializations/fixed_size_storage.h
  • libcudacxx/include/cuda/std/__simd/specializations/simd_intrinsics.h
  • libcudacxx/include/cuda/std/__simd/specializations/simd_intrinsics_array.h
  • libcudacxx/include/cuda/std/__simd/type_traits.h
  • libcudacxx/test/libcudacxx/std/numerics/simd/simd.non_std/idot.pass.cpp
  • libcudacxx/test/libcudacxx/std/numerics/simd/simd.non_std/saturation_add.pass.cpp
  • libcudacxx/test/libcudacxx/std/numerics/simd/simd.non_std/vabsdiff.pass.cpp
  • libcudacxx/test/libcudacxx/std/numerics/simd/simd.traits/alignment.pass.cpp
  • libcudacxx/test/simd_codegen/CMakeLists.txt
  • libcudacxx/test/simd_codegen/floating_point/decrement_f32x2.cu
  • libcudacxx/test/simd_codegen/floating_point/fma_bf16.cu
  • libcudacxx/test/simd_codegen/floating_point/fma_f16.cu
  • libcudacxx/test/simd_codegen/floating_point/increment_f32x2.cu
  • libcudacxx/test/simd_codegen/floating_point/less_bf16.cu
  • libcudacxx/test/simd_codegen/floating_point/less_f16.cu
  • libcudacxx/test/simd_codegen/floating_point/minus_f32x2.cu
  • libcudacxx/test/simd_codegen/floating_point/multiplies_bf16.cu
  • libcudacxx/test/simd_codegen/floating_point/multiplies_f16.cu
  • libcudacxx/test/simd_codegen/floating_point/plus_bf16.cu
  • libcudacxx/test/simd_codegen/floating_point/plus_f16.cu
  • libcudacxx/test/simd_codegen/floating_point/plus_f32x2.cu
  • libcudacxx/test/simd_codegen/floating_point/unary_minus_f32x2.cu
  • libcudacxx/test/simd_codegen/fma_bf16.cu
  • libcudacxx/test/simd_codegen/fma_f16.cu
  • libcudacxx/test/simd_codegen/idot/idp2.cu
  • libcudacxx/test/simd_codegen/idot/idp4.cu
  • libcudacxx/test/simd_codegen/integer/arithmetic_u16x2.cu
  • libcudacxx/test/simd_codegen/integer/arithmetic_u8x4.cu
  • libcudacxx/test/simd_codegen/integer/bitwise_u16x2_u8x4.cu
  • libcudacxx/test/simd_codegen/minus_f32x2.cu
  • libcudacxx/test/simd_codegen/multiplies_bf16.cu
  • libcudacxx/test/simd_codegen/plus_bf16.cu
  • libcudacxx/test/simd_codegen/plus_f32x2.cu
  • libcudacxx/test/simd_codegen/saturation_add/saturating_add.cu
  • libcudacxx/test/simd_codegen/vabsdiff/vabsdiff.cu
💤 Files with no reviewable changes (6)
  • libcudacxx/test/simd_codegen/multiplies_bf16.cu
  • libcudacxx/test/simd_codegen/plus_bf16.cu
  • libcudacxx/test/simd_codegen/minus_f32x2.cu
  • libcudacxx/test/simd_codegen/fma_bf16.cu
  • libcudacxx/test/simd_codegen/fma_f16.cu
  • libcudacxx/test/simd_codegen/plus_f32x2.cu

Comment thread libcudacxx/include/cuda/__simd/simd_intrinsics_array.h
Comment thread libcudacxx/include/cuda/std/__internal/features.h
Comment thread libcudacxx/test/simd_codegen/CMakeLists.txt
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e1bf2e48-37e6-4ae3-9c10-bf0d22c8b76e

📥 Commits

Reviewing files that changed from the base of the PR and between 984f9fd and f08a379.

📒 Files selected for processing (4)
  • libcudacxx/include/cuda/__simd/simd_intrinsics.h
  • libcudacxx/include/cuda/__simd/simd_intrinsics_array.h
  • libcudacxx/test/libcudacxx/std/numerics/simd/simd.non_std/idot.pass.cpp
  • libcudacxx/test/simd_codegen/idot/idp2.cu

Comment on lines +111 to +139
template <typename _Tp, typename _Up, typename _AccumT, ::cuda::std::size_t _Np>
[[nodiscard]] _CCCL_DEVICE_API _AccumT __dp4a_8bit_x4(
const ::cuda::std::simd::__array_u32_t<_Np>& __lhs_u,
const ::cuda::std::simd::__array_u32_t<_Np>& __rhs_u,
const _AccumT __acc) noexcept
{
_AccumT __result = __acc;
_CCCL_PRAGMA_UNROLL_FULL()
for (::cuda::std::size_t __i = 0; __i < _Np; ++__i)
{
if constexpr (::cuda::std::is_unsigned_v<_Tp> && ::cuda::std::is_unsigned_v<_Up>)
{
__result = ::cuda::simd::__dp4a_u8x4_u8x4(__lhs_u[__i], __rhs_u[__i], __result);
}
else if constexpr (::cuda::std::is_unsigned_v<_Tp>)
{
__result = ::cuda::simd::__dp4a_u8x4_s8x4(__lhs_u[__i], __rhs_u[__i], __result);
}
else if constexpr (::cuda::std::is_unsigned_v<_Up>)
{
__result = ::cuda::simd::__dp4a_s8x4_u8x4(__lhs_u[__i], __rhs_u[__i], __result);
}
else
{
__result = ::cuda::simd::__dp4a_s8x4_s8x4(__lhs_u[__i], __rhs_u[__i], __result);
}
}
return __result;
}
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n 'long long|unsigned long long|__int128_t' libcudacxx/test/libcudacxx/std/numerics/simd/simd.non_std/idot.pass.cpp

python - <<'PY'
MASK = (1 << 32) - 1
acc = 1 << 40
chunk = 12345

current = ((acc & MASK) + chunk) & MASK
exact = acc + chunk

print(f"truncated helper pattern: {current}")
print(f"exact AccT accumulation: {exact}")
PY

Repository: NVIDIA/cccl

Length of output: 474


🏁 Script executed:

# Find the simd_intrinsics.h file to check intrinsic signatures
find . -name "simd_intrinsics.h" -type f

Repository: NVIDIA/cccl

Length of output: 177


🏁 Script executed:

# Check the intrinsic function signatures in simd_intrinsics.h
fd simd_intrinsics.h | head -1 | xargs cat | grep -A 10 "__dp4a_"

Repository: NVIDIA/cccl

Length of output: 3096


🏁 Script executed:

# Also check the dp2a intrinsics
fd simd_intrinsics.h | head -1 | xargs cat | grep -A 10 "__dp2a_"

Repository: NVIDIA/cccl

Length of output: 6501


🏁 Script executed:

cat -n libcudacxx/include/cuda/__simd/simd_intrinsics_array.h | sed -n '111,140p'

Repository: NVIDIA/cccl

Length of output: 1291


🏁 Script executed:

cat -n libcudacxx/include/cuda/__simd/simd_intrinsics_array.h | sed -n '141,200p'

Repository: NVIDIA/cccl

Length of output: 2279


important: wide accumulators are truncated to 32 bits every iteration.

At lines 123, 127, 131, 135 (and similarly at lines 156, 160, 167, 171, 178, 182, 189, 193), __dp4a_* and __dp2a_* intrinsics accept and return only 32-bit types, but _AccumT __result is passed directly without intermediate conversion. When _AccumT is wider than 32 bits (long long, unsigned long long, __int128_t), the implicit conversion to 32-bit truncates the upper bits before each intrinsic call, then the 32-bit result expands back with zero-filled upper bits. This breaks the idot contract: the running sum loses precision on every iteration once it exceeds 32 bits.

Use a 32-bit temporary for each intrinsic call, then add the widened result back into _AccumT __result instead of threading _AccumT through the instruction. Current test cases instantiate wide accumulators but use small values that fit within 32 bits, so the bug is not caught.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is actually a goood comment, I believe we should static assert that _AccumT is integral and has size 4

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: agreed. The static_assert belongs inside __dp4a_8bit_x4 and __dp2a_16bit_x2_8bit_x4 — not on the public idot signature — so the scalar fallback in idot.h can still legally accept wider accumulators (e.g., __int128). Only the intrinsic helpers need the constraint since the hardware only speaks 32-bit.

Concrete addition at the top of each helper:

static_assert(::cuda::std::is_integral_v<_AccumT> && sizeof(_AccumT) == 4,
              "idot intrinsic path requires a 32-bit integral accumulator");

Same assert in __dp2a_16bit_x2_8bit_x4. The public idot.h dispatch already checks sizes/signs before routing into these helpers, so in practice the assert should never fire — but it makes the contract explicit and catches future callers that bypass the routing logic.


🧠 Learnings used
Learnt from: sleeepyjack
Repo: NVIDIA/cccl PR: 8857
File: cudax/include/cuda/experimental/__cuco/__hyperloglog/default_policy.cuh:92-95
Timestamp: 2026-05-06T23:47:10.296Z
Learning: In this CCCL (CUDA C++ Core Libraries) codebase, do NOT treat `_CCCL_API` as host-only. `_CCCL_API` is defined in `libcudacxx/include/cuda/std/__cccl/visibility.h` and always expands to include `_CCCL_HOST_DEVICE` (possibly alongside `_CCCL_TILE`, `_CCCL_VISIBILITY_HIDDEN`, and `_CCCL_EXCLUDE_FROM_EXPLICIT_INSTANTIATION`). So any function annotated with `_CCCL_API` is already valid for both host and device code; reviewers should not flag it as host-only.

Learnt from: sleeepyjack
Repo: NVIDIA/cccl PR: 8857
File: cudax/include/cuda/experimental/__cuco/__hyperloglog/hyperloglog_impl.cuh:426-428
Timestamp: 2026-05-06T23:47:17.597Z
Learning: In the CCCL/CUDA C++ Core Libraries codebase (NVIDIA/cccl), treat `_CCCL_API` as host-and-device callable. `_CCCL_API` expands to `_CCCL_HOST_DEVICE` via `libcudacxx/include/cuda/std/__cccl/visibility.h`, so functions marked with `_CCCL_API` may be invoked from both host and device code. When reviewing device-code contexts, do not flag `_CCCL_API`-annotated functions as if they were host-only.

Comment on lines +61 to +95
/*

; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_s16_s8.*}}
; SMXX-DAG: {{.*IDP\.2A.*LO.*}}
; SMXX-DAG: {{.*IDP\.2A.*HI.*}}

; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_s8_s16.*}}
; SMXX-DAG: {{.*IDP\.2A.*LO.*}}
; SMXX-DAG: {{.*IDP\.2A.*HI.*}}

; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_s16_u8.*}}
; SMXX-DAG: {{.*IDP\.2A.*LO.*}}
; SMXX-DAG: {{.*IDP\.2A.*HI.*}}

; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_u8_s16.*}}
; SMXX-DAG: {{.*IDP\.2A.*LO.*}}
; SMXX-DAG: {{.*IDP\.2A.*HI.*}}

; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_u16_s8.*}}
; SMXX-DAG: {{.*IDP\.2A.*LO.*}}
; SMXX-DAG: {{.*IDP\.2A.*HI.*}}

; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_s8_u16.*}}
; SMXX-DAG: {{.*IDP\.2A.*LO.*}}
; SMXX-DAG: {{.*IDP\.2A.*HI.*}}

; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_u16_u8.*}}
; SMXX-DAG: {{.*IDP\.2A.*LO.*}}
; SMXX-DAG: {{.*IDP\.2A.*HI.*}}

; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_u8_u16.*}}
; SMXX-DAG: {{.*IDP\.2A.*LO.*}}
; SMXX-DAG: {{.*IDP\.2A.*HI.*}}

*/
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n 'SMXX-(LABEL|DAG)' libcudacxx/test/simd_codegen/idot/idp2.cu

Repository: NVIDIA/cccl

Length of output: 1151


important: Anchor codegen assertions to specific functions. Lines 63-93 use only SMXX-DAG matches, allowing function names and instruction patterns to float globally across the file. A broken mixed-type wrapper can pass if any other wrapper emits the required instruction. Use SMXX-LABEL to anchor each function name, with the instruction checks (IDP.2A.*LO and IDP.2A.*HI) immediately following under that anchor.

@github-actions

This comment has been minimized.

@github-actions
Copy link
Copy Markdown
Contributor

😬 CI Workflow Results

🟥 Finished in 6h 56m: Pass: 92%/116 | Total: 5d 13h | Max: 5h 34m | Hits: 56%/2884979

See results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

libcu++ For all items related to libcu++

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

2 participants