`cuda::simd` Integer dot product by fbusato · Pull Request #9064 · NVIDIA/cccl

fbusato · 2026-05-18T23:32:07Z

Description

Introduce SIMD idot to compute the dot product of two integer vectors.
The operation maps to IDP4/IDP2 HW instructions.
The PR includes the implementation, unit tests, documentation, and codegen checks.
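As a rough illustration of the arithmetic described above, here is a minimal scalar sketch of a 4-way signed 8-bit dot product with a 32-bit accumulator. The name `idot4_ref` is hypothetical and is not the `cuda::simd` API; it only shows the math that a DP4A-style instruction is generally understood to perform on one packed word.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference (not the library API):
// returns acc + sum(a[i] * b[i]) over four signed 8-bit lanes.
// Each product is widened to 32 bits before it is accumulated.
constexpr std::int32_t idot4_ref(const std::array<std::int8_t, 4>& a,
                                 const std::array<std::int8_t, 4>& b,
                                 std::int32_t acc)
{
  for (std::size_t i = 0; i < 4; ++i)
  {
    acc += static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
  }
  return acc;
}
```

A reference of this kind is also the sort of thing a unit test could compare the vectorized result against.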

Use cases

INT8 quantized inference
Mixed-precision INT16 × INT8 GEMM/conv
Image processing (convolution, sum of squared differences, histogram)
Checksum algorithms (e.g. CRC)

Requires:

cuda::simd Add abs_diff #8994

coderabbitai · 2026-05-18T23:39:16Z


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bd55219c-1d37-42ff-8555-a55538715033

📥 Commits

Reviewing files that changed from the base of the PR and between f08a379 and 4e548e6.

📒 Files selected for processing (20)

docs/libcudacxx/extended_api.rst
docs/libcudacxx/extended_api/simd.rst
docs/libcudacxx/extended_api/simd/abs_diff.rst
docs/libcudacxx/extended_api/simd/idot.rst
docs/libcudacxx/extended_api/simd/saturating_add.rst
libcudacxx/include/cuda/__simd/idot.h
libcudacxx/include/cuda/__simd/saturating_add.h
libcudacxx/include/cuda/__simd/simd_intrinsics.h
libcudacxx/include/cuda/__simd/simd_intrinsics_array.h
libcudacxx/include/cuda/__simd/vabsdiff.h
libcudacxx/include/cuda/simd
libcudacxx/include/cuda/std/__fwd/simd.h
libcudacxx/include/cuda/std/__internal/features.h
libcudacxx/include/cuda/std/__internal/namespaces.h
libcudacxx/include/cuda/std/__simd/basic_vec.h
libcudacxx/include/cuda/std/__simd/specializations/fixed_size_integral_vec.h
libcudacxx/include/cuda/std/__simd/specializations/fixed_size_storage.h
libcudacxx/include/cuda/std/__simd/specializations/simd_intrinsics.h
libcudacxx/include/cuda/std/__simd/specializations/simd_intrinsics_array.h
libcudacxx/include/cuda/std/__simd/type_traits.h

📝 Walkthrough


This PR integrates three new SIMD integer operations (idot, saturating_add, abs_diff) into libcudacxx with compile-time feature detection, low-level device intrinsics behind SM capability gates, high-level operation wrappers with fallback scalar paths, storage alignment updates, a fixed-size small-integral specialization, comprehensive functional tests, and expanded codegen validation.

Changes

SIMD Intrinsics for Integer Operations

| Layer / File(s) | Summary |
| --- | --- |
| Feature detection, namespaces, and forward declarations `libcudacxx/include/cuda/std/__internal/features.h`, `libcudacxx/include/cuda/std/__internal/namespaces.h`, `libcudacxx/include/cuda/std/__fwd/simd.h` | Adds feature macros `_CCCL_HAS_SIMD_8BIT`, `_CCCL_HAS_SIMD_SAT`, `_CCCL_HAS_SIMD_VABSDIFF`, `_CCCL_HAS_SIMD_IDOT` with PTX ISA thresholds and tile-compilation exclusions; introduces `_CCCL_BEGIN_NAMESPACE_CUDA_SIMD` macros; adds `__fixed_size_integral` to `__simd_operations_kind` enum. |
| Low-level hardware intrinsics `libcudacxx/include/cuda/__simd/simd_intrinsics.h`, `libcudacxx/include/cuda/simd` | Defines device intrinsics `__vadd_sat_`, `__vabsdiff_`, and `__dp4a_`/`__dp2a_` with NV_IF_TARGET SM capability gating and inline PTX/builtin dispatch; includes header wiring for public SIMD namespace. |
| Array-based intrinsic wrappers and storage conversion `libcudacxx/include/cuda/__simd/simd_intrinsics_array.h`, `libcudacxx/include/cuda/std/__simd/specializations/simd_intrinsics_array.h` | Device-only template helpers converting SIMD storage to uint32_t-backed arrays, plus `__vadd_sat_bit_`, `__vabsdiff_bit_`, `__dp4a_bit_`, `__dp2a_bit_` wrappers with `if constexpr` signedness dispatch and unrolled per-element loops. |
| Public SIMD operations (idot, saturating_add, abs_diff) `libcudacxx/include/cuda/__simd/idot.h`, `libcudacxx/include/cuda/__simd/saturating_add.h`, `libcudacxx/include/cuda/__simd/vabsdiff.h` | Templates conditionally selecting intrinsic paths (when `_CCCL_HAS_SIMD_*()`) with fallback scalar/per-element implementations; constexpr-compatible, noexcept, nodiscard; integer type constraints on operands/accumulators. |
| Storage alignment and type traits `libcudacxx/include/cuda/std/__simd/specializations/fixed_size_storage.h`, `libcudacxx/include/cuda/std/__simd/type_traits.h`, `libcudacxx/include/cuda/std/__simd/basic_vec.h` | Updates fixed-size storage alignment to `max(alignof(T), 8)`; replaces conditional pointer alignment logic with `__optimal_cuda_alignment` (32 vs 16 based on CTK version); exposes storage declarations in `basic_vec` by removing `private:`/redundant `public:` markers; includes `fixed_size_integral_vec.h`. |
| Fixed-size small-integral SIMD specialization `libcudacxx/include/cuda/std/__simd/specializations/fixed_size_integral_vec.h` | Specializes `__simd_operations` for `sizeof(T) < sizeof(uint32_t)` with uint32_t-based storage; implements bitwise ops (AND/OR/XOR/NOT) via unrolled per-lane loops and arithmetic ops (add/subtract/inc/dec/unary-minus) with optional SM90/SM120f intrinsic dispatch. |
| Functional tests (idot, saturating_add, abs_diff) `libcudacxx/test/libcudacxx/std/numerics/simd/simd.non_std/*.pass.cpp` | SFINAE-based compile-time enablement checks, scalar reference implementations for correctness validation, array-based test vectors with edge cases (min/max), signed/unsigned/int128 coverage, static_assert compatibility. |
| Alignment test suite updates `libcudacxx/test/libcudacxx/std/numerics/simd/simd.traits/alignment.pass.cpp` | Refactors alignment validation to use platform-dependent `optimal_cuda_alignment` constant instead of size-based formula. |
| CMake codegen infrastructure `libcudacxx/test/simd_codegen/CMakeLists.txt` | Detects Clang compiler and skips codegen tests; switches file discovery to targeted subdirectories; adds SM120f arch support with conditional compiler versions; introduces `simd_codegen_get_check_prefixes` and `simd_codegen_set_cuda_arch` helpers for arch-specific FileCheck and compilation settings. |
| Floating-point codegen tests (refactor) `libcudacxx/test/simd_codegen/floating_point/*.cu` | Consolidates global kernels into device-only functions (F16/BF16 FMA, inc/dec, unary-minus, comparisons, multiply); updates SMXX labels and FileCheck directives; removes old global-kernel test files. |
| Integer and integer-dot-product codegen tests `libcudacxx/test/simd_codegen/{idot,saturation_add,vabsdiff,integer}/*.cu` | Adds codegen validation for idot (DP4A/DP2A), saturating_add (VIADD variants), abs_diff (VABSDIFF/VIMNMX), and small-integer arithmetic/bitwise operations with per-function SMXX pattern matching. |
| Public API documentation `docs/libcudacxx/extended_api/simd.rst`, `docs/libcudacxx/extended_api/simd/{abs_diff,idot,saturating_add}.rst` | Documents signatures, behavioral semantics, constraints, hardware performance notes (SM capability targeting), and runnable examples for all three new operations; adds toctree entry to extended API index. |
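To make the "uint32_t-backed" packing mentioned in the summary more concrete, the following is a small host-side sketch of per-lane unsigned saturating add and absolute difference on four 8-bit lanes packed into one 32-bit word. All names here (`lane`, `saturating_add_u8x4_ref`, `abs_diff_u8x4_ref`) are hypothetical; this is a scalar model under the usual definitions of those operations, not the PR's implementation.

```cpp
#include <cstdint>

// Hypothetical helper: extract 8-bit lane i (0 = least significant) from a packed word.
constexpr std::uint32_t lane(std::uint32_t word, int i)
{
  return (word >> (8 * i)) & 0xFFu;
}

// Per-lane unsigned saturating add: each lane clamps at 0xFF instead of wrapping.
constexpr std::uint32_t saturating_add_u8x4_ref(std::uint32_t a, std::uint32_t b)
{
  std::uint32_t out = 0;
  for (int i = 0; i < 4; ++i)
  {
    const std::uint32_t sum = lane(a, i) + lane(b, i);
    out |= (sum > 0xFFu ? 0xFFu : sum) << (8 * i);
  }
  return out;
}

// Per-lane unsigned absolute difference: |a[i] - b[i]| computed without wrap-around.
constexpr std::uint32_t abs_diff_u8x4_ref(std::uint32_t a, std::uint32_t b)
{
  std::uint32_t out = 0;
  for (int i = 0; i < 4; ++i)
  {
    const std::uint32_t x = lane(a, i);
    const std::uint32_t y = lane(b, i);
    out |= (x > y ? x - y : y - x) << (8 * i);
  }
  return out;
}
```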

Suggested labels: cudax

Suggested reviewers:

alliepiper
ericniebler
gonidelis


coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

libcudacxx/include/cuda/std/__simd/type_traits.h (1)

24-30: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add a direct include of <cuda/std/__algorithm/max.h> to this header. Line 44 uses ::cuda::std::max, but the header that declares it is never included directly. This creates a transitive-include dependency, which violates include self-sufficiency and risks breakage when include order changes.

🧹 Nitpick comments (3)

docs/libcudacxx/extended_api/simd/saturating_add.rst (1)

43-43: ⚡ Quick win

suggestion: Line 43 lists "VIADD.16x2" but the pattern suggests it should be "VIADD.U16x2" (unsigned 16-bit) to match the signed/unsigned pairs for 8-bit (VIADD.S8x4/VIADD.U8x4) and the signed 16-bit variant (VIADD.S16x2).

libcudacxx/test/simd_codegen/floating_point/plus_f32x2.cu (1)

25-26: ⚡ Quick win

suggestion: Add 1xx-family op checks in addition to SM100.
This currently validates FADD2 only for SM100; adding SM1XX (or explicit SM120f) keeps coverage aligned with the new 1xx arch support in the codegen infra.

libcudacxx/test/simd_codegen/floating_point/unary_minus_f32x2.cu (1)

25-26: ⚡ Quick win

suggestion: Mirror 1xx-family checks here as well.
SM100-only op assertions leave newer 1xx targets under-checked; adding SM1XX (or explicit SM120f) patterns would keep this test robust across the enabled arch set.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1b5405d0-67de-4b60-9f88-207fad08e963

📥 Commits

Reviewing files that changed from the base of the PR and between 9378ff3 and 984f9fd.

📒 Files selected for processing (51)

docs/libcudacxx/extended_api.rst
docs/libcudacxx/extended_api/simd.rst
docs/libcudacxx/extended_api/simd/abs_diff.rst
docs/libcudacxx/extended_api/simd/idot.rst
docs/libcudacxx/extended_api/simd/saturating_add.rst
libcudacxx/include/cuda/__simd/idot.h
libcudacxx/include/cuda/__simd/saturating_add.h
libcudacxx/include/cuda/__simd/simd_intrinsics.h
libcudacxx/include/cuda/__simd/simd_intrinsics_array.h
libcudacxx/include/cuda/__simd/vabsdiff.h
libcudacxx/include/cuda/simd
libcudacxx/include/cuda/std/__fwd/simd.h
libcudacxx/include/cuda/std/__internal/features.h
libcudacxx/include/cuda/std/__internal/namespaces.h
libcudacxx/include/cuda/std/__simd/basic_vec.h
libcudacxx/include/cuda/std/__simd/specializations/fixed_size_integral_vec.h
libcudacxx/include/cuda/std/__simd/specializations/fixed_size_storage.h
libcudacxx/include/cuda/std/__simd/specializations/simd_intrinsics.h
libcudacxx/include/cuda/std/__simd/specializations/simd_intrinsics_array.h
libcudacxx/include/cuda/std/__simd/type_traits.h
libcudacxx/test/libcudacxx/std/numerics/simd/simd.non_std/idot.pass.cpp
libcudacxx/test/libcudacxx/std/numerics/simd/simd.non_std/saturation_add.pass.cpp
libcudacxx/test/libcudacxx/std/numerics/simd/simd.non_std/vabsdiff.pass.cpp
libcudacxx/test/libcudacxx/std/numerics/simd/simd.traits/alignment.pass.cpp
libcudacxx/test/simd_codegen/CMakeLists.txt
libcudacxx/test/simd_codegen/floating_point/decrement_f32x2.cu
libcudacxx/test/simd_codegen/floating_point/fma_bf16.cu
libcudacxx/test/simd_codegen/floating_point/fma_f16.cu
libcudacxx/test/simd_codegen/floating_point/increment_f32x2.cu
libcudacxx/test/simd_codegen/floating_point/less_bf16.cu
libcudacxx/test/simd_codegen/floating_point/less_f16.cu
libcudacxx/test/simd_codegen/floating_point/minus_f32x2.cu
libcudacxx/test/simd_codegen/floating_point/multiplies_bf16.cu
libcudacxx/test/simd_codegen/floating_point/multiplies_f16.cu
libcudacxx/test/simd_codegen/floating_point/plus_bf16.cu
libcudacxx/test/simd_codegen/floating_point/plus_f16.cu
libcudacxx/test/simd_codegen/floating_point/plus_f32x2.cu
libcudacxx/test/simd_codegen/floating_point/unary_minus_f32x2.cu
libcudacxx/test/simd_codegen/fma_bf16.cu
libcudacxx/test/simd_codegen/fma_f16.cu
libcudacxx/test/simd_codegen/idot/idp2.cu
libcudacxx/test/simd_codegen/idot/idp4.cu
libcudacxx/test/simd_codegen/integer/arithmetic_u16x2.cu
libcudacxx/test/simd_codegen/integer/arithmetic_u8x4.cu
libcudacxx/test/simd_codegen/integer/bitwise_u16x2_u8x4.cu
libcudacxx/test/simd_codegen/minus_f32x2.cu
libcudacxx/test/simd_codegen/multiplies_bf16.cu
libcudacxx/test/simd_codegen/plus_bf16.cu
libcudacxx/test/simd_codegen/plus_f32x2.cu
libcudacxx/test/simd_codegen/saturation_add/saturating_add.cu
libcudacxx/test/simd_codegen/vabsdiff/vabsdiff.cu

💤 Files with no reviewable changes (6)

libcudacxx/test/simd_codegen/multiplies_bf16.cu
libcudacxx/test/simd_codegen/plus_bf16.cu
libcudacxx/test/simd_codegen/minus_f32x2.cu
libcudacxx/test/simd_codegen/fma_bf16.cu
libcudacxx/test/simd_codegen/fma_f16.cu
libcudacxx/test/simd_codegen/plus_f32x2.cu

coderabbitai

Actionable comments posted: 2

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e1bf2e48-37e6-4ae3-9c10-bf0d22c8b76e

📥 Commits

Reviewing files that changed from the base of the PR and between 984f9fd and f08a379.

📒 Files selected for processing (4)

libcudacxx/include/cuda/__simd/simd_intrinsics.h
libcudacxx/include/cuda/__simd/simd_intrinsics_array.h
libcudacxx/test/libcudacxx/std/numerics/simd/simd.non_std/idot.pass.cpp
libcudacxx/test/simd_codegen/idot/idp2.cu

coderabbitai · 2026-05-19T00:02:51Z

+template <typename _Tp, typename _Up, typename _AccumT, ::cuda::std::size_t _Np>
+[[nodiscard]] _CCCL_DEVICE_API _AccumT __dp4a_8bit_x4(
+  const ::cuda::std::simd::__array_u32_t<_Np>& __lhs_u,
+  const ::cuda::std::simd::__array_u32_t<_Np>& __rhs_u,
+  const _AccumT __acc) noexcept
+{
+  _AccumT __result = __acc;
+  _CCCL_PRAGMA_UNROLL_FULL()
+  for (::cuda::std::size_t __i = 0; __i < _Np; ++__i)
+  {
+    if constexpr (::cuda::std::is_unsigned_v<_Tp> && ::cuda::std::is_unsigned_v<_Up>)
+    {
+      __result = ::cuda::simd::__dp4a_u8x4_u8x4(__lhs_u[__i], __rhs_u[__i], __result);
+    }
+    else if constexpr (::cuda::std::is_unsigned_v<_Tp>)
+    {
+      __result = ::cuda::simd::__dp4a_u8x4_s8x4(__lhs_u[__i], __rhs_u[__i], __result);
+    }
+    else if constexpr (::cuda::std::is_unsigned_v<_Up>)
+    {
+      __result = ::cuda::simd::__dp4a_s8x4_u8x4(__lhs_u[__i], __rhs_u[__i], __result);
+    }
+    else
+    {
+      __result = ::cuda::simd::__dp4a_s8x4_s8x4(__lhs_u[__i], __rhs_u[__i], __result);
+    }
+  }
+  return __result;
+}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n 'long long|unsigned long long|__int128_t' libcudacxx/test/libcudacxx/std/numerics/simd/simd.non_std/idot.pass.cpp
python - <<'PY'
MASK = (1 << 32) - 1
acc = 1 << 40
chunk = 12345
current = ((acc & MASK) + chunk) & MASK
exact = acc + chunk
print(f"truncated helper pattern: {current}")
print(f"exact AccT accumulation: {exact}")
PY

Repository: NVIDIA/cccl

Length of output: 474

🏁 Script executed:

# Find the simd_intrinsics.h file to check intrinsic signatures
find . -name "simd_intrinsics.h" -type f

Repository: NVIDIA/cccl

Length of output: 177

🏁 Script executed:

# Check the intrinsic function signatures in simd_intrinsics.h
fd simd_intrinsics.h | head -1 | xargs cat | grep -A 10 "__dp4a_"

Repository: NVIDIA/cccl

Length of output: 3096

🏁 Script executed:

# Also check the dp2a intrinsics
fd simd_intrinsics.h | head -1 | xargs cat | grep -A 10 "__dp2a_"

Repository: NVIDIA/cccl

Length of output: 6501

🏁 Script executed:

cat -n libcudacxx/include/cuda/__simd/simd_intrinsics_array.h | sed -n '111,140p'

Repository: NVIDIA/cccl

Length of output: 1291

🏁 Script executed:

cat -n libcudacxx/include/cuda/__simd/simd_intrinsics_array.h | sed -n '141,200p'

Repository: NVIDIA/cccl

Length of output: 2279

important: wide accumulators are truncated to 32 bits every iteration.

At lines 123, 127, 131, 135 (and similarly at lines 156, 160, 167, 171, 178, 182, 189, 193), the __dp4a_* and __dp2a_* intrinsics accept and return only 32-bit types, but _AccumT __result is passed to them directly without an intermediate conversion. When _AccumT is wider than 32 bits (long long, unsigned long long, __int128_t), the implicit conversion to 32 bits discards the upper bits before each intrinsic call, and widening the 32-bit result back to _AccumT does not restore them. This breaks the idot contract: once the running sum exceeds 32 bits, its upper bits are lost on every iteration.

Use a 32-bit temporary for each intrinsic call, then add the widened result back into _AccumT __result instead of threading _AccumT through the instruction. Current test cases instantiate wide accumulators but use small values that fit within 32 bits, so the bug is not caught.
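The difference between the flagged pattern and the suggested fix can be sketched on the host as follows. `dp4a_u32_ref` is a hypothetical stand-in for a 32-bit-only intrinsic with unsigned 8-bit lanes, and `idot_truncating` / `idot_widening` are illustrative names rather than functions from the PR.

```cpp
#include <cstdint>

// Hypothetical stand-in for a 32-bit-only dot-product intrinsic (four unsigned 8-bit lanes).
constexpr std::uint32_t dp4a_u32_ref(std::uint32_t a, std::uint32_t b, std::uint32_t acc)
{
  for (int i = 0; i < 4; ++i)
  {
    acc += ((a >> (8 * i)) & 0xFFu) * ((b >> (8 * i)) & 0xFFu);
  }
  return acc;
}

// Pattern flagged in the review: the wide accumulator is narrowed on every call,
// so anything above bit 31 is discarded.
constexpr std::uint64_t idot_truncating(const std::uint32_t* a, const std::uint32_t* b, int n, std::uint64_t acc)
{
  for (int i = 0; i < n; ++i)
  {
    acc = dp4a_u32_ref(a[i], b[i], static_cast<std::uint32_t>(acc));
  }
  return acc;
}

// Suggested shape: compute each chunk in a 32-bit temporary starting from zero,
// then add the widened chunk to the wide accumulator.
constexpr std::uint64_t idot_widening(const std::uint32_t* a, const std::uint32_t* b, int n, std::uint64_t acc)
{
  for (int i = 0; i < n; ++i)
  {
    const std::uint32_t chunk = dp4a_u32_ref(a[i], b[i], 0u);
    acc += chunk;
  }
  return acc;
}
```

With an initial accumulator of `1 << 40`, the first version returns only the low 32 bits plus the chunk, while the second keeps the full value.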

That is actually a good comment. I believe we should static_assert that _AccumT is integral and has size 4.

suggestion: agreed. The static_assert belongs inside __dp4a_8bit_x4 and __dp2a_16bit_x2_8bit_x4 — not on the public idot signature — so the scalar fallback in idot.h can still legally accept wider accumulators (e.g., __int128). Only the intrinsic helpers need the constraint since the hardware only speaks 32-bit.

Concrete addition at the top of each helper:

static_assert(::cuda::std::is_integral_v<_AccumT> && sizeof(_AccumT) == 4, "idot intrinsic path requires a 32-bit integral accumulator");

Same assert in __dp2a_16bit_x2_8bit_x4. The public idot.h dispatch already checks sizes/signs before routing into these helpers, so in practice the assert should never fire — but it makes the contract explicit and catches future callers that bypass the routing logic.
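A minimal host-side sketch of that split, assuming C++17: the helper carries the `static_assert`, and a dispatcher only routes 32-bit accumulators into it while wider accumulators take a scalar path. `dp4a_helper_ref` and `idot_ref` are hypothetical names, and the lanes are treated as unsigned 8-bit purely for illustration.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical helper modelling the intrinsic path: only 32-bit integral accumulators allowed.
template <typename AccumT>
constexpr AccumT dp4a_helper_ref(std::uint32_t a, std::uint32_t b, AccumT acc)
{
  static_assert(std::is_integral_v<AccumT> && sizeof(AccumT) == 4,
                "intrinsic path requires a 32-bit integral accumulator");
  std::uint32_t sum = static_cast<std::uint32_t>(acc);
  for (int i = 0; i < 4; ++i)
  {
    sum += ((a >> (8 * i)) & 0xFFu) * ((b >> (8 * i)) & 0xFFu);
  }
  return static_cast<AccumT>(sum);
}

// Hypothetical dispatcher: wider accumulators never reach the helper,
// so the static_assert does not fire for them.
template <typename AccumT>
constexpr AccumT idot_ref(std::uint32_t a, std::uint32_t b, AccumT acc)
{
  if constexpr (sizeof(AccumT) == 4)
  {
    return dp4a_helper_ref(a, b, acc);
  }
  else
  {
    // Scalar fallback: accumulate directly in the wide type.
    for (int i = 0; i < 4; ++i)
    {
      acc += static_cast<AccumT>(((a >> (8 * i)) & 0xFFu) * ((b >> (8 * i)) & 0xFFu));
    }
    return acc;
  }
}
```

Calling `dp4a_helper_ref` directly with a 64-bit accumulator would fail to compile, which is the explicit contract the comment above asks for.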

🧠 Learnings used

Learnt from: sleeepyjack Repo: NVIDIA/cccl PR: 8857 File: cudax/include/cuda/experimental/__cuco/__hyperloglog/default_policy.cuh:92-95 Timestamp: 2026-05-06T23:47:10.296Z Learning: In this CCCL (CUDA C++ Core Libraries) codebase, do NOT treat `_CCCL_API` as host-only. `_CCCL_API` is defined in `libcudacxx/include/cuda/std/__cccl/visibility.h` and always expands to include `_CCCL_HOST_DEVICE` (possibly alongside `_CCCL_TILE`, `_CCCL_VISIBILITY_HIDDEN`, and `_CCCL_EXCLUDE_FROM_EXPLICIT_INSTANTIATION`). So any function annotated with `_CCCL_API` is already valid for both host and device code; reviewers should not flag it as host-only.

Learnt from: sleeepyjack Repo: NVIDIA/cccl PR: 8857 File: cudax/include/cuda/experimental/__cuco/__hyperloglog/hyperloglog_impl.cuh:426-428 Timestamp: 2026-05-06T23:47:17.597Z Learning: In the CCCL/CUDA C++ Core Libraries codebase (NVIDIA/cccl), treat `_CCCL_API` as host-and-device callable. `_CCCL_API` expands to `_CCCL_HOST_DEVICE` via `libcudacxx/include/cuda/std/__cccl/visibility.h`, so functions marked with `_CCCL_API` may be invoked from both host and device code. When reviewing device-code contexts, do not flag `_CCCL_API`-annotated functions as if they were host-only.

coderabbitai · 2026-05-19T00:02:51Z

+/*
+
+; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_s16_s8.*}}
+; SMXX-DAG: {{.*IDP\.2A.*LO.*}}
+; SMXX-DAG: {{.*IDP\.2A.*HI.*}}
+
+; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_s8_s16.*}}
+; SMXX-DAG: {{.*IDP\.2A.*LO.*}}
+; SMXX-DAG: {{.*IDP\.2A.*HI.*}}
+
+; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_s16_u8.*}}
+; SMXX-DAG: {{.*IDP\.2A.*LO.*}}
+; SMXX-DAG: {{.*IDP\.2A.*HI.*}}
+
+; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_u8_s16.*}}
+; SMXX-DAG: {{.*IDP\.2A.*LO.*}}
+; SMXX-DAG: {{.*IDP\.2A.*HI.*}}
+
+; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_u16_s8.*}}
+; SMXX-DAG: {{.*IDP\.2A.*LO.*}}
+; SMXX-DAG: {{.*IDP\.2A.*HI.*}}
+
+; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_s8_u16.*}}
+; SMXX-DAG: {{.*IDP\.2A.*LO.*}}
+; SMXX-DAG: {{.*IDP\.2A.*HI.*}}
+
+; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_u16_u8.*}}
+; SMXX-DAG: {{.*IDP\.2A.*LO.*}}
+; SMXX-DAG: {{.*IDP\.2A.*HI.*}}
+
+; SMXX-DAG: {{[[:space:]]*}}Function : {{.*test_idot_u8_u16.*}}
+; SMXX-DAG: {{.*IDP\.2A.*LO.*}}
+; SMXX-DAG: {{.*IDP\.2A.*HI.*}}
+
+*/


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n 'SMXX-(LABEL|DAG)' libcudacxx/test/simd_codegen/idot/idp2.cu

Repository: NVIDIA/cccl

Length of output: 1151

important: Anchor codegen assertions to specific functions. Lines 63-93 use only SMXX-DAG matches, allowing function names and instruction patterns to float globally across the file. A broken mixed-type wrapper can pass if any other wrapper emits the required instruction. Use SMXX-LABEL to anchor each function name, with the instruction checks (IDP.2A.*LO and IDP.2A.*HI) immediately following under that anchor.
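One possible shape for the anchored checks, shown for the first two functions only, is sketched below. It relies on FileCheck's standard `-LABEL` directive, which splits the input into blocks so that the `-DAG` lines after a label can only match inside that function's block. It also assumes the disassembly lists the functions in the same order as the directives, because label matches are ordered; whether that holds for this test's output is not confirmed here.

```cpp
/*

; SMXX-LABEL: Function : {{.*test_idot_s16_s8.*}}
; SMXX-DAG: {{.*IDP\.2A.*LO.*}}
; SMXX-DAG: {{.*IDP\.2A.*HI.*}}

; SMXX-LABEL: Function : {{.*test_idot_s8_s16.*}}
; SMXX-DAG: {{.*IDP\.2A.*LO.*}}
; SMXX-DAG: {{.*IDP\.2A.*HI.*}}

*/
```

The remaining six functions would follow the same pattern.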

github-actions · 2026-05-20T05:46:17Z

😬 CI Workflow Results

🟥 Finished in 6h 56m: Pass: 92%/116 | Total: 5d 13h | Max: 5h 34m | Hits: 56%/2884979

See results here.

fbusato self-assigned this May 18, 2026

fbusato requested review from a team as code owners May 18, 2026 23:32

fbusato added the libcu++ For all items related to libcu++ label May 18, 2026

fbusato added this to CCCL May 18, 2026

fbusato requested review from alliepiper and gonidelis May 18, 2026 23:32

github-project-automation Bot moved this to Todo in CCCL May 18, 2026

fbusato requested a review from ericniebler May 18, 2026 23:32

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 18, 2026

coderabbitai Bot reviewed May 18, 2026


Comment thread libcudacxx/include/cuda/__simd/simd_intrinsics_array.h

Comment thread libcudacxx/include/cuda/std/__internal/features.h

Comment thread libcudacxx/test/simd_codegen/CMakeLists.txt

coderabbitai Bot reviewed May 19, 2026


fbusato added 4 commits May 19, 2026 15:40

cuda::std::simd Optimize small integer operations

82dd526

cuda::simd Add saturation_add

cbb03d9

cuda::simd Add abs_diff

f690b2e

cuda::simd Integer dot product

4e548e6

fbusato force-pushed the simd-cuda-idot branch from f08a379 to 4e548e6 Compare May 19, 2026 22:45

fbusato wants to merge 4 commits into NVIDIA:main from fbusato:simd-cuda-idot


miscco May 19, 2026
