
⚡ Thunderbolt: max_v2 — AVX2 4x unrolled horizontal reduction#32

Merged
bugparty merged 1 commit into main from thunderbolt/avx2-max-reduction-11079407733403411578 on Apr 24, 2026

Conversation

@bugparty (Owner) commented Apr 24, 2026

💡 What: Added max_v2, an AVX2-vectorized and 4x unrolled max reduction kernel.
🎯 Why: The naive scalar max reduction is bounded by a loop-carried dependency (comparing each element to the running maximum), leading to poor instruction-level parallelism (ILP).
🏗️ How:

  • Used _mm256_max_ps to process 8 floats per instruction.
  • Unrolled the main loop 4x to maintain 4 independent accumulator vectors, hiding latency and saturating execution ports.
  • Performed an in-register horizontal tree reduction using _mm_max_ps and shuffles at the end instead of extracting vector lanes to a scalar array.
📊 Impact:
  • Before: ~0.63 GFLOP/s (N=16,384,000)
  • After: ~2.93 GFLOP/s (N=16,384,000)
  • Speedup: ~4.6x

🖥️ Tested on: Haswell+ / Zen+ compatible CPUs (AVX2), GCC

🔬 How to reproduce:
make -C build -j4
DISABLE_CPU_BINDING=1 ./build/ml_kernels/ml_kernel_bench --filter max --sizes 16384000

PR created automatically by Jules for task 11079407733403411578 started by @bugparty

Summary by CodeRabbit

  • New Features

    • Added optimized maximum value computation achieving ~4.5x throughput improvement.
  • Documentation

    • Added guidance for array reduction optimization techniques and benchmarking results.
  • Tests

    • Extended benchmarking suite to validate new optimized functions.

Adds `max_v2` implementation in `max.h`, utilizing AVX2 SIMD with 4x unrolling to break loop-carried dependencies present in the naive version. Employs in-register horizontal tree reduction to avoid scalar extraction bottlenecks. Integrates the benchmark into `kernel_bench.cpp`, with a custom MaxBenchmarkBase for accurate GFLOPs reporting.

Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
@google-labs-jules (Contributor) commented

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@coderabbitai Bot commented Apr 24, 2026

📝 Walkthrough

Introduces AVX2-optimized max reduction implementation with corresponding documentation and benchmarking infrastructure. The max_v2 function processes float arrays using parallel vector accumulators and horizontal in-register reduction, achieving significant throughput improvements over scalar approaches.

Changes

  • Documentation (.jules/thunderbolt.md): Added guidance for AVX2 max reduction optimization with benchmark results showing ~4.5x throughput improvement using vectorization and multiple accumulators.
  • Core Implementation (ml_kernels/include/ml_kernels/max.h): New header defining the max_v2 function using AVX2 intrinsics: 4x-unrolled vector loads with independent accumulators, in-register horizontal max reduction via shuffle operations, plus scalar/vector remainder loops.
  • Benchmark Infrastructure (ml_kernels/src/kernel_bench.cpp): Refactored MaxBenchmark into a base class, added MaxV2Benchmark with pooled randomized inputs (~100 MB), reference computation, round-robin execution, and epsilon-based correctness verification (error ≤ 1e-6).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Four vectors dancing in perfect stride,
Shuffling maximums far and wide,
AVX2 whispers through the lanes,
Where throughput flows through silicon veins,
A speedy hop to victory's gain! ⚡

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: coverage is 18.18%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)
  • Description Check: skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: the title directly matches the main changeset, adding max_v2, an AVX2-optimized kernel with 4x loop unrolling and horizontal reduction.
  • Linked Issues Check: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: skipped because no linked issues were found for this pull request.


@coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
ml_kernels/src/kernel_bench.cpp (1)

453-462: Minor polish: redundant flops override and uninitialized members.

  • The flops(int n) override at Line 453–455 duplicates MaxBenchmarkBase::flops (Line 143) and can be removed. Same applies to the pre-existing MaxBenchmark::flops at Line 191 now that the base provides it.
  • result_, result_ref_, and pool_size_ (Lines 459–461) lack default initializers, unlike MaxBenchmark (Lines 195–197). Not a bug since setup() always runs first, but worth aligning.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d9b02119-1b94-47b3-9112-c037e1216acd

📥 Commits

Reviewing files that changed from the base of the PR and between 88205de and 55c046b.

📒 Files selected for processing (3)
  • .jules/thunderbolt.md
  • ml_kernels/include/ml_kernels/max.h
  • ml_kernels/src/kernel_bench.cpp

Comment thread .jules/thunderbolt.md
**Evidence:** Microbenchmarking `exp256_ps` independently with a 4x unroll loop showed Horner's evaluating in 419ms vs. Estrin's 548ms. Integrating this (`exp256_ps_v2`) into `softmax_v5` resulted in a ~13.8% speedup (5.1 GFLOP/s vs `softmax_v4`'s 4.48 GFLOP/s).

**Action:** When a loop is heavily unrolled to hide FMA latency, default to Horner's scheme rather than Estrin's to reduce instruction count and port pressure. Reserve Estrin's scheme for dependency-bound single-stream calculations. Always use `cvtps_epi32` over `round_ps` if the default MXCSR rounding mode (round-to-nearest) is acceptable.
## 2024-04-24 - AVX2 Max Reduction Optimization

⚠️ Potential issue | 🟡 Minor

Date is off by two years.

This PR was created on 2026-04-24, but the new entry is dated 2024-04-24.

-## 2024-04-24 - AVX2 Max Reduction Optimization
+## 2026-04-24 - AVX2 Max Reduction Optimization
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
## 2024-04-24 - AVX2 Max Reduction Optimization
## 2026-04-24 - AVX2 Max Reduction Optimization

Comment on lines +407 to +410

#include "ml_kernels/max.h"

class MaxV2Benchmark : public MaxBenchmarkBase {

🛠️ Refactor suggestion | 🟠 Major

Move #include and MaxV2Benchmark into the anonymous namespace at the top of the file.

The #include "ml_kernels/max.h" at Line 408 sits after main(), and MaxV2Benchmark (Line 410) is defined at global scope — all the other benchmarks (MaxBenchmark, SoftmaxV2Benchmark, SoftmaxV3Benchmark, …) live inside the anonymous namespace that closes at Line 335. It happens to compile because MaxBenchmarkBase is referenced via internal linkage and REGISTER_BENCHMARK does static-initialization registration, but the placement is inconsistent with every other registered benchmark and makes the file harder to navigate.

♻️ Suggested organization

Move the include up to the other kernel headers (near Line 16–18):

 #include "ml_kernels/naive_ops.h"
+#include "ml_kernels/max.h"
 #include "ml_kernels/relu.h"
 #include "ml_kernels/softmax.h"

Move the MaxV2Benchmark class and its REGISTER_BENCHMARK inside the anonymous namespace (e.g., right after REGISTER_BENCHMARK(MaxBenchmark); at Line 277), and delete the trailing block after main().

Also applies to: 464-464


}

private:
std::vector<std::vector<float>> inputs_;

⚠️ Potential issue | 🟡 Minor

inputs_ type drifts from MaxBenchmark; also missing bytes_accessed.

MaxBenchmark::inputs_ (Line 194) uses std::vector<AlignedBuffer<float>>, while MaxV2Benchmark::inputs_ uses std::vector<std::vector<float>>. max_v2 uses unaligned loads so this is correct, but the mismatched allocator/alignment makes the GFLOP/s comparison against max_naive less apples-to-apples, and it defeats the opportunity to pull inputs_ (plus setup/verify/teardown) up into MaxBenchmarkBase.

Separately, MaxBenchmark overrides bytes_accessed (Line 189) but MaxV2Benchmark does not, so the bandwidth column will report 0 for max_v2.

🛠️ Minimal fix
-    double flops(int n) const override {
-        return static_cast<double>(n); // 1 comparison per element
-    }
+    double bytes_accessed(int n) const override { return n * sizeof(float); }

 private:
-    std::vector<std::vector<float>> inputs_;
+    std::vector<AlignedBuffer<float>> inputs_;

Better still, hoist inputs_/result_/result_ref_/pool_size_/current_idx_ and the shared setup/verify/teardown/bytes_accessed into MaxBenchmarkBase so only name() and run() differ per variant.


@bugparty bugparty merged commit 9f350fa into main Apr 24, 2026
8 checks passed