
⚡ Thunderbolt: max_v2 — AVX2 4x unrolled horizontal reduction#32

Merged
bugparty merged 1 commit into main from thunderbolt/avx2-max-reduction-11079407733403411578 on Apr 24, 2026

Conversation

@bugparty (Owner) commented Apr 24, 2026

💡 What: Added max_v2, an AVX2-vectorized and 4x unrolled max reduction kernel.
🎯 Why: The naive scalar max reduction is bounded by a loop-carried dependency (comparing each element to the running maximum), leading to poor instruction-level parallelism (ILP).
🏗️ How:

  • Used _mm256_max_ps to process 8 floats per instruction.
  • Unrolled the main loop 4x to maintain 4 independent accumulator vectors, hiding latency and saturating execution ports.
  • Performed an in-register horizontal tree reduction using _mm_max_ps and shuffles at the end instead of extracting vector lanes to a scalar array.
📊 Impact:
  • Before: ~0.63 GFLOP/s (N=16,384,000)
  • After: ~2.93 GFLOP/s (N=16,384,000)
  • Speedup: ~4.6x

🖥️ Tested on: Haswell+ / Zen+ compatible CPUs (AVX2), GCC

🔬 How to reproduce:
make -C build -j4
DISABLE_CPU_BINDING=1 ./build/ml_kernels/ml_kernel_bench --filter max --sizes 16384000

PR created automatically by Jules for task 11079407733403411578 started by @bugparty

Summary by CodeRabbit

  • New Features

    • Added optimized maximum value computation achieving ~4.5x throughput improvement.
  • Documentation

    • Added guidance for array reduction optimization techniques and benchmarking results.
  • Tests

    • Extended benchmarking suite to validate new optimized functions.

Adds `max_v2` implementation in `max.h`, utilizing AVX2 SIMD with 4x unrolling to break loop-carried dependencies present in the naive version. Employs in-register horizontal tree reduction to avoid scalar extraction bottlenecks. Integrates the benchmark into `kernel_bench.cpp`, with a custom MaxBenchmarkBase for accurate GFLOPs reporting.

Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
@google-labs-jules (Contributor) commented

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@coderabbitai Bot commented Apr 24, 2026

📝 Walkthrough

Introduces AVX2-optimized max reduction implementation with corresponding documentation and benchmarking infrastructure. The max_v2 function processes float arrays using parallel vector accumulators and horizontal in-register reduction, achieving significant throughput improvements over scalar approaches.

Changes

  • Documentation (.jules/thunderbolt.md): Added guidance for AVX2 max reduction optimization with benchmark results showing ~4.5x throughput improvement using vectorization and multiple accumulators.
  • Core Implementation (ml_kernels/include/ml_kernels/max.h): New header defining the max_v2 function using AVX2 intrinsics: 4x-unrolled vector loads with independent accumulators, in-register horizontal max reduction via shuffle operations, plus scalar/vector remainder loops.
  • Benchmark Infrastructure (ml_kernels/src/kernel_bench.cpp): Refactored MaxBenchmark into a base class, added MaxV2Benchmark with pooled randomized inputs (~100 MB), reference computation, round-robin execution, and epsilon-based correctness verification (error ≤ 1e-6).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Four vectors dancing in perfect stride,
Shuffling maximums far and wide,
AVX2 whispers through the lanes,
Where throughput flows through silicon veins,
A speedy hop to victory's gain! ⚡

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: coverage is 18.18%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)
  • Description Check: skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: the title directly matches the main changeset, adding max_v2, an AVX2-optimized kernel with 4x loop unrolling and horizontal reduction.
  • Linked Issues Check: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: skipped because no linked issues were found for this pull request.


@coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
ml_kernels/src/kernel_bench.cpp (1)

453-462: Minor polish: redundant flops override and uninitialized members.

  • The flops(int n) override at Line 453–455 duplicates MaxBenchmarkBase::flops (Line 143) and can be removed. Same applies to the pre-existing MaxBenchmark::flops at Line 191 now that the base provides it.
  • result_, result_ref_, and pool_size_ (Lines 459–461) lack default initializers, unlike MaxBenchmark (Lines 195–197). Not a bug since setup() always runs first, but worth aligning.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d9b02119-1b94-47b3-9112-c037e1216acd

📥 Commits

Reviewing files that changed from the base of the PR and between 88205de and 55c046b.

📒 Files selected for processing (3)
  • .jules/thunderbolt.md
  • ml_kernels/include/ml_kernels/max.h
  • ml_kernels/src/kernel_bench.cpp

Comment thread .jules/thunderbolt.md
**Evidence:** Microbenchmarking `exp256_ps` independently with a 4x unroll loop showed Horner's evaluating in 419ms vs. Estrin's 548ms. Integrating this (`exp256_ps_v2`) into `softmax_v5` resulted in a ~13.8% speedup (5.1 GFLOP/s vs `softmax_v4`'s 4.48 GFLOP/s).

**Action:** When a loop is heavily unrolled to hide FMA latency, default to Horner's scheme rather than Estrin's to reduce instruction count and port pressure. Reserve Estrin's scheme for dependency-bound single-stream calculations. Always use `cvtps_epi32` over `round_ps` if the default MXCSR rounding mode (round-to-nearest) is acceptable.
## 2024-04-24 - AVX2 Max Reduction Optimization

⚠️ Potential issue | 🟡 Minor

Date is off by two years.

This PR was created on 2026-04-24, but the new entry is dated 2024-04-24.

-## 2024-04-24 - AVX2 Max Reduction Optimization
+## 2026-04-24 - AVX2 Max Reduction Optimization
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
## 2024-04-24 - AVX2 Max Reduction Optimization
## 2026-04-24 - AVX2 Max Reduction Optimization

Comment on lines +407 to +410

#include "ml_kernels/max.h"

class MaxV2Benchmark : public MaxBenchmarkBase {

🛠️ Refactor suggestion | 🟠 Major

Move #include and MaxV2Benchmark into the anonymous namespace at the top of the file.

The #include "ml_kernels/max.h" at Line 408 sits after main(), and MaxV2Benchmark (Line 410) is defined at global scope — all the other benchmarks (MaxBenchmark, SoftmaxV2Benchmark, SoftmaxV3Benchmark, …) live inside the anonymous namespace that closes at Line 335. It happens to compile because MaxBenchmarkBase is referenced via internal linkage and REGISTER_BENCHMARK does static-initialization registration, but the placement is inconsistent with every other registered benchmark and makes the file harder to navigate.

♻️ Suggested organization

Move the include up to the other kernel headers (near Line 16–18):

 #include "ml_kernels/naive_ops.h"
+#include "ml_kernels/max.h"
 #include "ml_kernels/relu.h"
 #include "ml_kernels/softmax.h"

Move the MaxV2Benchmark class and its REGISTER_BENCHMARK inside the anonymous namespace (e.g., right after REGISTER_BENCHMARK(MaxBenchmark); at Line 277), and delete the trailing block after main().

Also applies to: 464-464


}

private:
std::vector<std::vector<float>> inputs_;

⚠️ Potential issue | 🟡 Minor

inputs_ type drifts from MaxBenchmark; also missing bytes_accessed.

MaxBenchmark::inputs_ (Line 194) uses std::vector<AlignedBuffer<float>>, while MaxV2Benchmark::inputs_ uses std::vector<std::vector<float>>. max_v2 uses unaligned loads so this is correct, but the mismatched allocator/alignment makes the GFLOP/s comparison against max_naive less apples-to-apples, and it defeats the opportunity to pull inputs_ (plus setup/verify/teardown) up into MaxBenchmarkBase.

Separately, MaxBenchmark overrides bytes_accessed (Line 189) but MaxV2Benchmark does not, so the bandwidth column will report 0 for max_v2.

🛠️ Minimal fix
-    double flops(int n) const override {
-        return static_cast<double>(n); // 1 comparison per element
-    }
+    double bytes_accessed(int n) const override { return n * sizeof(float); }

 private:
-    std::vector<std::vector<float>> inputs_;
+    std::vector<AlignedBuffer<float>> inputs_;

Better still, hoist inputs_/result_/result_ref_/pool_size_/current_idx_ and the shared setup/verify/teardown/bytes_accessed into MaxBenchmarkBase so only name() and run() differ per variant.


@bugparty bugparty merged commit 9f350fa into main Apr 24, 2026
8 checks passed