⚡ Thunderbolt: max_v2 — AVX2 4x unrolled horizontal reduction#32
Conversation
Adds `max_v2` implementation in `max.h`, utilizing AVX2 SIMD with 4x unrolling to break loop-carried dependencies present in the naive version. Employs in-register horizontal tree reduction to avoid scalar extraction bottlenecks. Integrates the benchmark into `kernel_bench.cpp`, with a custom MaxBenchmarkBase for accurate GFLOPs reporting. Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
📝 WalkthroughWalkthroughIntroduces AVX2-optimized max reduction implementation with corresponding documentation and benchmarking infrastructure. The Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
ml_kernels/src/kernel_bench.cpp (1)
453-462: Minor polish: redundantflopsoverride and uninitialized members.
- The
flops(int n)override at Line 453–455 duplicatesMaxBenchmarkBase::flops(Line 143) and can be removed. Same applies to the pre-existingMaxBenchmark::flopsat Line 191 now that the base provides it.result_,result_ref_, andpool_size_(Lines 459–461) lack default initializers, unlikeMaxBenchmark(Lines 195–197). Not a bug sincesetup()always runs first, but worth aligning.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ml_kernels/src/kernel_bench.cpp` around lines 453 - 462, Remove the redundant flops override(s): delete the flops(int n) implementation in this class (and the duplicated MaxBenchmark::flops if present) so the class inherits MaxBenchmarkBase::flops; also add default member initializers for result_, result_ref_ (e.g. = 0.0f) and pool_size_ (e.g. = 0) to match MaxBenchmark's initialization and avoid uninitialized fields—reference the flops method and the members result_, result_ref_, and pool_size_ when making the changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.jules/thunderbolt.md:
- Line 19: The entry header "## 2024-04-24 - AVX2 Max Reduction Optimization"
has the wrong year; update the date to the PR creation date by changing that
header to "## 2026-04-24 - AVX2 Max Reduction Optimization" so the thunderbolt
changelog reflects the correct 2026-04-24 date.
In `@ml_kernels/src/kernel_bench.cpp`:
- Around line 407-410: Move the stray `#include` "ml_kernels/max.h" and the
MaxV2Benchmark class (which inherits MaxBenchmarkBase) along with its
REGISTER_BENCHMARK(...) call into the existing anonymous namespace where the
other benchmarks live (e.g., place them right after
REGISTER_BENCHMARK(MaxBenchmark)); remove the duplicate/trailing MaxV2Benchmark
block after main(). Ensure the include is relocated up with the other kernel
headers and that MaxV2Benchmark and its registration use the internal anonymous
namespace so linkage and organization match the other benchmarks.
- Line 458: MaxV2Benchmark currently declares inputs_ as
std::vector<std::vector<float>> and omits bytes_accessed, causing
alignment/allocator mismatch with MaxBenchmark (which uses
std::vector<AlignedBuffer<float>>) and a missing bandwidth report; change
MaxV2Benchmark to use the same inputs_ type as MaxBenchmark
(std::vector<AlignedBuffer<float>>) and implement/override bytes_accessed()
consistently; better yet, move shared members (inputs_, result_, result_ref_,
pool_size_, current_idx_) and shared lifecycle methods (setup(), verify(),
teardown(), bytes_accessed()) into MaxBenchmarkBase and leave only name() and
run() implemented in MaxBenchmark and MaxV2Benchmark so both variants use the
same aligned buffers and bandwidth accounting.
---
Nitpick comments:
In `@ml_kernels/src/kernel_bench.cpp`:
- Around line 453-462: Remove the redundant flops override(s): delete the
flops(int n) implementation in this class (and the duplicated
MaxBenchmark::flops if present) so the class inherits MaxBenchmarkBase::flops;
also add default member initializers for result_, result_ref_ (e.g. = 0.0f) and
pool_size_ (e.g. = 0) to match MaxBenchmark's initialization and avoid
uninitialized fields—reference the flops method and the members result_,
result_ref_, and pool_size_ when making the changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d9b02119-1b94-47b3-9112-c037e1216acd
📒 Files selected for processing (3)
.jules/thunderbolt.mdml_kernels/include/ml_kernels/max.hml_kernels/src/kernel_bench.cpp
| **Evidence:** Microbenchmarking `exp256_ps` independently with a 4x unroll loop showed Horner's evaluating in 419ms vs. Estrin's 548ms. Integrating this (`exp256_ps_v2`) into `softmax_v5` resulted in a ~13.8% speedup (5.1 GFLOP/s vs `softmax_v4`'s 4.48 GFLOP/s). | ||
|
|
||
| **Action:** When a loop is heavily unrolled to hide FMA latency, default to Horner's scheme rather than Estrin's to reduce instruction count and port pressure. Reserve Estrin's scheme for dependency-bound single-stream calculations. Always use `cvtps_epi32` over `round_ps` if the default MXCSR rounding mode (round-to-nearest) is acceptable. | ||
| ## 2024-04-24 - AVX2 Max Reduction Optimization |
There was a problem hiding this comment.
Date is off by two years.
This PR was created on 2026-04-24, but the new entry is dated 2024-04-24.
-## 2024-04-24 - AVX2 Max Reduction Optimization
+## 2026-04-24 - AVX2 Max Reduction Optimization📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ## 2024-04-24 - AVX2 Max Reduction Optimization | |
| ## 2026-04-24 - AVX2 Max Reduction Optimization |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.jules/thunderbolt.md at line 19, The entry header "## 2024-04-24 - AVX2 Max
Reduction Optimization" has the wrong year; update the date to the PR creation
date by changing that header to "## 2026-04-24 - AVX2 Max Reduction
Optimization" so the thunderbolt changelog reflects the correct 2026-04-24 date.
|
|
||
| #include "ml_kernels/max.h" | ||
|
|
||
| class MaxV2Benchmark : public MaxBenchmarkBase { |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Move #include and MaxV2Benchmark into the anonymous namespace at the top of the file.
The #include "ml_kernels/max.h" at Line 408 sits after main(), and MaxV2Benchmark (Line 410) is defined at global scope — all the other benchmarks (MaxBenchmark, SoftmaxV2Benchmark, SoftmaxV3Benchmark, …) live inside the anonymous namespace that closes at Line 335. It happens to compile because MaxBenchmarkBase is referenced via internal linkage and REGISTER_BENCHMARK does static-initialization registration, but the placement is inconsistent with every other registered benchmark and makes the file harder to navigate.
♻️ Suggested organization
Move the include up to the other kernel headers (near Line 16–18):
`#include` "ml_kernels/naive_ops.h"
+#include "ml_kernels/max.h"
`#include` "ml_kernels/relu.h"
`#include` "ml_kernels/softmax.h"Move the MaxV2Benchmark class and its REGISTER_BENCHMARK inside the anonymous namespace (e.g., right after REGISTER_BENCHMARK(MaxBenchmark); at Line 277), and delete the trailing block after main().
Also applies to: 464-464
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ml_kernels/src/kernel_bench.cpp` around lines 407 - 410, Move the stray
`#include` "ml_kernels/max.h" and the MaxV2Benchmark class (which inherits
MaxBenchmarkBase) along with its REGISTER_BENCHMARK(...) call into the existing
anonymous namespace where the other benchmarks live (e.g., place them right
after REGISTER_BENCHMARK(MaxBenchmark)); remove the duplicate/trailing
MaxV2Benchmark block after main(). Ensure the include is relocated up with the
other kernel headers and that MaxV2Benchmark and its registration use the
internal anonymous namespace so linkage and organization match the other
benchmarks.
| } | ||
|
|
||
| private: | ||
| std::vector<std::vector<float>> inputs_; |
There was a problem hiding this comment.
inputs_ type drifts from MaxBenchmark; also missing bytes_accessed.
MaxBenchmark::inputs_ (Line 194) uses std::vector<AlignedBuffer<float>>, while MaxV2Benchmark::inputs_ uses std::vector<std::vector<float>>. max_v2 uses unaligned loads so this is correct, but the mismatched allocator/alignment makes the GFLOP/s comparison against max_naive less apples-to-apples, and it defeats the opportunity to pull inputs_ (plus setup/verify/teardown) up into MaxBenchmarkBase.
Separately, MaxBenchmark overrides bytes_accessed (Line 189) but MaxV2Benchmark does not, so the bandwidth column will report 0 for max_v2.
🛠️ Minimal fix
- double flops(int n) const override {
- return static_cast<double>(n); // 1 comparison per element
- }
+ double bytes_accessed(int n) const override { return n * sizeof(float); }
private:
- std::vector<std::vector<float>> inputs_;
+ std::vector<AlignedBuffer<float>> inputs_;Better still, hoist inputs_/result_/result_ref_/pool_size_/current_idx_ and the shared setup/verify/teardown/bytes_accessed into MaxBenchmarkBase so only name() and run() differ per variant.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ml_kernels/src/kernel_bench.cpp` at line 458, MaxV2Benchmark currently
declares inputs_ as std::vector<std::vector<float>> and omits bytes_accessed,
causing alignment/allocator mismatch with MaxBenchmark (which uses
std::vector<AlignedBuffer<float>>) and a missing bandwidth report; change
MaxV2Benchmark to use the same inputs_ type as MaxBenchmark
(std::vector<AlignedBuffer<float>>) and implement/override bytes_accessed()
consistently; better yet, move shared members (inputs_, result_, result_ref_,
pool_size_, current_idx_) and shared lifecycle methods (setup(), verify(),
teardown(), bytes_accessed()) into MaxBenchmarkBase and leave only name() and
run() implemented in MaxBenchmark and MaxV2Benchmark so both variants use the
same aligned buffers and bandwidth accounting.
💡 What: Added
max_v2, an AVX2-vectorized and 4x unrolled max reduction kernel.🎯 Why: The naive scalar max reduction is bounded by a loop-carried dependency (comparing each element to the running maximum), leading to poor instruction-level parallelism (ILP).
🏗️ How:
_mm256_max_psto process 8 elements per cycle._mm_max_psand shuffles at the end instead of extracting vector lanes to a scalar array.📊 Impact:
🖥️ Tested on: Haswell+ / Zen+ compat (AVX2), GNU Compiler
🔬 How to reproduce:
PR created automatically by Jules for task 11079407733403411578 started by @bugparty
Summary by CodeRabbit
New Features
Documentation
Tests