Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions .jules/thunderbolt.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,3 +16,7 @@
**Evidence:** Microbenchmarking `exp256_ps` independently with a 4x unroll loop showed Horner's evaluating in 419ms vs. Estrin's 548ms. Integrating this (`exp256_ps_v2`) into `softmax_v5` resulted in a ~13.8% speedup (5.1 GFLOP/s vs `softmax_v4`'s 4.48 GFLOP/s).

**Action:** When a loop is heavily unrolled to hide FMA latency, default to Horner's scheme rather than Estrin's to reduce instruction count and port pressure. Reserve Estrin's scheme for dependency-bound single-stream calculations. Always use `cvtps_epi32` over `round_ps` if the default MXCSR rounding mode (round-to-nearest) is acceptable.
## 2024-04-24 - AVX2 Max Reduction Optimization
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Date is off by two years.

This PR was created on 2026-04-24, but the new entry is dated 2024-04-24.

-## 2024-04-24 - AVX2 Max Reduction Optimization
+## 2026-04-24 - AVX2 Max Reduction Optimization
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
## 2024-04-24 - AVX2 Max Reduction Optimization
## 2026-04-24 - AVX2 Max Reduction Optimization
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.jules/thunderbolt.md at line 19, The entry header "## 2024-04-24 - AVX2 Max
Reduction Optimization" has the wrong year; update the date to the PR creation
date by changing that header to "## 2026-04-24 - AVX2 Max Reduction
Optimization" so the thunderbolt changelog reflects the correct 2026-04-24 date.

**Learning:** The naive scalar max reduction (`max_naive`) suffers from a strict loop-carried dependency (each element must be compared to the running `current_max`), which limits ILP and severely restricts throughput. By unrolling the loop 4x and vectorizing with AVX2 (`_mm256_max_ps`), multiple independent accumulators can be maintained, allowing the processor to compute 32 elements per loop iteration. The final result can be determined efficiently via an in-register horizontal reduction instead of sequentially extracting elements.
**Evidence:** The benchmark `max_v2` achieved ~2.8-2.9 GFLOP/s vs `max_naive`'s ~0.63 GFLOP/s on N=16384000 (a ~4.5x speedup), confirming that breaking the dependency chain hides execution latency.
**Action:** When implementing scalar reductions (e.g., max, sum) over large arrays, prioritize vectorization with 4x-8x unrolling and multiple independent accumulators to break latency bounds, then merge the accumulators via tree-reduction at the end of the hot loop.
62 changes: 62 additions & 0 deletions ml_kernels/include/ml_kernels/max.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
#pragma once

#include <cstddef>
#include <limits>
#include <immintrin.h>
#include <algorithm>

namespace ml_kernels {

// ⚡ Thunderbolt: AVX2 Vectorized Max Reduction
// Target: AVX2 (Haswell+)
// Reason: The naive scalar max reduction (max_naive) is bottlenecked by a loop-carried dependency and low ILP.
// Vectorizing it with AVX2 and unrolling 4x allows 32 elements to be processed per iteration across multiple execution ports.
// The final reduction is done efficiently in-register using shuffles, avoiding a scalar extraction loop.
// Expected gain: ~4-5x throughput vs max_naive.
inline float max_v2(const float *input, std::size_t n) {
if (n == 0) return 0.0f;

std::size_t i = 0;
__m256 max_v = _mm256_set1_ps(std::numeric_limits<float>::lowest());
__m256 max0 = max_v, max1 = max_v, max2 = max_v, max3 = max_v;

// Unroll 4x for 32 elements per iteration
for (; i + 31 < n; i += 32) {
max0 = _mm256_max_ps(max0, _mm256_loadu_ps(input + i));
max1 = _mm256_max_ps(max1, _mm256_loadu_ps(input + i + 8));
max2 = _mm256_max_ps(max2, _mm256_loadu_ps(input + i + 16));
max3 = _mm256_max_ps(max3, _mm256_loadu_ps(input + i + 24));
}

// Reduce the 4 vectors into 1
max0 = _mm256_max_ps(max0, max1);
max2 = _mm256_max_ps(max2, max3);
max0 = _mm256_max_ps(max0, max2);

// Remainder loop for multiples of 8 elements
for (; i + 7 < n; i += 8) {
max0 = _mm256_max_ps(max0, _mm256_loadu_ps(input + i));
}

// In-register horizontal reduction
__m128 lo = _mm256_castps256_ps128(max0);
__m128 hi = _mm256_extractf128_ps(max0, 1);
lo = _mm_max_ps(lo, hi);

__m128 shuf = _mm_shuffle_ps(lo, lo, _MM_SHUFFLE(2, 3, 0, 1));
lo = _mm_max_ps(lo, shuf);
shuf = _mm_shuffle_ps(lo, lo, _MM_SHUFFLE(1, 0, 3, 2));
lo = _mm_max_ps(lo, shuf);

float max_val = _mm_cvtss_f32(lo);

// Scalar epilogue
for (; i < n; ++i) {
if (input[i] > max_val) {
max_val = input[i];
}
}
return max_val;
}

} // namespace ml_kernels
65 changes: 64 additions & 1 deletion ml_kernels/src/kernel_bench.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -138,7 +138,12 @@ REGISTER_RELU_BENCHMARK(relu_v2_6);
REGISTER_RELU_BENCHMARK(relu_v2_7);
REGISTER_RELU_BENCHMARK(relu_v2_8);

class MaxBenchmark : public BenchmarkBase {
class MaxBenchmarkBase : public BenchmarkBase {
public:
double flops(int n) const override { return static_cast<double>(n); }
};

class MaxBenchmark : public MaxBenchmarkBase {
public:
const char *name() const override { return "max_naive"; }

Expand Down Expand Up @@ -399,3 +404,61 @@ int main(int argc, char **argv) {

return 0;
}

#include "ml_kernels/max.h"

class MaxV2Benchmark : public MaxBenchmarkBase {
Comment on lines +407 to +410
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Move #include and MaxV2Benchmark into the anonymous namespace at the top of the file.

The #include "ml_kernels/max.h" at Line 408 sits after main(), and MaxV2Benchmark (Line 410) is defined at global scope — all the other benchmarks (MaxBenchmark, SoftmaxV2Benchmark, SoftmaxV3Benchmark, …) live inside the anonymous namespace that closes at Line 335. It happens to compile because MaxBenchmarkBase is referenced via internal linkage and REGISTER_BENCHMARK does static-initialization registration, but the placement is inconsistent with every other registered benchmark and makes the file harder to navigate.

♻️ Suggested organization

Move the include up to the other kernel headers (near Line 16–18):

 `#include` "ml_kernels/naive_ops.h"
+#include "ml_kernels/max.h"
 `#include` "ml_kernels/relu.h"
 `#include` "ml_kernels/softmax.h"

Move the MaxV2Benchmark class and its REGISTER_BENCHMARK inside the anonymous namespace (e.g., right after REGISTER_BENCHMARK(MaxBenchmark); at Line 277), and delete the trailing block after main().

Also applies to: 464-464

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ml_kernels/src/kernel_bench.cpp` around lines 407 - 410, Move the stray
`#include` "ml_kernels/max.h" and the MaxV2Benchmark class (which inherits
MaxBenchmarkBase) along with its REGISTER_BENCHMARK(...) call into the existing
anonymous namespace where the other benchmarks live (e.g., place them right
after REGISTER_BENCHMARK(MaxBenchmark)); remove the duplicate/trailing
MaxV2Benchmark block after main(). Ensure the include is relocated up with the
other kernel headers and that MaxV2Benchmark and its registration use the
internal anonymous namespace so linkage and organization match the other
benchmarks.

public:
const char *name() const override { return "max_v2"; }

void setup(int n) override {
size_t bytes_per_iteration = n * sizeof(float);
size_t target_pool_bytes = 100ULL * 1024 * 1024;
pool_size_ = g_use_pool ? std::max<std::size_t>(1, target_pool_bytes / bytes_per_iteration) : 1;

inputs_.resize(pool_size_);
std::mt19937 rng(12345);
std::uniform_real_distribution<float> dist(-4.0f, 4.0f);
for (std::size_t i = 0; i < pool_size_; ++i) {
inputs_[i].resize(n);
for (float &value : inputs_[i]) {
value = dist(rng);
}
}

result_ref_ = inputs_[0].size() == 0
? 0.0f
: *std::max_element(inputs_[0].begin(), inputs_[0].end());
result_ = 0.0f;
current_idx_ = 0;
}

void run() override {
result_ = ml_kernels::max_v2(inputs_[current_idx_].data(), inputs_[current_idx_].size());
current_idx_ = (current_idx_ + 1) % pool_size_;
}

bool verify() override {
current_idx_ = 0;
run();
return std::fabs(result_ - result_ref_) <= 1e-6f;
}

void teardown() override {
inputs_.clear();
result_ = 0.0f;
result_ref_ = 0.0f;
}

double flops(int n) const override {
return static_cast<double>(n); // 1 comparison per element
}

private:
std::vector<std::vector<float>> inputs_;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

inputs_ type drifts from MaxBenchmark; also missing bytes_accessed.

MaxBenchmark::inputs_ (Line 194) uses std::vector<AlignedBuffer<float>>, while MaxV2Benchmark::inputs_ uses std::vector<std::vector<float>>. max_v2 uses unaligned loads so this is correct, but the mismatched allocator/alignment makes the GFLOP/s comparison against max_naive less apples-to-apples, and it defeats the opportunity to pull inputs_ (plus setup/verify/teardown) up into MaxBenchmarkBase.

Separately, MaxBenchmark overrides bytes_accessed (Line 189) but MaxV2Benchmark does not, so the bandwidth column will report 0 for max_v2.

🛠️ Minimal fix
-    double flops(int n) const override {
-        return static_cast<double>(n); // 1 comparison per element
-    }
+    double bytes_accessed(int n) const override { return n * sizeof(float); }

 private:
-    std::vector<std::vector<float>> inputs_;
+    std::vector<AlignedBuffer<float>> inputs_;

Better still, hoist inputs_/result_/result_ref_/pool_size_/current_idx_ and the shared setup/verify/teardown/bytes_accessed into MaxBenchmarkBase so only name() and run() differ per variant.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ml_kernels/src/kernel_bench.cpp` at line 458, MaxV2Benchmark currently
declares inputs_ as std::vector<std::vector<float>> and omits bytes_accessed,
causing alignment/allocator mismatch with MaxBenchmark (which uses
std::vector<AlignedBuffer<float>>) and a missing bandwidth report; change
MaxV2Benchmark to use the same inputs_ type as MaxBenchmark
(std::vector<AlignedBuffer<float>>) and implement/override bytes_accessed()
consistently; better yet, move shared members (inputs_, result_, result_ref_,
pool_size_, current_idx_) and shared lifecycle methods (setup(), verify(),
teardown(), bytes_accessed()) into MaxBenchmarkBase and leave only name() and
run() implemented in MaxBenchmark and MaxV2Benchmark so both variants use the
same aligned buffers and bandwidth accounting.

float result_;
float result_ref_;
std::size_t pool_size_;
std::size_t current_idx_ = 0;
};
REGISTER_BENCHMARK(MaxV2Benchmark);
Loading