7 changes: 7 additions & 0 deletions .jules/thunderbolt.md
@@ -9,3 +9,10 @@
**Learning:** When writing AVX2 softmax kernels, standard vector loops can be bottlenecked by instruction latency, particularly for heavy exponential approximations (`exp256_ps`). Unrolling the loop by 4x and using independent accumulators hides this latency. Furthermore, performing the final sum/max reduction in-register via `_mm256_permute2f128_ps` and `_mm256_shuffle_ps` completely eliminates the scalar bottleneck of extracting to an array, improving throughput significantly.
**Evidence:** `softmax_v3` achieved 5.79 GFLOP/s vs `softmax_v2` at 4.15 GFLOP/s on N=4096 (Fixed Memory mode), a ~39% performance gain.
**Action:** Default to 4x unrolling and in-register horizontal reductions via shuffle when writing latency-bound AVX2 map-reduce kernels, instead of vectorizing single loop iterations and dropping to scalar array reductions at the end.
## 2024-10-25 - AVX2 Softmax Unrolling vs. Polynomial Evaluation Schemes

**Learning:** When vectorizing transcendental functions like `exp` in AVX2, breaking the FMA latency chain can be done via Estrin's scheme. However, if the main loop is already explicitly unrolled (e.g., 4x) to interleave multiple independent FMA streams, Horner's method actually outperforms Estrin's. Estrin's scheme creates higher execution port pressure, whereas 4x interleaved Horner's chains saturate the execution units perfectly while naturally hiding the latency. Additionally, replacing the high-latency `_mm256_round_ps` instruction with the sequence `_mm256_cvtepi32_ps(_mm256_cvtps_epi32(x))` achieves round-to-nearest-even with lower latency, improving throughput in range reduction.

**Evidence:** Microbenchmarking `exp256_ps` independently with a 4x unroll loop showed Horner's evaluating in 419ms vs. Estrin's 548ms. Integrating this (`exp256_ps_v2`) into `softmax_v5` resulted in a ~13.8% speedup (5.1 GFLOP/s vs `softmax_v4`'s 4.48 GFLOP/s).

**Action:** When a loop is heavily unrolled to hide FMA latency, default to Horner's scheme rather than Estrin's to reduce instruction count and port pressure. Reserve Estrin's scheme for dependency-bound single-stream calculations. Always use `cvtps_epi32` over `round_ps` if the default MXCSR rounding mode (round-to-nearest) is acceptable.
138 changes: 138 additions & 0 deletions ml_kernels/include/ml_kernels/softmax.h
@@ -363,4 +363,142 @@ inline void softmax_v4(const float *input, float *output, std::size_t n) {
}
}

inline __m256 exp256_ps_v2(__m256 x) {
    // Clamp: below ~-87.3, expf underflows the float range
    x = _mm256_max_ps(x, _mm256_set1_ps(-87.3f));
    __m256 x_log2e = _mm256_mul_ps(x, _mm256_set1_ps(1.4426950408889634f));

    // cvtps_epi32 rounds via MXCSR (round-to-nearest-even by default),
    // avoiding the higher-latency round_ps
    __m256i n_int = _mm256_cvtps_epi32(x_log2e);
    __m256 n = _mm256_cvtepi32_ps(n_int);

    // r = x - n*ln2, with ln2 split into hi/lo parts for accuracy
    __m256 r = _mm256_fnmadd_ps(n, _mm256_set1_ps(0.693145751953125f), x);
    r = _mm256_fnmadd_ps(n, _mm256_set1_ps(1.428606765330187e-06f), r);

    // Degree-5 polynomial for exp(r), Horner's scheme instead of Estrin's
    __m256 c1 = _mm256_set1_ps(1.0f);
    __m256 c2 = _mm256_set1_ps(1.0f / 2.0f);
    __m256 c3 = _mm256_set1_ps(1.0f / 6.0f);
    __m256 c4 = _mm256_set1_ps(1.0f / 24.0f);
    __m256 c5 = _mm256_set1_ps(1.0f / 120.0f);

    __m256 p = _mm256_fmadd_ps(c5, r, c4);
    p = _mm256_fmadd_ps(p, r, c3);
    p = _mm256_fmadd_ps(p, r, c2);
    p = _mm256_fmadd_ps(p, r, c1);
    p = _mm256_fmadd_ps(p, r, c1);  // p = 1 + r + r^2/2 + ... + r^5/120

    // Reconstruct 2^n by building the float exponent field directly
    __m256i exp_shift = _mm256_add_epi32(n_int, _mm256_set1_epi32(127));
    __m256i exp_shifted = _mm256_slli_epi32(exp_shift, 23);
    __m256 exp2n = _mm256_castsi256_ps(exp_shifted);

    return _mm256_mul_ps(p, exp2n);
}

// ⚡ Thunderbolt: AVX2 Vectorized Softmax with FMA-optimized exp256
// Target: AVX2 (Haswell+)
// Reason: Avoids `round_ps` by leveraging `cvtps_epi32` rounding mode, and replaces Estrin's scheme with Horner's.
// When unrolled 4x, the independent Horner chains interleave perfectly, saturating execution ports and hiding latency better than Estrin, leading to higher throughput.
// Expected gain: ~15-25% over softmax_v4.
inline void softmax_v5(const float *input, float *output, std::size_t n) {
    if (n == 0) return;

    // 1. Find max
    std::size_t i = 0;
    __m256 max_v = _mm256_set1_ps(std::numeric_limits<float>::lowest());
    __m256 max0 = max_v, max1 = max_v, max2 = max_v, max3 = max_v;

    for (; i + 31 < n; i += 32) {
        max0 = _mm256_max_ps(max0, _mm256_loadu_ps(input + i));
        max1 = _mm256_max_ps(max1, _mm256_loadu_ps(input + i + 8));
        max2 = _mm256_max_ps(max2, _mm256_loadu_ps(input + i + 16));
        max3 = _mm256_max_ps(max3, _mm256_loadu_ps(input + i + 24));
    }
    max0 = _mm256_max_ps(max0, max1);
    max2 = _mm256_max_ps(max2, max3);
    max0 = _mm256_max_ps(max0, max2);
    for (; i + 7 < n; i += 8) {
        max0 = _mm256_max_ps(max0, _mm256_loadu_ps(input + i));
    }
    float max_val = reduce_max(max0);
    for (; i < n; ++i) max_val = std::max(max_val, input[i]);

    __m256 max_vec = _mm256_set1_ps(max_val);

    // 2. Compute exp and sum
    i = 0;
    __m256 sum0 = _mm256_setzero_ps();
    __m256 sum1 = _mm256_setzero_ps();
    __m256 sum2 = _mm256_setzero_ps();
    __m256 sum3 = _mm256_setzero_ps();

    for (; i + 31 < n; i += 32) {
        __m256 x0 = _mm256_sub_ps(_mm256_loadu_ps(input + i), max_vec);
        __m256 x1 = _mm256_sub_ps(_mm256_loadu_ps(input + i + 8), max_vec);
        __m256 x2 = _mm256_sub_ps(_mm256_loadu_ps(input + i + 16), max_vec);
        __m256 x3 = _mm256_sub_ps(_mm256_loadu_ps(input + i + 24), max_vec);

        __m256 e0 = exp256_ps_v2(x0);
        __m256 e1 = exp256_ps_v2(x1);
        __m256 e2 = exp256_ps_v2(x2);
        __m256 e3 = exp256_ps_v2(x3);

        _mm256_storeu_ps(output + i, e0);
        _mm256_storeu_ps(output + i + 8, e1);
        _mm256_storeu_ps(output + i + 16, e2);
        _mm256_storeu_ps(output + i + 24, e3);

        sum0 = _mm256_add_ps(sum0, e0);
        sum1 = _mm256_add_ps(sum1, e1);
        sum2 = _mm256_add_ps(sum2, e2);
        sum3 = _mm256_add_ps(sum3, e3);
    }
    sum0 = _mm256_add_ps(sum0, sum1);
    sum2 = _mm256_add_ps(sum2, sum3);
    sum0 = _mm256_add_ps(sum0, sum2);

    for (; i + 7 < n; i += 8) {
        __m256 x = _mm256_loadu_ps(input + i);
        __m256 e = exp256_ps_v2(_mm256_sub_ps(x, max_vec));
        _mm256_storeu_ps(output + i, e);
        sum0 = _mm256_add_ps(sum0, e);
    }

    float sum_val = reduce_sum(sum0);
    for (; i < n; ++i) {
        float e = std::exp(input[i] - max_val);
        output[i] = e;
        sum_val += e;
    }

    if (sum_val == 0.0f) return;

    // 3. Normalize
    float inv_sum = 1.0f / sum_val;
    __m256 inv_sum_v = _mm256_set1_ps(inv_sum);
    i = 0;
    for (; i + 31 < n; i += 32) {
        __m256 o0 = _mm256_loadu_ps(output + i);
        __m256 o1 = _mm256_loadu_ps(output + i + 8);
        __m256 o2 = _mm256_loadu_ps(output + i + 16);
        __m256 o3 = _mm256_loadu_ps(output + i + 24);

        __m256 m0 = _mm256_mul_ps(o0, inv_sum_v);
        __m256 m1 = _mm256_mul_ps(o1, inv_sum_v);
        __m256 m2 = _mm256_mul_ps(o2, inv_sum_v);
        __m256 m3 = _mm256_mul_ps(o3, inv_sum_v);

        _mm256_storeu_ps(output + i, m0);
        _mm256_storeu_ps(output + i + 8, m1);
        _mm256_storeu_ps(output + i + 16, m2);
        _mm256_storeu_ps(output + i + 24, m3);
    }
    for (; i + 7 < n; i += 8) {
        _mm256_storeu_ps(output + i, _mm256_mul_ps(_mm256_loadu_ps(output + i), inv_sum_v));
    }
    for (; i < n; ++i) {
        output[i] *= inv_sum;
    }
}

} // namespace ml_kernels
11 changes: 11 additions & 0 deletions ml_kernels/src/kernel_bench.cpp
@@ -316,6 +316,17 @@ class SoftmaxV4Benchmark : public SoftmaxBenchmark {
};
REGISTER_BENCHMARK(SoftmaxV4Benchmark);

class SoftmaxV5Benchmark : public SoftmaxBenchmark {
public:
    const char *name() const override { return "softmax_v5"; }

    void run() override {
        ml_kernels::softmax_v5(inputs_[current_idx_].data(), outputs_[current_idx_].data(), inputs_[0].size());
        current_idx_ = (current_idx_ + 1) % pool_size_;
    }
};
REGISTER_BENCHMARK(SoftmaxV5Benchmark);

} // namespace

int main(int argc, char **argv) {
30 changes: 30 additions & 0 deletions ml_kernels/src/test_naive_ops.cpp
@@ -152,10 +152,40 @@ void test_softmax_v4() {
std::cout << "test_softmax_v4 passed!" << std::endl;
}

void test_softmax_v5() {
    std::cout << "Running test_softmax_v5..." << std::endl;
    std::vector<float> input = {
        -2.0f, -0.5f, 1.0f, 3.0f,
        0.0f, 0.0f, 0.0f, 0.0f,
        100.0f, 100.0f, -100.0f, -100.0f,
        5.0f, -5.0f, 2.0f, -2.0f,
        1.1f, 1.2f, 1.3f, 1.4f,
        -1.1f, -1.2f, -1.3f, -1.4f,
        10.0f, 20.0f, 30.0f, 40.0f,
        -10.0f, -20.0f, -30.0f, -40.0f
    };
Comment on lines +157 to +166
⚠️ Potential issue | 🟡 Minor

Test size misses the tail loops in softmax_v5.

The input is exactly 32 floats, so execution enters the 32-wide main loop exactly once and neither tail (the `i + 7 < n` 8-wide loop nor the scalar `< n` remainder) is ever exercised. Given that this PR's whole point is new code in the hot path that feeds those same tails, and that `test_softmax_v3`/`v4` use a 40-element vector (one iteration of the 8-wide tail), `test_softmax_v5` is actually a regression in tail coverage.

Consider bumping the input to a size like 41 or 45 so all three loop phases run at least once:

Proposed additional coverage
     std::vector<float> input = {
         -2.0f, -0.5f, 1.0f, 3.0f,
         0.0f, 0.0f, 0.0f, 0.0f,
         100.0f, 100.0f, -100.0f, -100.0f,
         5.0f, -5.0f, 2.0f, -2.0f,
         1.1f, 1.2f, 1.3f, 1.4f,
         -1.1f, -1.2f, -1.3f, -1.4f,
         10.0f, 20.0f, 30.0f, 40.0f,
-        -10.0f, -20.0f, -30.0f, -40.0f
+        -10.0f, -20.0f, -30.0f, -40.0f,
+        // 8-wide tail
+        0.25f, -0.25f, 0.75f, -0.75f, 1.5f, -1.5f, 2.5f, -2.5f,
+        // scalar tail
+        0.1f, -0.1f, 0.3f
     };

    std::vector<float> output_naive(input.size(), 0.0f);
    std::vector<float> output_v5(input.size(), 0.0f);

    ml_kernels::softmax_naive(input.data(), output_naive.data(), input.size());
    ml_kernels::softmax_v5(input.data(), output_v5.data(), input.size());

    float sum = 0.0f;
    for (std::size_t i = 0; i < input.size(); ++i) {
        assert(std::fabs(output_naive[i] - output_v5[i]) < 1e-4f);
        sum += output_v5[i];
    }
    assert(std::fabs(sum - 1.0f) < 1e-4f);

    std::cout << "test_softmax_v5 passed!" << std::endl;
}

int main() {
    test_relu_naive();
    test_max_naive();
    test_softmax_v3();
    test_softmax_v4();
    test_softmax_v5();
    std::cout << "All tests passed successfully!" << std::endl;
}