
⚡ Thunderbolt: Softmax — Optimized exp256 range reduction and polynomial eval #31

Merged
bugparty merged 1 commit into main from thunderbolt-softmax-v5-18282112880023903289 on Apr 24, 2026
Conversation

@bugparty
Owner

@bugparty bugparty commented Apr 23, 2026

What:
Added a new AVX2 kernel softmax_v5 and a companion exp256_ps_v2 function that optimizes the exponential approximation.

  • Replaced the high-latency _mm256_round_ps instruction with the sequence _mm256_cvtepi32_ps(_mm256_cvtps_epi32(x)) to achieve round-to-nearest-even with lower latency.
  • Switched the Taylor polynomial evaluation from Estrin's scheme back to Horner's scheme.
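Taken together, the two changes can be illustrated with a scalar sketch. This is a hypothetical stand-in, not the kernel itself: the real exp256_ps_v2 operates on __m256 vectors, and the function name exp_horner, the polynomial degree, and the plain Taylor coefficients below are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Scalar sketch of the exp256_ps_v2 technique: range reduction
// x = n*ln2 + r with n = round-to-nearest(x / ln2), a Horner-chain
// polynomial for exp(r), then recombination exp(x) = 2^n * exp(r).
float exp_horner(float x) {
    const float ln2     = 0.6931471805599453f;
    const float inv_ln2 = 1.4426950408889634f;
    // Round-to-nearest-even (default FP rounding mode); the AVX2 kernel
    // gets the same semantics from _mm256_cvtps_epi32 followed by
    // _mm256_cvtepi32_ps instead of _mm256_round_ps.
    float n = std::nearbyint(x * inv_ln2);
    float r = x - n * ln2;  // r falls in [-ln2/2, ln2/2]
    // Horner chain: one fused multiply-add per coefficient,
    // forming a single serial dependency chain.
    float p = 1.0f / 5040.0f;      // 1/7!
    p = p * r + 1.0f / 720.0f;     // 1/6!
    p = p * r + 1.0f / 120.0f;     // 1/5!
    p = p * r + 1.0f / 24.0f;      // 1/4!
    p = p * r + 1.0f / 6.0f;       // 1/3!
    p = p * r + 0.5f;              // 1/2!
    p = p * r + 1.0f;              // 1/1!
    p = p * r + 1.0f;              // 1/0!
    return std::ldexp(p, (int)n);  // scale by 2^n
}
```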

Why:
While Estrin's scheme breaks the FMA dependency chain for a single exponential evaluation, softmax_v4 is explicitly unrolled 4x. In a 4x unrolled loop, multiple independent Horner FMA chains interleave perfectly, saturating the execution ports and hiding latency naturally. Estrin's scheme in this context creates unnecessary instruction overhead and port pressure, acting as a bottleneck. Additionally, round_ps is a slow instruction.
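For reference, the two schemes can be contrasted on a small polynomial. This is a scalar sketch with illustrative function names and a degree-3 polynomial; the kernel's polynomial is higher degree and vectorized.

```cpp
#include <cassert>

// Horner vs. Estrin on the same p(x) = c0 + c1*x + c2*x^2 + c3*x^3.
// Horner: three multiply-adds in one serial dependency chain.
float poly_horner(float x, float c0, float c1, float c2, float c3) {
    return ((c3 * x + c2) * x + c1) * x + c0;
}

// Estrin: pairs terms so the two inner multiply-adds are independent
// and can issue in parallel, at the cost of an extra x*x multiply --
// the overhead the PR argues is wasted once the surrounding loop is
// already 4x unrolled and latency is hidden across iterations.
float poly_estrin(float x, float c0, float c1, float c2, float c3) {
    float x2 = x * x;        // extra multiply
    float lo = c1 * x + c0;  // independent of hi
    float hi = c3 * x + c2;  // independent of lo
    return hi * x2 + lo;
}
```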

How:
Implemented exp256_ps_v2 using the cvtps_epi32 rounding trick and Horner's FMA chain, then integrated it into softmax_v5 while maintaining the 4x unroll and shuffle-based horizontal reduction.
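The loop structure can be sketched in scalar form. Everything here is a stand-in: the hypothetical sum_exp_unrolled uses std::exp in place of exp256_ps_v2, four scalar partial sums in place of the four __m256 accumulators, and plain adds in place of the shuffle-based horizontal reduction.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Scalar analogue of the softmax_v5 accumulation phase: a 4x-unrolled
// main loop with four independent partial sums, a final reduction,
// and a scalar tail for the remainder.
float sum_exp_unrolled(const float *x, std::size_t n) {
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {  // four independent FMA-style chains
        s0 += std::exp(x[i + 0]);
        s1 += std::exp(x[i + 1]);
        s2 += std::exp(x[i + 2]);
        s3 += std::exp(x[i + 3]);
    }
    float s = (s0 + s1) + (s2 + s3);         // horizontal reduction
    for (; i < n; ++i) s += std::exp(x[i]);  // scalar tail
    return s;
}
```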

Impact:
softmax_v5 achieves ~5.10 GFLOP/s vs softmax_v4's 4.48 GFLOP/s in Fixed Memory mode (N=16384). This is a solid ~13.8% throughput improvement.

Tested on:
Linux / GCC 13.3.0 / AVX2 (CI runner environment).

How to reproduce:

DISABLE_CPU_BINDING=1 ./build/ml_kernels/ml_kernel_bench --filter "softmax" --sizes 16384,65536

PR created automatically by Jules for task 18282112880023903289 started by @bugparty

Summary by CodeRabbit

  • Documentation

    • Updated performance optimization guidance with recommendations for vectorized polynomial evaluation techniques.
  • New Features

    • Introduced new softmax implementation achieving ~13.8% performance improvement over previous versions.
  • Tests

    • Added comprehensive test coverage and benchmarks for the new softmax implementation.

…ial eval

Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
@google-labs-jules
Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

📝 Walkthrough

Walkthrough

This PR introduces optimized AVX2 implementations of the exponential and softmax functions: an improved exp256_ps_v2 that uses Horner polynomial evaluation and round-to-nearest-even via integer conversion instead of floating-point rounding, and a new softmax_v5 kernel that combines this optimization with aggressive loop unrolling for higher throughput.

Changes

Cohort / File(s): Summary

  • Documentation (.jules/thunderbolt.md): Records optimization insights comparing Horner vs. Estrin polynomial evaluation methods and rounding strategies, documenting the ~13.8% speedup achieved in softmax_v5.
  • Core Implementation (ml_kernels/include/ml_kernels/softmax.h): Introduces exp256_ps_v2 using Horner-chain evaluation with cvtps_epi32/cvtepi32_ps rounding, and the softmax_v5 kernel with 4-way unrolling and multiple SIMD accumulators for enhanced parallelism.
  • Benchmarking (ml_kernels/src/kernel_bench.cpp): Adds a SoftmaxV5Benchmark class to measure performance of the new softmax_v5 variant against baseline implementations.
  • Testing (ml_kernels/src/test_naive_ops.cpp): Introduces test_softmax_v5 to validate correctness against the naive reference, with element-wise tolerance and output normalization verification.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A rabbit hops through vectorized streams,
Horner chains fulfill our SIMD dreams,
Four-way unrolled with accumulators bright,
Integer rounding makes exponents right—
Softmax soars 13.8% to new heights! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 30.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title accurately reflects the main change: introducing softmax_v5 with optimized exp256 range reduction (switching from round_ps to cvtps_epi32 rounding) and polynomial evaluation (replacing Estrin's with Horner's method), delivering ~13.8% throughput improvement.
  • Linked Issues Check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check (✅ Passed): Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
ml_kernels/include/ml_kernels/softmax.h (1)

403-502: Consider templatizing the v3/v4/v5 softmax bodies on the exp functor.

softmax_v5 is byte-for-byte identical to softmax_v4 (and softmax_v3) except for the exp256_ps* call on lines 441–444 and 462. That's ~100 lines duplicated three times; every future bug fix (numerical edge cases, tail-loop bounds, sum==0 handling) has to be applied in three places.

A small template like the following would eliminate the duplication without any runtime cost (calls inline just like today):

♻️ Sketch
template <__m256 (*ExpFn)(__m256)>
inline void softmax_impl(const float *input, float *output, std::size_t n) {
    // ...existing body, calling ExpFn(x) instead of exp256_ps/exp256_ps_estrin/exp256_ps_v2...
}

inline void softmax_v3(const float *in, float *out, std::size_t n) { softmax_impl<exp256_ps>(in, out, n); }
inline void softmax_v4(const float *in, float *out, std::size_t n) { softmax_impl<exp256_ps_estrin>(in, out, n); }
inline void softmax_v5(const float *in, float *out, std::size_t n) { softmax_impl<exp256_ps_v2>(in, out, n); }

Non-blocking since the intent seems to be to preserve each historical variant verbatim for benchmarking/archival, but worth considering as the v-series grows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ml_kernels/include/ml_kernels/softmax.h` around lines 403 - 502,
softmax_v3/softmax_v4/softmax_v5 duplicate ~100 lines with only the exp function
differing; refactor by extracting the shared body into a templated helper (e.g.,
softmax_impl) that takes the exp function as a template parameter
(pointer-to-function or functor) and call that from
softmax_v3/softmax_v4/softmax_v5 with exp256_ps, exp256_ps_estrin, and
exp256_ps_v2 respectively; update all internal calls that currently invoke
exp256_ps*/exp (locations around the e0/e1/e2/e3 and scalar tail) to call the
templated ExpFn so behavior remains identical and inlining/no-runtime-cost is
preserved.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cfa6d598-c9cb-4cfc-ba53-757753fc7193

📥 Commits

Reviewing files that changed from the base of the PR and between 87bf964 and 25f63cf.

📒 Files selected for processing (4)
  • .jules/thunderbolt.md
  • ml_kernels/include/ml_kernels/softmax.h
  • ml_kernels/src/kernel_bench.cpp
  • ml_kernels/src/test_naive_ops.cpp

Comment on lines +157 to +166
std::vector<float> input = {
    -2.0f, -0.5f, 1.0f, 3.0f,
    0.0f, 0.0f, 0.0f, 0.0f,
    100.0f, 100.0f, -100.0f, -100.0f,
    5.0f, -5.0f, 2.0f, -2.0f,
    1.1f, 1.2f, 1.3f, 1.4f,
    -1.1f, -1.2f, -1.3f, -1.4f,
    10.0f, 20.0f, 30.0f, 40.0f,
    -10.0f, -20.0f, -30.0f, -40.0f
};

⚠️ Potential issue | 🟡 Minor

Test size misses the tail loops in softmax_v5.

The input is exactly 32 floats, so execution enters the 32-wide main loop exactly once and both tails (i + 7 < n 8-wide loop and the scalar < n remainder) are never exercised. Given this PR's whole point is new code in the hot path that feeds into those same tails, and given test_softmax_v3/v4 use a 40-element vector (1 iteration of the 8-wide tail), test_softmax_v5 is actually a regression in tail coverage.

Consider bumping the input to a size like 41 or 45 so all three loop phases run at least once:

Proposed additional coverage
     std::vector<float> input = {
         -2.0f, -0.5f, 1.0f, 3.0f,
         0.0f, 0.0f, 0.0f, 0.0f,
         100.0f, 100.0f, -100.0f, -100.0f,
         5.0f, -5.0f, 2.0f, -2.0f,
         1.1f, 1.2f, 1.3f, 1.4f,
         -1.1f, -1.2f, -1.3f, -1.4f,
         10.0f, 20.0f, 30.0f, 40.0f,
-        -10.0f, -20.0f, -30.0f, -40.0f
+        -10.0f, -20.0f, -30.0f, -40.0f,
+        // 8-wide tail
+        0.25f, -0.25f, 0.75f, -0.75f, 1.5f, -1.5f, 2.5f, -2.5f,
+        // scalar tail
+        0.1f, -0.1f, 0.3f
     };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ml_kernels/src/test_naive_ops.cpp` around lines 157 - 166, The test uses a
32-element input vector named input in test_softmax_v5 which only exercises the
32-wide main loop and never hits the 8-wide tail or scalar remainder in
softmax_v5; change the input vector length to e.g. 41 or 45 (add extra float
values) so that the code executes the 32-wide main loop once, the 8-wide tail (i
+ 7 < n) at least once, and the scalar remainder (< n) at least once, leaving
test_softmax_v5 and the input variable name unchanged so the new tail code is
covered by the test.

@bugparty bugparty merged commit 88205de into main Apr 24, 2026
8 checks passed
