⚡ Thunderbolt: Softmax — Optimized exp256 range reduction and polynomial eval #31
Conversation
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode; when this mode is on, I will only act on comments where you specifically mention me. New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task.
📝 Walkthrough

This PR introduces new optimized AVX2 implementations for the exponential and softmax functions, featuring an improved `exp256_ps_v2` that uses Horner polynomial evaluation and direct rounding-to-nearest-even via integer conversion instead of floating-point rounding, and a new `softmax_v5` kernel that integrates this optimization with aggressive loop unrolling for improved throughput.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
🧹 Nitpick comments (1)
ml_kernels/include/ml_kernels/softmax.h (1)
403-502: Consider templatizing the v3/v4/v5 softmax bodies on the `exp` functor.
`softmax_v5` is byte-for-byte identical to `softmax_v4` (and `softmax_v3`) except for the `exp256_ps*` call on lines 441–444 and 462. That's ~100 lines duplicated three times; every future bug fix (numerical edge cases, tail-loop bounds, sum==0 handling) has to be applied in three places.

A small template like the following would eliminate the duplication without any runtime cost (calls inline just like today):
♻️ Sketch
```cpp
template <__m256 (*ExpFn)(__m256)>
inline void softmax_impl(const float *input, float *output, std::size_t n) {
  // ...existing body, calling ExpFn(x) instead of exp256_ps/exp256_ps_estrin/exp256_ps_v2...
}

inline void softmax_v3(const float *in, float *out, std::size_t n) { softmax_impl<exp256_ps>(in, out, n); }
inline void softmax_v4(const float *in, float *out, std::size_t n) { softmax_impl<exp256_ps_estrin>(in, out, n); }
inline void softmax_v5(const float *in, float *out, std::size_t n) { softmax_impl<exp256_ps_v2>(in, out, n); }
```

Non-blocking since the intent seems to be to preserve each historical variant verbatim for benchmarking/archival, but worth considering as the v-series grows.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ml_kernels/include/ml_kernels/softmax.h` around lines 403 - 502, softmax_v3/softmax_v4/softmax_v5 duplicate ~100 lines with only the exp function differing; refactor by extracting the shared body into a templated helper (e.g., softmax_impl) that takes the exp function as a template parameter (pointer-to-function or functor) and call that from softmax_v3/softmax_v4/softmax_v5 with exp256_ps, exp256_ps_estrin, and exp256_ps_v2 respectively; update all internal calls that currently invoke exp256_ps*/exp (locations around the e0/e1/e2/e3 and scalar tail) to call the templated ExpFn so behavior remains identical and inlining/no-runtime-cost is preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ml_kernels/src/test_naive_ops.cpp`:
- Around line 157-166: The test uses a 32-element input vector named input in
test_softmax_v5 which only exercises the 32-wide main loop and never hits the
8-wide tail or scalar remainder in softmax_v5; change the input vector length to
e.g. 41 or 45 (add extra float values) so that the code executes the 32-wide
main loop once, the 8-wide tail (i + 7 < n) at least once, and the scalar
remainder (< n) at least once, leaving test_softmax_v5 and the input variable
name unchanged so the new tail code is covered by the test.
---
Nitpick comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Around line 403-502: softmax_v3/softmax_v4/softmax_v5 duplicate ~100 lines
with only the exp function differing; refactor by extracting the shared body
into a templated helper (e.g., softmax_impl) that takes the exp function as a
template parameter (pointer-to-function or functor) and call that from
softmax_v3/softmax_v4/softmax_v5 with exp256_ps, exp256_ps_estrin, and
exp256_ps_v2 respectively; update all internal calls that currently invoke
exp256_ps*/exp (locations around the e0/e1/e2/e3 and scalar tail) to call the
templated ExpFn so behavior remains identical and inlining/no-runtime-cost is
preserved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cfa6d598-c9cb-4cfc-ba53-757753fc7193
📒 Files selected for processing (4)
- .jules/thunderbolt.md
- ml_kernels/include/ml_kernels/softmax.h
- ml_kernels/src/kernel_bench.cpp
- ml_kernels/src/test_naive_ops.cpp
```cpp
std::vector<float> input = {
    -2.0f, -0.5f, 1.0f, 3.0f,
    0.0f, 0.0f, 0.0f, 0.0f,
    100.0f, 100.0f, -100.0f, -100.0f,
    5.0f, -5.0f, 2.0f, -2.0f,
    1.1f, 1.2f, 1.3f, 1.4f,
    -1.1f, -1.2f, -1.3f, -1.4f,
    10.0f, 20.0f, 30.0f, 40.0f,
    -10.0f, -20.0f, -30.0f, -40.0f
};
```
Test size misses the tail loops in softmax_v5.
The input is exactly 32 floats, so execution enters the 32-wide main loop exactly once and both tails (the `i + 7 < n` 8-wide loop and the scalar `< n` remainder) are never exercised. Given this PR's whole point is new code in the hot path that feeds into those same tails, and given test_softmax_v3/v4 use a 40-element vector (1 iteration of the 8-wide tail), test_softmax_v5 is actually a regression in tail coverage.
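For orientation, the three loop phases being referenced can be sketched as follows (a structural schematic only; the function name and empty loop bodies are placeholders, not the actual `softmax_v5` implementation):

```cpp
#include <cstddef>

// Schematic of the loop phasing discussed above; the real kernel computes
// the max, exponentials, sum, and normalization inside these loops.
inline void softmax_v5_loop_shape(const float *input, float *output, std::size_t n) {
  std::size_t i = 0;
  for (; i + 31 < n; i += 32) {
    // 32-wide main loop: four independent 8-lane AVX2 exponentials (e0..e3).
  }
  for (; i + 7 < n; i += 8) {
    // 8-wide tail: one 8-lane exponential per iteration.
  }
  for (; i < n; ++i) {
    // Scalar remainder: plain std::exp fallback.
  }
  (void)input;
  (void)output;
}
```

With n = 32 only the first loop executes; with n = 41 the main loop runs once (i → 32), the 8-wide tail once (i → 40), and the scalar remainder once.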
Consider bumping the input to a size like 41 or 45 so all three loop phases run at least once:
Proposed additional coverage
```diff
 std::vector<float> input = {
     -2.0f, -0.5f, 1.0f, 3.0f,
     0.0f, 0.0f, 0.0f, 0.0f,
     100.0f, 100.0f, -100.0f, -100.0f,
     5.0f, -5.0f, 2.0f, -2.0f,
     1.1f, 1.2f, 1.3f, 1.4f,
     -1.1f, -1.2f, -1.3f, -1.4f,
     10.0f, 20.0f, 30.0f, 40.0f,
-    -10.0f, -20.0f, -30.0f, -40.0f
+    -10.0f, -20.0f, -30.0f, -40.0f,
+    // 8-wide tail
+    0.25f, -0.25f, 0.75f, -0.75f, 1.5f, -1.5f, 2.5f, -2.5f,
+    // scalar tail
+    0.1f, -0.1f, 0.3f
 };
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
std::vector<float> input = {
    -2.0f, -0.5f, 1.0f, 3.0f,
    0.0f, 0.0f, 0.0f, 0.0f,
    100.0f, 100.0f, -100.0f, -100.0f,
    5.0f, -5.0f, 2.0f, -2.0f,
    1.1f, 1.2f, 1.3f, 1.4f,
    -1.1f, -1.2f, -1.3f, -1.4f,
    10.0f, 20.0f, 30.0f, 40.0f,
    -10.0f, -20.0f, -30.0f, -40.0f,
    // 8-wide tail
    0.25f, -0.25f, 0.75f, -0.75f, 1.5f, -1.5f, 2.5f, -2.5f,
    // scalar tail
    0.1f, -0.1f, 0.3f
};
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ml_kernels/src/test_naive_ops.cpp` around lines 157 - 166, The test uses a
32-element input vector named input in test_softmax_v5 which only exercises the
32-wide main loop and never hits the 8-wide tail or scalar remainder in
softmax_v5; change the input vector length to e.g. 41 or 45 (add extra float
values) so that the code executes the 32-wide main loop once, the 8-wide tail (i
+ 7 < n) at least once, and the scalar remainder (< n) at least once, leaving
test_softmax_v5 and the input variable name unchanged so the new tail code is
covered by the test.
What:
- Added a new AVX2 kernel `softmax_v5` and a companion `exp256_ps_v2` function that optimizes the exponential approximation.
- Replaced the `_mm256_round_ps` instruction with the sequence `_mm256_cvtepi32_ps(_mm256_cvtps_epi32(x))` to achieve round-to-nearest-even with lower latency (sketched below).
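As a rough illustration of that swap (the wrapper function names here are hypothetical; the intrinsics are the ones named above), the two rounding approaches look like this:

```cpp
#include <immintrin.h>

// Dedicated rounding instruction (what the earlier kernels used).
inline __m256 round_nearest_roundps(__m256 x) {
  return _mm256_round_ps(x, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}

// Round-trip through int32: _mm256_cvtps_epi32 rounds to nearest-even under
// the default MXCSR mode, and the PR reports the conversion pair as cheaper
// than vroundps in this kernel. Only valid while |x| fits in int32, which
// holds for the small quotients produced by exp range reduction.
inline __m256 round_nearest_cvt(__m256 x) {
  return _mm256_cvtepi32_ps(_mm256_cvtps_epi32(x));
}
```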
Why:

While Estrin's scheme breaks the FMA dependency chain for a single exponential evaluation, `softmax_v4` is explicitly unrolled 4x. In a 4x unrolled loop, multiple independent Horner FMA chains interleave perfectly, saturating the execution ports and hiding latency naturally. Estrin's scheme in this context creates unnecessary instruction overhead and port pressure, acting as a bottleneck. Additionally, `round_ps` is a slow instruction.
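To make the instruction-count argument concrete, here is how a degree-5 polynomial looks under each scheme (a sketch with placeholder coefficients c[0..5], not the PR's actual constants):

```cpp
#include <immintrin.h>

// Horner: one serial chain of 5 FMAs -- minimal uop count, but each FMA
// depends on the previous one.
inline __m256 poly5_horner(__m256 r, const __m256 c[6]) {
  __m256 p = c[5];
  p = _mm256_fmadd_ps(p, r, c[4]);
  p = _mm256_fmadd_ps(p, r, c[3]);
  p = _mm256_fmadd_ps(p, r, c[2]);
  p = _mm256_fmadd_ps(p, r, c[1]);
  return _mm256_fmadd_ps(p, r, c[0]);
}

// Estrin: shorter critical path for a single evaluation, but the extra r^2
// and r^4 products add uops that compete for the FMA ports -- overhead a
// 4x-unrolled loop does not need, since four independent Horner chains
// already keep those ports busy.
inline __m256 poly5_estrin(__m256 r, const __m256 c[6]) {
  __m256 r2  = _mm256_mul_ps(r, r);
  __m256 r4  = _mm256_mul_ps(r2, r2);
  __m256 t01 = _mm256_fmadd_ps(c[1], r, c[0]);  // c0 + c1*r
  __m256 t23 = _mm256_fmadd_ps(c[3], r, c[2]);  // c2 + c3*r
  __m256 t45 = _mm256_fmadd_ps(c[5], r, c[4]);  // c4 + c5*r
  __m256 lo  = _mm256_fmadd_ps(t23, r2, t01);   // (c0 + c1*r) + (c2 + c3*r)*r^2
  return _mm256_fmadd_ps(t45, r4, lo);          // ... + (c4 + c5*r)*r^4
}
```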
How:

Implemented `exp256_ps_v2` using the `cvtps_epi32` rounding trick and Horner's FMA chain, then integrated it into `softmax_v5` while maintaining the 4x unroll and shuffle-based horizontal reduction.
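The overall shape such a routine tends to take is sketched below, assuming the standard exp(x) = 2^n · exp(r) range reduction; the function name, the Taylor coefficients, and the omission of overflow/underflow clamping are simplifications for illustration, not the PR's actual `exp256_ps_v2`:

```cpp
#include <immintrin.h>

// Sketch: exp(x) = 2^n * exp(r), with n = round(x / ln2) and r = x - n*ln2.
inline __m256 exp256_ps_v2_sketch(__m256 x) {
  const __m256 log2e   = _mm256_set1_ps(1.44269504f);   // 1 / ln(2)
  const __m256 neg_ln2 = _mm256_set1_ps(-0.69314718f);

  // Round-to-nearest-even via int conversion instead of _mm256_round_ps.
  __m256i ni = _mm256_cvtps_epi32(_mm256_mul_ps(x, log2e));
  __m256  n  = _mm256_cvtepi32_ps(ni);

  // Range-reduced argument r = x - n*ln2, roughly in [-ln2/2, ln2/2].
  __m256 r = _mm256_fmadd_ps(n, neg_ln2, x);

  // Horner FMA chain for exp(r); plain Taylor coefficients here, whereas a
  // tuned minimax set would normally be used in practice.
  __m256 p = _mm256_set1_ps(1.0f / 120.0f);
  p = _mm256_fmadd_ps(p, r, _mm256_set1_ps(1.0f / 24.0f));
  p = _mm256_fmadd_ps(p, r, _mm256_set1_ps(1.0f / 6.0f));
  p = _mm256_fmadd_ps(p, r, _mm256_set1_ps(0.5f));
  p = _mm256_fmadd_ps(p, r, _mm256_set1_ps(1.0f));
  p = _mm256_fmadd_ps(p, r, _mm256_set1_ps(1.0f));

  // Scale by 2^n by building the float exponent bits directly (AVX2 integer ops).
  __m256i pow2n = _mm256_slli_epi32(_mm256_add_epi32(ni, _mm256_set1_epi32(127)), 23);
  return _mm256_mul_ps(p, _mm256_castsi256_ps(pow2n));
}
```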
Impact:

`softmax_v5` achieves ~5.10 GFLOP/s vs `softmax_v4`'s 4.48 GFLOP/s in Fixed Memory mode (N=16384). This is a solid ~13.8% throughput improvement.

Tested on:
Linux / GCC 13.3.0 / AVX2 (CI runner environment).
How to reproduce:
`DISABLE_CPU_BINDING=1 ./build/ml_kernels/ml_kernel_bench --filter "softmax" --sizes 16384,65536`

PR created automatically by Jules for task 18282112880023903289 started by @bugparty
Summary by CodeRabbit
- Documentation
- New Features
- Tests