8303762: Optimize vector slice operation with constant index using VPALIGNR instruction by jatin-bhateja · Pull Request #24104 · openjdk/jdk

jatin-bhateja · 2025-03-18T20:51:46Z

This patch optimizes the Vector.slice operation with a constant index using the x86 ALIGNR instruction.
It also adds a new hybrid call generator that facilitates lazy intrinsification and otherwise performs procedural inlining, which prevents call overhead and boxing penalties when the fallback implementation operates over vectors. The existing Vector API-based slice implementation is now the fallback code that gets inlined if intrinsification fails.

The idea is to add infrastructure support that enables intrinsification of the fast path for selected vector APIs and otherwise enables inlining of the fallback implementation when it is based on vector APIs. Existing call generators such as PredictedCallGenerator, used to handle bimorphic inlining, already use multiple call generators to handle hit/miss scenarios for a particular receiver type. The newly added hybrid call generator is lazy and is invoked during incremental inlining. It also relieves the inline expander of handling slow paths, which can easily be implemented on the library side (in Java).
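For readers less familiar with the operation, the following is a minimal scalar sketch of why a constant-index slice lines up with a byte-align instruction. It is not code from this patch: the class and method names are hypothetical, it models only a 16-lane byte vector, and `palignr` is a plain-Java model of the 128-bit instruction semantics (concatenate two sources, shift right by a byte count, keep the low 16 bytes).

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of the patch.
class SliceModel {
    // Scalar model of v1.slice(origin, v2): lanes origin..VLEN-1 of v1,
    // followed by lanes 0..origin-1 of v2.
    public static byte[] slice(byte[] v1, byte[] v2, int origin) {
        int vlen = v1.length;
        byte[] res = new byte[vlen];
        for (int i = 0; i < vlen; i++) {
            int j = i + origin;
            res[i] = j < vlen ? v1[j] : v2[j - vlen];
        }
        return res;
    }

    // Scalar model of a 128-bit byte align: concatenate hi:lo into 32 bytes,
    // shift right by imm bytes (zeros shift in), keep the low 16 bytes.
    public static byte[] palignr(byte[] hi, byte[] lo, int imm) {
        byte[] concat = new byte[32];
        System.arraycopy(lo, 0, concat, 0, 16);
        System.arraycopy(hi, 0, concat, 16, 16);
        byte[] res = new byte[16];
        for (int i = 0; i < 16; i++) {
            int j = i + imm;
            res[i] = j < 32 ? concat[j] : 0;
        }
        return res;
    }

    public static void main(String[] args) {
        byte[] v1 = new byte[16];
        byte[] v2 = new byte[16];
        for (int i = 0; i < 16; i++) {
            v1[i] = (byte) i;
            v2[i] = (byte) (16 + i);
        }
        // For every legal origin the two models agree.
        for (int origin = 0; origin <= 16; origin++) {
            if (!Arrays.equals(slice(v1, v2, origin), palignr(v2, v1, origin))) {
                throw new AssertionError("mismatch at origin " + origin);
            }
        }
        System.out.println(Arrays.toString(slice(v1, v2, 3)));
    }
}
```

For wider element types the same model would use a byte shift of origin times the element size. Vectors wider than 128 bits need extra handling, since VPALIGNR operates within each 128-bit lane; how the backend in this PR deals with that is not shown here.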

Vector API jtreg tests pass at AVX level 2; the remaining validation is in progress.

Performance numbers:


System : 13th Gen Intel(R) Core(TM) i3-1315U

Baseline:
Benchmark                                                (size)   Mode  Cnt      Score   Error   Units
VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2   9444.444          ops/ms
VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  10009.319          ops/ms
VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2   9081.926          ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2   6085.825          ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2   6505.378          ops/ms
VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2   6204.489          ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2   1651.334          ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2   1642.784          ops/ms
VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2   1474.808          ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  10399.394          ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  10502.894          ops/ms
VectorSliceBenchmark.shortVectorSliceWithVariableIndex     1024  thrpt    2   9756.573          ops/ms

With opt:
Benchmark                                                (size)   Mode  Cnt      Score   Error   Units
VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2  34122.435          ops/ms
VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  33281.868          ops/ms
VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2   9345.154          ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2   8283.247          ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2   8510.695          ops/ms
VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2   5626.367          ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2    960.958          ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2   4155.801          ops/ms
VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2   1465.953          ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  32748.061          ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  33674.408          ops/ms
VectorSliceBenchmark.shortVectorSliceWithVariableIndex     1024  thrpt    2   9346.148          ops/ms
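To read the two tables side by side, here is a small sketch that computes optimized/baseline throughput ratios for a subset of the constant-index rows. The scores are copied from the tables above; the class and method names are hypothetical and the choice of rows is only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for comparing the two tables; not part of the patch.
class SliceSpeedup {
    // Ratio of optimized to baseline throughput; a value above 1.0 means faster.
    public static double speedup(double baseline, double optimized) {
        return optimized / baseline;
    }

    public static void main(String[] args) {
        // {baseline, with opt} scores in ops/ms, copied from the tables above.
        Map<String, double[]> scores = new LinkedHashMap<>();
        scores.put("byteVectorSliceWithConstantIndex1", new double[] {9444.444, 34122.435});
        scores.put("intVectorSliceWithConstantIndex1", new double[] {6085.825, 8283.247});
        scores.put("longVectorSliceWithConstantIndex1", new double[] {1651.334, 960.958});
        scores.put("longVectorSliceWithConstantIndex2", new double[] {1642.784, 4155.801});
        scores.put("shortVectorSliceWithConstantIndex1", new double[] {10399.394, 32748.061});
        for (Map.Entry<String, double[]> e : scores.entrySet()) {
            double[] s = e.getValue();
            System.out.printf("%-36s %.2fx%n", e.getKey(), speedup(s[0], s[1]));
        }
    }
}
```

In this subset, longVectorSliceWithConstantIndex1 is the one row whose ratio falls below 1. With Cnt = 2 and no error column, the ratios are only indicative.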

Please share your feedback.

Best Regards,
Jatin

I confirm that I make this contribution in accordance with the OpenJDK Interim AI Policy.

Progress

Change must not contain extraneous whitespace
Commit message must refer to an issue
Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

JDK-8303762: Optimize vector slice operation with constant index using VPALIGNR instruction (Enhancement - P4)

Reviewers

Xiaohong Gong (@XiaohongGong - Committer) 🔄 Re-review required (review applies to 9625b04e)
Eric Fang (@erifan - Author) 🔄 Re-review required (review applies to ae242926)
Sandhya Viswanathan (@sviswa7 - Reviewer) 🔄 Re-review required (review applies to c5950031)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24104/head:pull/24104
$ git checkout pull/24104

Update a local copy of the PR:
$ git checkout pull/24104
$ git pull https://git.openjdk.org/jdk.git pull/24104/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 24104

View PR using the GUI difftool:
$ git pr show -t 24104

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24104.diff

Using Webrev

Link to Webrev Comment

bridgekeeper · 2025-03-18T20:52:50Z

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-03-18T20:53:17Z

@jatin-bhateja This change is no longer ready for integration - check the PR body for details.

openjdk · 2025-03-18T20:53:55Z

@jatin-bhateja The following labels will be automatically applied to this pull request:

core-libs
graal
hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

jatin-bhateja · 2025-03-18T20:54:23Z

/label add hotspot-compiler-dev

openjdk · 2025-03-18T20:55:02Z

@jatin-bhateja
The hotspot-compiler label was successfully added.

bridgekeeper · 2025-05-13T21:01:30Z

@jatin-bhateja This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

jatin-bhateja · 2025-05-18T01:41:49Z

/keepalive

openjdk · 2025-05-18T01:42:45Z

@jatin-bhateja The pull request is being re-evaluated and the inactivity timeout has been reset.

openjdk · 2025-05-18T01:44:01Z

@jatin-bhateja this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8303762
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

bridgekeeper · 2025-07-13T06:02:16Z

@jatin-bhateja This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

jatin-bhateja · 2025-07-25T02:55:41Z

Performance after AVX2 backend modifications

Benchmark                                                (size)   Mode  Cnt      Score   Error   Units
VectorSliceBenchmark.byteVectorSliceWithConstantIndex1     1024  thrpt    2  51644.530          ops/ms
VectorSliceBenchmark.byteVectorSliceWithConstantIndex2     1024  thrpt    2  48171.079          ops/ms
VectorSliceBenchmark.byteVectorSliceWithVariableIndex      1024  thrpt    2   9662.306          ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex1      1024  thrpt    2  14358.347          ops/ms
VectorSliceBenchmark.intVectorSliceWithConstantIndex2      1024  thrpt    2  14619.920          ops/ms
VectorSliceBenchmark.intVectorSliceWithVariableIndex       1024  thrpt    2   6675.824          ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex1     1024  thrpt    2    818.911          ops/ms
VectorSliceBenchmark.longVectorSliceWithConstantIndex2     1024  thrpt    2   4778.321          ops/ms
VectorSliceBenchmark.longVectorSliceWithVariableIndex      1024  thrpt    2   1612.264          ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex1    1024  thrpt    2  35961.146          ops/ms
VectorSliceBenchmark.shortVectorSliceWithConstantIndex2    1024  thrpt    2  39072.170          ops/ms
VectorSliceBenchmark.shortVectorSliceWithVariableIndex     1024  thrpt    2  11209.685          ops/ms

mlbridge · 2025-07-25T13:50:54Z

Webrevs

XiaohongGong

Still LGTM!

merykitty · 2026-03-12T16:47:34Z

I'm still not convinced by this solution. If the pattern-matching method proves to be unreliable, then we can proceed with an intrinsic. Otherwise, we risk introducing a change that will eventually become redundant.

jatin-bhateja · 2026-03-13T05:47:39Z

Hi @merykitty, as discussed earlier, your suggestions were incorporated in the latest version of the patch. The idea here is not to hold back an optimization in anticipation of a future optimization. The x86 backend changes will still be usable if, at a later point, we decide to use complex pattern matching once TypeVect carries constant information. What we have currently is generic handling that can inline any fallback after failed intrinsification attempts. Looking forward to your comments on the backend part and any further improvements to the existing handling.

Hi @sviswa7, @iwanowww, may I request you to share your views/comments?

iwanowww · 2026-03-14T00:33:35Z

I briefly looked at the patch.

First of all, I suggest separating out the logic that handles intrinsification failures. It's not specific to the proposed enhancement and will improve handling of intrinsification failures for vector operations.

Speaking of the proposed approach, it aligns well with current Vector API implementation practices. I agree it would be nice to automatically detect equivalent IR shapes and transform them accordingly, but if that means hard-coding the shape of sliceTemplate into the compiler, the current proposal does look well-justified.

jatin-bhateja · 2026-03-30T06:08:33Z

Thanks @iwanowww, I agree that the approach of inlining on intrinsic failure is generic enough and can benefit other vector operations as well, since it may absorb boxing penalties. For slice and unslice, the fallback is written entirely in vector APIs, so they benefit the most, and that is the focus of this patch.

Looking forward to your other comments on current implementation.

sviswa7 · 2026-03-31T16:25:50Z

@jatin-bhateja I agree with @iwanowww that the PR could be split into two: one handling the intrinsification failure/fallback and the other the vector slice optimization for x86. That might help you get reviews on this work. I volunteer to review the x86 PR. Order-wise, the fallback PR would need to go in first, though.

jatin-bhateja · 2026-04-01T04:35:01Z

Hi @sviswa7,
Almost all of the fallback code, apart from a few operations (unslice, slice, etc.), uses a scalar loop to compute the result, so a box created on the caller side because of a failed intrinsic will not be unboxed on the callee side, i.e. in the fallback implementation. In this context, inlining the fallback will save the call overhead but will not prevent the boxing penalty or code bloat on the callee side, which may have other side effects. This is why this PR selectively enables inlining of the slice fallback, which is composed of vector APIs; the code for that is part of this pull request.

May I request you to kindly review the x86 backend implementation part of this pull request and share your feedback?

Best Regards

jatin-bhateja · 2026-04-06T15:55:29Z

Hi @iwanowww , kindly let us know your comments on current implementation.

jatin-bhateja · 2026-04-07T08:46:35Z

Hi @sviswa7 , your comments have been addressed, kindly verify

jatin-bhateja · 2026-04-14T08:57:32Z

/template append

openjdk · 2026-04-14T09:00:14Z

@jatin-bhateja The pull request template has been appended to the pull request body

sviswa7

x86 changes look good to me. You will need another review from compiler folks for the changes in the call generator to handle fallback.

iwanowww

It would be much simpler to review inlining-related and VectorSlice-related parts separately.

iwanowww · 2026-04-15T19:46:17Z

  GrowableArray<CallGenerator*> _boxing_late_inlines; // same but for boxing operations

  GrowableArray<CallGenerator*> _vector_reboxing_late_inlines; // same but for vector reboxing operations
+  GrowableArray<CallGenerator*> _vector_late_inlines; // inline fallback implementation for failed intrinsics


What's the motivation for a separate list? Why don't you perform fallback inlining when an intrinsification attempt fails?

It was to give the intrinsic another chance to succeed if it fails on the first attempt because of a non-constant context,
#24104 (comment)

Currently, if intrinsification fails, we set the generator for the CallStaticJavaNode in Compile::inline_incrementally_one
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/compile.cpp#L2108

Compile::inline_incrementally_cleanup, called after Compile::inline_incrementally_one, internally runs IGVN optimizations
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/compile.cpp#L2213

CallStaticJavaNode idealization then re-injects the failed intrinsic call node into the _late_inlines list for another intrinsification attempt.
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/callnode.cpp#L1175

If we inline the fallback on the first intrinsification failure, we lose another opportunity to intrinsify. _vector_late_inlines collects such call generators, and once we are through with the intrinsification attempts, it inlines the failed intrinsic calls towards the end, along the lines of _string_late_inlines.

https://github.com/openjdk/jdk/pull/24104/changes#diff-f076857d7da81f56709da3de1511b1105727032186cde4d02c678667761f46eaR2252
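The retry-then-fallback ordering described above can be sketched as a toy model. This is not HotSpot code: the class, the `CallSite` type, and the notion of a fixed number of "rounds" are hypothetical simplifications, and whether a site can be intrinsified is reduced to a flag saying in which round its index becomes a constant.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of deferred fallback inlining; all names are hypothetical.
class LateInlineModel {
    public enum Outcome { PENDING, INTRINSIFIED, FALLBACK_INLINED }

    public static final class CallSite {
        public final String name;
        // Round in which the index becomes a compile-time constant; -1 means never.
        public final int constantKnownAtRound;
        public Outcome outcome = Outcome.PENDING;

        public CallSite(String name, int constantKnownAtRound) {
            this.name = name;
            this.constantKnownAtRound = constantKnownAtRound;
        }

        boolean tryIntrinsify(int round) {
            return constantKnownAtRound >= 0 && round >= constantKnownAtRound;
        }
    }

    public static void process(List<CallSite> sites, int rounds) {
        List<CallSite> pending = new ArrayList<>(sites);
        for (int round = 0; round < rounds && !pending.isEmpty(); round++) {
            List<CallSite> failed = new ArrayList<>();
            for (CallSite cs : pending) {
                if (cs.tryIntrinsify(round)) {
                    cs.outcome = Outcome.INTRINSIFIED;
                } else {
                    failed.add(cs); // kept for another attempt in the next round
                }
            }
            pending = failed;
        }
        // Only after all attempts are exhausted is the fallback inlined.
        for (CallSite cs : pending) {
            cs.outcome = Outcome.FALLBACK_INLINED;
        }
    }

    public static void main(String[] args) {
        List<CallSite> sites = new ArrayList<>();
        sites.add(new CallSite("constant from the start", 0));
        sites.add(new CallSite("constant after cleanup", 1));
        sites.add(new CallSite("variable index", -1));
        process(sites, 2);
        for (CallSite cs : sites) {
            System.out.println(cs.name + " -> " + cs.outcome);
        }
    }
}
```

Inlining the fallback in the first round would have marked the second site as a fallback even though a later round could intrinsify it, which is the lost opportunity the explanation refers to.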

Ok, but you could delay vector operation intrinsification until a full round of late inlining is over and then dispatch between intrinsic and fallback implementation.

Overall, I'm not fully satisfied with current implementation. Please, extract it in a separate PR and let's discuss it there.

Hi @iwanowww
This pull request performs partial intrinsification of the slice API, and if that does not succeed, we attempt to inline the vector API-based fallback implementation. Moving the compiler-side change into a new PR would also involve factoring out the Java-side changes related to slice.

I agree with you that the existing handling in CallGenerator::do_late_inline_helper is somewhat messy. I have cleaned up the handling for populating _vector_late_inlines in the latest patch. I request you to kindly take another look at the change and let me know if it looks fine now.

Best Regards

openjdk · 2026-04-16T20:12:33Z

The total number of required reviews for this PR has been set to 2 based on the presence of these labels: hotspot, hotspot-compiler. This can be overridden with the /reviewers command.

8303762: Optimize vector slice operation with constant index using VPALIGNR instruction

2a17c5d

openjdk bot added graal graal-dev@openjdk.org hotspot hotspot-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Mar 18, 2025

openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Mar 18, 2025

openjdk bot added the merge-conflict Pull request has merge conflict with target branch label May 18, 2025

bridgekeeper bot added oca Needs verification of OCA signatory status and removed oca Needs verification of OCA signatory status labels Jul 15, 2025

jbhateja added 2 commits July 23, 2025 22:29

Merge branch 'master' of https://github.com/openjdk/jdk into JDK-8303762

7385c75

new benchmark

edf51e7

openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Jul 24, 2025

jatin-bhateja force-pushed the JDK-8303762 branch from 3d09134 to c0b9eea Compare July 25, 2025 02:53

Optimizing AVX2 backend and some re-factoring

607a8fc

jatin-bhateja force-pushed the JDK-8303762 branch from c0b9eea to 607a8fc Compare July 25, 2025 02:55

Fixes for failing regressions

b2e9343

jatin-bhateja marked this pull request as ready for review July 25, 2025 13:40

openjdk bot added the rfr Pull request is ready for review label Jul 25, 2025

Updating predicate checks

04be59a

XiaohongGong approved these changes Mar 12, 2026

sviswa7 reviewed Apr 6, 2026

Review comments resolutions

2b8f0b4

openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Apr 7, 2026

Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8303762

bde0c21

openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Apr 7, 2026

sviswa7 reviewed Apr 7, 2026

Comment thread src/hotspot/cpu/x86/x86.ad Outdated

Comment thread src/hotspot/cpu/x86/x86.ad Outdated

jatin-bhateja added 2 commits April 8, 2026 10:02

Review comments resolutions

121c40a

Review comments resolution

ad7151e

sviswa7 reviewed Apr 13, 2026

Comment thread src/hotspot/cpu/x86/assembler_x86.cpp Outdated

Comment thread src/hotspot/cpu/x86/x86.ad Outdated

Review comments resolution

2834a02

openjdk bot removed the rfr Pull request is ready for review label Apr 14, 2026

openjdk bot added the rfr Pull request is ready for review label Apr 14, 2026

Review comments resolution

c595003

sviswa7 approved these changes Apr 15, 2026

openjdk bot added the ready Pull request is ready to be integrated label Apr 15, 2026

iwanowww reviewed Apr 15, 2026

openjdk bot removed the ready Pull request is ready to be integrated label Apr 16, 2026

Review comments resolutions

46fcc9a

jatin-bhateja · Apr 17, 2026