8303762: Optimize vector slice operation with constant index using VPALIGNR instruction #24104
jatin-bhateja wants to merge 25 commits into openjdk:master
|
👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into |
|
@jatin-bhateja This change is no longer ready for integration - check the PR body for details. |
|
@jatin-bhateja The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command. |
|
/label add hotspot-compiler-dev |
|
@jatin-bhateja This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a |
|
/keepalive |
|
@jatin-bhateja The pull request is being re-evaluated and the inactivity timeout has been reset. |
|
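@jatin-bhateja this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:
git checkout JDK-8303762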
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push |
|
Performance after AVX2 backend modifications |
|
I'm still not convinced by this solution. If the pattern matching method proves to be unreliable, then we can proceed with an intrinsic. Otherwise, we risk introducing a change that will eventually become redundant.
Hi @merykitty, as discussed earlier, your suggestions were incorporated in the latest version of the patch. The idea here is not to hold back an optimization in anticipation of a future one. The x86 backend changes will still be usable if we later decide to use complex pattern matching once TypeVect carries constant information. What we have currently is generic handling that can inline any fallback after failed intrinsification attempts. Looking forward to your comments on the backend part and any further improvements to the existing handling.
Hi @sviswa7, @iwanowww, may I request you to share your views / comments?
|
I briefly looked at the patch. First of all, I suggest separating out the logic that handles intrinsification failures. It's not specific to the proposed enhancement and will improve handling of intrinsification failures for vector operations. Speaking of the proposed approach, it aligns well with current Vector API implementation practices. I agree it would be nice to automatically detect equivalent IR shapes and transform them accordingly, but if it means hard-coding the shape of |
Thanks @iwanowww, I agree that the approach of inlining on intrinsic failure is generic enough and can benefit other vector operations too, since it may absorb boxing penalties. For slice and unslice, the fallback is written entirely with Vector API calls, so it benefits the most, and that is the focus of this patch. Looking forward to your other comments on the current implementation.
|
@jatin-bhateja I agree with @iwanowww that the PR could be split into two: one handling the intrinsification failure/fallback and the other with the vector slice optimization for x86. That might help you get reviews on this work. I volunteer to review the x86 PR. Order-wise, the fallback PR would need to get in first, though.
Hi @sviswa7, may I request you to kindly review the x86 backend implementation part of this pull request and share your feedback? Best Regards
|
Hi @iwanowww, kindly let us know your comments on the current implementation.
|
Hi @sviswa7, your comments have been addressed; kindly verify.
|
/template append |
|
@jatin-bhateja The pull request template has been appended to the pull request body |
sviswa7 left a comment:
x86 changes look good to me. You will need another review from compiler folks for the changes in the call generator to handle fallback.
iwanowww left a comment:
It would be much simpler to review inlining-related and VectorSlice-related parts separately.
  GrowableArray<CallGenerator*> _boxing_late_inlines; // same but for boxing operations
  GrowableArray<CallGenerator*> _vector_reboxing_late_inlines; // same but for vector reboxing operations
  GrowableArray<CallGenerator*> _vector_late_inlines; // inline fallback implementation for failed intrinsics
|
What's the motivation for a separate list? Why don't you perform fallback inlining when intrinsification attempt fails?
|
It was to give the intrinsic another chance to succeed if it fails due to a non-constant context on the first attempt:
#24104 (comment)
Currently, if intrinsification fails, we set the generator for the CallStaticJavaNode in Compile::inline_incrementally_one:
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/compile.cpp#L2108
Compile::inline_incrementally_cleanup, called after Compile::inline_incrementally_one, internally runs IGVN optimizations:
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/compile.cpp#L2213
CallStaticJavaNode idealization then re-injects the failed intrinsic call node into the _late_inlines list for another intrinsification attempt:
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/callnode.cpp#L1175
If we inline the fallback on the first intrinsification failure, we lose the opportunity to intrinsify later. _vector_late_inlines collects such call generators, and once we are through with intrinsification attempts, the failed intrinsic calls are inlined towards the end, along the lines of _string_late_inlines.
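The ordering argument above can be illustrated with a toy Java model. This is only a sketch under stated assumptions: `CallSite`, `constantFromRound`, and `run` are hypothetical names invented for illustration, and none of this is HotSpot code. Each call site becomes intrinsifiable once a cleanup round has made its index constant; the `deferFallback` flag chooses between inlining the fallback on the first failure and parking the call until all attempts are over.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LateInlineModel {
    /** Hypothetical call site: its index is a known constant from round `constantFromRound` on. */
    public static final class CallSite {
        final String name;
        final int constantFromRound;

        public CallSite(String name, int constantFromRound) {
            this.name = name;
            this.constantFromRound = constantFromRound;
        }
    }

    /**
     * Runs `rounds` intrinsification attempts. With deferFallback == true, failed
     * calls are retried in later rounds and fall back only at the very end.
     * With deferFallback == false, a call falls back on its first failure.
     */
    public static Map<String, String> run(List<CallSite> calls, int rounds, boolean deferFallback) {
        Map<String, String> outcome = new LinkedHashMap<>();
        List<CallSite> pending = new ArrayList<>(calls);
        for (int round = 0; round < rounds; round++) {
            List<CallSite> retry = new ArrayList<>();
            for (CallSite c : pending) {
                if (round >= c.constantFromRound) {
                    outcome.put(c.name, "intrinsic");
                } else if (deferFallback) {
                    retry.add(c);                    // keep for another attempt
                } else {
                    outcome.put(c.name, "fallback"); // eager: opportunity lost
                }
            }
            pending = retry;
        }
        for (CallSite c : pending) {
            outcome.put(c.name, "fallback");         // attempts exhausted
        }
        return outcome;
    }

    public static void main(String[] args) {
        List<CallSite> calls = List.of(
            new CallSite("a", 0),                    // constant from the start
            new CallSite("b", 1),                    // constant after one cleanup round
            new CallSite("c", Integer.MAX_VALUE));   // never constant
        System.out.println(run(calls, 2, true));     // {a=intrinsic, b=intrinsic, c=fallback}
        System.out.println(run(calls, 2, false));    // {a=intrinsic, b=fallback, c=fallback}
    }
}
```

In this model, call site `b` is the case the separate list is meant to protect: it is intrinsified only when the fallback decision is deferred.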
|
Ok, but you could delay vector operation intrinsification until a full round of late inlining is over and then dispatch between intrinsic and fallback implementation.
Overall, I'm not fully satisfied with current implementation. Please, extract it in a separate PR and let's discuss it there.
|
Hi @iwanowww,
This pull request performs partial intrinsification of the slice API, and if that does not succeed, we attempt to inline the Vector API based fallback implementation. Moving the compiler-side change into a new PR would also involve factoring out the Java-side changes related to slice.
I agree with you that the existing handling in CallGenerator::do_late_inline_helper is somewhat messy. I have cleaned up the handling for populating _vector_late_inlines in the latest patch. Could you kindly take another look at the change and let me know if it looks fine now?
Best Regards
|
The total number of required reviews for this PR has been set to 2 based on the presence of these labels: |
This patch optimizes the Vector.slice operation with a constant index using the x86 ALIGNR instruction.
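For readers unfamiliar with the operation, the following Java sketch models why a constant-index slice maps onto a byte-granular align instruction. It is an illustration under stated assumptions, not the code in this patch: it uses plain arrays instead of the Vector API, covers only a 128-bit vector of four int lanes, and the class and method names are hypothetical. `slice` follows the documented lane semantics (take adjacent lanes starting at `origin` in the first vector and continue into the second), while `alignr` models the 128-bit PALIGNR behaviour (concatenate two operands, shift right by a byte count, keep the low 16 bytes).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class SliceModel {
    /** Lane-level model of v1.slice(origin, v2) on int lanes. */
    public static int[] slice(int[] v1, int[] v2, int origin) {
        int vlen = v1.length;
        if (v2.length != vlen || origin < 0 || origin > vlen) {
            throw new IllegalArgumentException("bad operands");
        }
        int[] r = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int j = i + origin;
            r[i] = (j < vlen) ? v1[j] : v2[j - vlen];
        }
        return r;
    }

    /** Byte-level model of 128-bit PALIGNR: (hi:lo) >> (imm bytes), low 16 bytes kept. */
    public static byte[] alignr(byte[] hi, byte[] lo, int imm) {
        byte[] r = new byte[16];
        for (int i = 0; i < 16; i++) {
            int j = i + imm;
            r[i] = (j < 16) ? lo[j] : (j < 32) ? hi[j - 16] : (byte) 0;
        }
        return r;
    }

    /** Little-endian byte image of the lanes, as they would sit in a register. */
    public static byte[] toBytes(int[] lanes) {
        ByteBuffer bb = ByteBuffer.allocate(lanes.length * 4).order(ByteOrder.LITTLE_ENDIAN);
        bb.asIntBuffer().put(lanes);
        return bb.array();
    }

    public static int[] toInts(byte[] bytes) {
        int[] out = new int[bytes.length / 4];
        ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asIntBuffer().get(out);
        return out;
    }

    /** Slice of four int lanes expressed as one align: shift count = origin * lane size. */
    public static int[] sliceViaAlignr(int[] v1, int[] v2, int origin) {
        return toInts(alignr(toBytes(v2), toBytes(v1), origin * Integer.BYTES));
    }

    public static void main(String[] args) {
        int[] v1 = {0, 1, 2, 3};
        int[] v2 = {4, 5, 6, 7};
        System.out.println(Arrays.toString(slice(v1, v2, 1)));          // [1, 2, 3, 4]
        System.out.println(Arrays.toString(sliceViaAlignr(v1, v2, 1))); // [1, 2, 3, 4]
    }
}
```

Because the shift count is an immediate operand, this mapping only works when the slice origin is a compile-time constant. The 256-bit VPALIGNR form operates within each 128-bit lane, so wider vectors need additional handling that this model does not attempt to show.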
The patch also adds a new hybrid call generator to facilitate lazy intrinsification, or else perform procedural inlining to prevent call overhead and boxing penalties in case the fallback implementation expects to operate over vectors. The existing Vector API based slice implementation is now the fallback code that gets inlined in case intrinsification fails.
The idea here is to add infrastructure support that enables intrinsification of the fast path for selected vector APIs, and otherwise enables inlining of the fallback implementation if it is based on vector APIs. Existing call generators like PredictedCallGenerator, used to handle bimorphic inlining, already use multiple call generators to handle hit/miss scenarios for a particular receiver type. The newly added hybrid call generator is lazy and is invoked during incremental inlining. It also relieves the inline expander of handling slow paths, which can easily be implemented on the library side (Java).
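The composition described here, one generator that wraps a fast-path generator and a fallback generator, can be sketched in a few lines of Java. This is a conceptual model only: `Call`, `Generator`, `intrinsic`, `inlineFallback`, and `hybrid` are hypothetical names, and the real call generators are C++ classes in HotSpot with a different interface.

```java
import java.util.Optional;

public class HybridGeneratorModel {
    /** Hypothetical stand-in for a call site; not a HotSpot type. */
    public static final class Call {
        final String method;
        final boolean constantIndex;

        public Call(String method, boolean constantIndex) {
            this.method = method;
            this.constantIndex = constantIndex;
        }
    }

    /** A generator either produces code (a description string here) or declines. */
    public interface Generator {
        Optional<String> generate(Call call);
    }

    /** Fast path: succeeds only when the index is a compile-time constant. */
    public static Generator intrinsic() {
        return c -> c.constantIndex ? Optional.of("intrinsic:" + c.method) : Optional.empty();
    }

    /** Slow path: inlines the library (Java) fallback, which always succeeds in this model. */
    public static Generator inlineFallback() {
        return c -> Optional.of("inlined-fallback:" + c.method);
    }

    /** Hybrid: try the fast generator first and use the slow one only if it declines. */
    public static Generator hybrid(Generator fast, Generator slow) {
        return c -> {
            Optional<String> r = fast.generate(c);
            return r.isPresent() ? r : slow.generate(c);
        };
    }

    public static void main(String[] args) {
        Generator g = hybrid(intrinsic(), inlineFallback());
        System.out.println(g.generate(new Call("slice", true)).get());  // intrinsic:slice
        System.out.println(g.generate(new Call("slice", false)).get()); // inlined-fallback:slice
    }
}
```

The point of the composition is that the slow path stays ordinary library code: the compiler only has to decide which of the two generators to use, not re-implement the fallback itself.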
Vector API jtreg tests pass at AVX level 2; remaining validation is in progress.
Performance numbers:
Please share your feedback.
Best Regards,
Jatin
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24104/head:pull/24104
$ git checkout pull/24104
Update a local copy of the PR:
$ git checkout pull/24104
$ git pull https://git.openjdk.org/jdk.git pull/24104/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 24104
View PR using the GUI difftool:
$ git pr show -t 24104
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24104.diff