Enhance the coverage of FP8 TN gridbased on Navi4x #4209
Merged
ericwan-amd merged 5 commits into develop on Feb 24, 2026
9721f80 to bc7f34f
perfci run on commit 1560727
[note]: * tuned with DTV enabled * extended problem size to support frequently used models
[note]: * tuned with DTV enabled * extended gridpoints to support frequently used models
… handling on gfx1201
… handling on gfx1200
1560727 to 9830931
perfci run on commit 9830931
cmingch approved these changes Feb 24, 2026
wenchuanchen approved these changes Feb 24, 2026
aosewski pushed a commit that referenced this pull request Feb 24, 2026
## Motivation

This PR aims to enhance the performance of the gridbased kernel on Navi4x by enabling both DTVA and DTVB tuning for FP8 TN. It also introduces problem size distributions, based on the previous version, to better reflect realistic workloads.

## Technical Details

* Enabled DTVA and DTVB for more comprehensive tuning coverage
* Expanded the tuning MT combinations and other parameters
* Replaced deprecated logic YAML naming conventions for consistency
* Extended tuning support from f8f8s to all f8 datatypes

The F8F8S tuning results are shown in the figures below:

**gfx1201:**
<img width="288" height="300" alt="image" src="https://github.com/user-attachments/assets/65411684-4958-4842-a8d1-e4c2151a57f0" />

**gfx1200:**
<img width="290" height="304" alt="image" src="https://github.com/user-attachments/assets/f012ce9c-9ffa-4bf5-be1c-ba582937dec5" />

## Test Plan

Verified locally using hipblaslt-test on gfx1200 and gfx1201 platforms.

## Test Result

* gfx1201 local hipblaslt-test result: `[----------] Global test environment tear-down [==========] 40101 tests from 11 test suites ran. (388998 ms total) [ PASSED ] 40101 tests. hipBLASLt version: 100200 hipBLASLt git version: command line: ./build/release/clients/hipblaslt-test`
* gfx1200 local hipblaslt-test result: `[----------] Global test environment tear-down [==========] 40101 tests from 11 test suites ran. (473105 ms total) [ PASSED ] 40101 tests. hipBLASLt version: 100200 hipBLASLt git version: command line: ./build/release/clients/hipblaslt-test`

## Related Tickets

- https://amd-hub.atlassian.net/browse/AIHPBLAS-776

## Submission Checklist

- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
--------- Co-authored-by: ericwan-amd <Eric.Wang@amd.com>
NaveenElumalaiAMD pushed a commit that referenced this pull request Mar 6, 2026
jovanau pushed a commit to jovanau/rocm-libraries that referenced this pull request Mar 19, 2026
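Motivation
This PR aims to enhance the performance of the gridbased kernel on Navi4x by enabling both DTVA and DTVB tuning for FP8 TN. It also introduces problem size distributions, based on the previous version, to better reflect realistic workloads.
Technical Details
* Enabled DTVA and DTVB for more comprehensive tuning coverage
* Expanded the tuning MT combinations and other parameters
* Replaced deprecated logic YAML naming conventions for consistency
* Extended tuning support from f8f8s to all f8 datatypes

The F8F8S tuning results are shown in the figures below:
gfx1201: [figure: F8F8S tuning results]
gfx1200: [figure: F8F8S tuning results]
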
Test Plan
Verified locally using hipblaslt-test on gfx1200 and gfx1201 platforms.
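The output of hipblaslt-test follows the GoogleTest summary format, so a local run can be checked mechanically rather than by eye. The following is a minimal sketch, not part of this PR: the `summarize` helper and its field names are hypothetical, and the regular expressions assume only the two summary lines quoted in this PR plus GoogleTest's usual `FAILED` summary line. Skipped tests are treated strictly, so a run with skips does not count as all-passed.

```python
import re

# GoogleTest summary lines; whitespace inside the brackets is matched loosely
# because copied logs often have it collapsed.
RAN_RE = re.compile(
    r"\[=+\]\s+(\d+) tests? from (\d+) test suites? ran\.\s+\((\d+) ms total\)"
)
PASSED_RE = re.compile(r"\[\s*PASSED\s*\]\s+(\d+) tests?")
FAILED_RE = re.compile(r"\[\s*FAILED\s*\]\s+(\d+) tests?")


def summarize(log: str) -> dict:
    """Extract test counts from a GoogleTest log (hypothetical helper)."""
    ran = RAN_RE.search(log)
    if ran is None:
        raise ValueError("no GoogleTest summary line found")
    passed = PASSED_RE.search(log)
    failed = FAILED_RE.search(log)
    total = int(ran.group(1))
    n_passed = int(passed.group(1)) if passed else 0
    n_failed = int(failed.group(1)) if failed else 0
    return {
        "total": total,
        "suites": int(ran.group(2)),
        "ms": int(ran.group(3)),
        "passed": n_passed,
        "failed": n_failed,
        # Strict: every test that ran must be reported as passed.
        "all_passed": n_failed == 0 and n_passed == total,
    }


# Summary text as reported for gfx1201 in this PR.
sample = (
    "[----------] Global test environment tear-down "
    "[==========] 40101 tests from 11 test suites ran. (388998 ms total) "
    "[ PASSED ] 40101 tests."
)
print(summarize(sample))
```

In practice the log would come from redirecting the output of `./build/release/clients/hipblaslt-test` to a file and passing the file contents to `summarize`.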
Test Result
gfx1201 local hipblaslt-test result:
[----------] Global test environment tear-down
[==========] 40101 tests from 11 test suites ran. (388998 ms total)
[ PASSED ] 40101 tests.
hipBLASLt version: 100200
hipBLASLt git version:
command line: ./build/release/clients/hipblaslt-test
gfx1200 local hipblaslt-test result:
[----------] Global test environment tear-down
[==========] 40101 tests from 11 test suites ran. (473105 ms total)
[ PASSED ] 40101 tests.
hipBLASLt version: 100200
hipBLASLt git version:
command line: ./build/release/clients/hipblaslt-test
Related Tickets
https://amd-hub.atlassian.net/browse/AIHPBLAS-776
Submission Checklist
Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.