[hipblaslt][cms] Add BF16 256x96x64 NT CMS#4207
jinchen62 wants to merge 1 commit into ROCm:hipblaslt_common_cms_phase2
|
Have you tried a triple LDS buffer? There is a new parameter in tensilelite that supports 3LDS with PGR2: DtlPlusLdsBuf |
msujon-AMD left a comment
We need to compare the perf with the DtlPlusLdsBuf tuning parameter, and may need to reschedule after applying it so that GR starts early.
|
@msujon-AMD I think |
Let's try it from develop and see if we get better performance. We will sync the phase2 branch with develop soon. |
|
@msujon-AMD I tried it from the develop branch and got but I have been dealing with a numeric issue for cms + DtlPlusLdsBuf. Something seems suspicious to me. When I turned on DtlPlusLdsBuf, LWSB now requires 0 instructions (down from 1), LRSA requires 4 (up from 1), and LWSA requires 3 (up from 1). I am not sure whether I have misunderstood DtlPlusLdsBuf. Could you please explain more, and what changes can we actually make with it? |
That is expected. We need 3 operations for rotating LDS buffer. |
So, the trade-off is extra VALU insts vs better scheduling of GR! |
It is a small thing, but LDS offset rotation is a scalar operation, not a vector one. |
That's actually very good news! We have an extra slot right after MFMA to schedule a SALU instruction. That means we should be able to overlap the extra instructions completely and should not have any exposed cycles :) |
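To make the rotation cost discussed above concrete, here is a minimal Python sketch of rotating LDS offsets across three buffers. It is an illustration under stated assumptions, not tensilelite's code generation: the buffer size, the function names, the add/compare/select breakdown of the wrap-around, and the choice to fill the buffer two slots ahead of the one being read are all hypothetical.

```python
# Illustrative model only: sizes and names are hypothetical, not tensilelite's.
LDS_BUF_BYTES = 16 * 1024                      # assumed size of one LDS buffer
NUM_LDS_BUFS = 3                               # triple LDS buffering
LDS_TOTAL_BYTES = NUM_LDS_BUFS * LDS_BUF_BYTES


def rotate_lds_offset(offset):
    """Advance a byte offset to the next LDS buffer, wrapping after the last.

    Written as three steps (add, compare, select) to show how a wrap over a
    non-power-of-two buffer count could cost a few scalar operations.
    """
    nxt = offset + LDS_BUF_BYTES               # add
    wrapped = nxt >= LDS_TOTAL_BYTES           # compare
    return 0 if wrapped else nxt               # select


def simulate(num_iters):
    """Return (read_buffer, fill_buffer) indices for each loop iteration."""
    read_off = 0                               # buffer consumed by this iteration
    fill_off = 2 * LDS_BUF_BYTES               # assumed: fill two buffers ahead
    schedule = []
    for _ in range(num_iters):
        schedule.append((read_off // LDS_BUF_BYTES, fill_off // LDS_BUF_BYTES))
        read_off = rotate_lds_offset(read_off)
        fill_off = rotate_lds_offset(fill_off)
    return schedule


schedule = simulate(6)
for i, (read_buf, fill_buf) in enumerate(schedule):
    print(f"iter {i}: read buffer {read_buf}, fill buffer {fill_buf}")
```

In this model the read and fill buffers never coincide, which is the property that would let a global read be issued early without overwriting data still being consumed.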
|
@msujon-AMD @nakajee I was not able to get much improvement. I got 125.1 us, which is only about 1% better. Scheduling GRB earlier might perform worse or cause numeric issues. The following is the cms with DtlPlusLdsBuf. |
Thanks for your update. |
I realized that the PGR3 and DtlPlusLdsBuf enablement change has not been merged into the cms_phase2 branch yet. |
I created a new PR. If possible, please try it to see whether we can improve perf with DtlPlusLdsBuf. |
|
@nakajee Actually, I commented out the line that enforces “use cms” == 0 with DtlPlusLdsBuf and made sure it is using cms. |
OK. Thanks. That means you used develop branch, right? |
|
Plus, you do not need a sync before you start GRA |
|
@nakajee Got it. I will be able to try tomorrow evening. |
82e22ab to dd01c82
## Motivation
CMS for 256x96x64 NT BF16 with DtlPlusLdsBuf
Open a new PR from #4207

## Test Result
On MI350

**Tensile, no CMS vs CMS**
MNK = 4096,1536,8192
- Time: 1.69% improvement
- Efficiency: 60.9% --> 62.8%

**Bench, Baseline vs CMS**
MNK = 4096,1536,8192
- Time: 7.02% improvement
- Efficiency: 60.9% --> 62.8%

AIGECORE-92

---------

Co-authored-by: Eugene Mezhibovsky <emezhibo@amd.com>
Motivation
CMS for 256x96x64 NT BF16
Test Result
Test for 4096x1536x8192
Tensile:
Default: 131.721 us
CMS: 126.369 us
Speedup: 4.06%
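
For anyone reproducing the numbers, the following Python sketch shows how the reported speedup follows from the two timings above. It assumes "Speedup" means the relative reduction in kernel time, and it uses the conventional 2*M*N*K FLOP count for a GEMM; the function names are illustrative only.

```python
# Timings and problem size are taken from the test result above.
M, N, K = 4096, 1536, 8192
default_us = 131.721
cms_us = 126.369


def time_reduction_pct(baseline_us, new_us):
    """Relative reduction in execution time, in percent."""
    return (baseline_us - new_us) / baseline_us * 100.0


def achieved_tflops(m, n, k, time_us):
    """Achieved throughput, assuming 2*M*N*K floating-point operations."""
    flops = 2 * m * n * k
    return flops / (time_us * 1e-6) / 1e12


reduction = time_reduction_pct(default_us, cms_us)
print(f"Time reduction: {reduction:.2f}%")
print(f"Default: {achieved_tflops(M, N, K, default_us):.1f} TFLOP/s")
print(f"CMS:     {achieved_tflops(M, N, K, cms_us):.1f} TFLOP/s")
```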