[hipblaslt][cms] Add BF16 256x96x64 NT CMS by jinchen62 · Pull Request #4207 · ROCm/rocm-libraries

jinchen62 · 2026-02-03T05:04:50Z

Motivation

CMS for 256x96x64 NT BF16

Test Result

Test for 4096x1536x8192

Tensile:
Default: 131.721 us
CMS: 126.369 us
Speedup: 4.06%

msujon-AMD · 2026-02-03T19:07:17Z

Have you tried triple LDS buffer? There is a new parameter in tensilelite to support 3LDS with PGR2: DtlPlusLdsBuf
It will remove the WAR dependency on same LDS buffer at the beginning and you can start GR much early.

msujon-AMD

need to compare the perf with DtlPlusLdsBuf tuning parameter and may need to reschedule after applying it to start GR early.

jinchen62 · 2026-02-04T21:38:44Z

@msujon-AMD I think DtlPlusLdsBuf is integrated in develop branch not phase2 branch. How would you suggest to work with it? Do I rebase develop branch or cherry pick the DtlPlusLdsBuf integration?

msujon-AMD · 2026-02-04T22:02:25Z

@msujon-AMD I think DtlPlusLdsBuf is integrated in develop branch not phase2 branch. How would you suggest to work with it? Do I rebase develop branch or cherry pick the DtlPlusLdsBuf integration?

Let's try it from develop and see if we get better performance. We will sync phase2 branch with develop soon.

jinchen62 · 2026-02-06T00:51:25Z

@msujon-AMD I tried it from develop branch and got
no cms: 131us
no cms + DtlPlusLdsBuf: 127us
cms: 126us

but I have been dealing with numeric issue for cms + DtlPlusLdsBuf. Something seems suspicious to me. When I turned on DtlPlusLdsBuf, LWSB requires 0 instruction from 1, LRSA requires 4 from 1, LWSA requires 3 from 1. Not sure if I missed some understanding of DtlPlusLdsBuf, could you pls explain more and what's the actual changes we could do with it?

nakajee · 2026-02-06T03:01:03Z

@msujon-AMD I tried it from develop branch and got no cms: 131us no cms + DtlPlusLdsBuf: 127us cms: 126us

but I have been dealing with numeric issue for cms + DtlPlusLdsBuf. Something seems suspicious to me. When I turned on DtlPlusLdsBuf, LWSB requires 0 instruction from 1, LRSA requires 4 from 1, LWSA requires 3 from 1. Not sure if I missed some understanding of DtlPlusLdsBuf, could you pls explain more and what's the actual changes we could do with it?

That is expected.
With DtlPlusLdsBuf, LDS buffer number is more than 2.
We are no longer able to use xor swap operation to switch LDS buffer.
Instead, we have LDS buffer inc addr for write and read separately (common for A and B).

We need 3 operations for rotating LDS buffer.
Local Read case, we need to update Vgpr offset for A and B separately (1 + 1 = 2 v operations).
In total,
LocalWrite: 3 instructions (in total for A and B)
LocalRead: 5 instructions (3 (common) + 1 (A) + 1 (B))

msujon-AMD · 2026-02-06T15:09:39Z

@msujon-AMD I tried it from develop branch and got no cms: 131us no cms + DtlPlusLdsBuf: 127us cms: 126us
but I have been dealing with numeric issue for cms + DtlPlusLdsBuf. Something seems suspicious to me. When I turned on DtlPlusLdsBuf, LWSB requires 0 instruction from 1, LRSA requires 4 from 1, LWSA requires 3 from 1. Not sure if I missed some understanding of DtlPlusLdsBuf, could you pls explain more and what's the actual changes we could do with it?

That is expected. With DtlPlusLdsBuf, LDS buffer number is more than 2. We are no longer able to use xor swap operation to switch LDS buffer. Instead, we have LDS buffer inc addr for write and read separately (common for A and B).

We need 3 operations for rotating LDS buffer. Local Read case, we need to update Vgpr offset for A and B separately (1 + 1 = 2 v operations). In total, LocalWrite: 3 instructions (in total for A and B) LocalRead: 5 instructions (3 (common) + 1 (A) + 1 (B))

So, the trade off is extra VALU insts vs better scheduling of GR!
@jinchen62, can try implementing CMS from 3LDSBuffer code and see if you can beat your previous CMS code?

nakajee · 2026-02-06T15:22:42Z

So, the trade off is extra VALU insts vs better scheduling of GR! @jinchen62, can try implementing CMS from 3LDSBuffer code and see if you can beat your previous CMS code?

It is small thing, but LDS offset rotation is scalar operation, not vector.
My expectation is that most of efficiency loss comes from global read wait in mid/small MT case.

msujon-AMD · 2026-02-06T18:10:42Z

So, the trade off is extra VALU insts vs better scheduling of GR! @jinchen62, can try implementing CMS from 3LDSBuffer code and see if you can beat your previous CMS code?

It is small thing, but LDS offset rotation is scalar operation, not vector. My expectation is that most of efficiency loss comes from global read wait in mid/small MT case.

That's actually a very good news! We have extra slot right after MFMA to schedule SALU instruction. That means, we should be able to overlap the extra instructions completely and should not have any exposed cycles :)

jinchen62 · 2026-02-07T09:36:05Z

@msujon-AMD @nakajee I was not able to get much improvement. I got 125.1 us which is about 1% more. Scheduling GRB earlier might perform worse or cause numeric issue. The following is the cms with DtlPlusLdsBuf.

syncTable = [
    -1, SWaitCnt(dscnt=0, vlcnt=-1, vscnt=-1, comment="wait for prior local read local write old=0, new=0 newLW=0 newLR=0 for iteration == 0"),
    9, SWaitCnt(dscnt=0, vlcnt=-1, vscnt=-1, comment=""),
    9, SBarrier(comment=""),
    20, SWaitCnt(dscnt=0, vlcnt=-1, vscnt=-1, comment="wait for prior local read local write old=0, new=0 newLW=0 newLR=0"),
    26, SWaitCnt(dscnt=0, vlcnt=11, vscnt=-1, comment="wait for previous set of global reads"),
    26, SBarrier(comment=""),
    41, SWaitCnt(dscnt=-1, vlcnt=11, vscnt=-1, comment="wait for previous set of global reads"),
    41, SBarrier(comment=""),
]
optSchedule = {
    'SYNC': [syncTable[::2]],
    'GRIncA': [[0, 0, 0, 1, 1, 1, 2, 2, 2]],
    'GRIncB': [[3, 3, 3, 4, 4, 4, 5, 5, 5]],

    'LRA0': [[0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 7, 8]],
    'LRB0': [[9, 11, 13, 15, 17, 19]],

    'GRA': [[10, 10, 12, 12, 14, 14, 16, 16, 18, 18, 20, 20, 22, 22, 24, 24]],
    'GRB': [[29, 29, 32, 32, 35, 35]],

    'LRA1': [[27, 27, 28, 28, 30, 30, 31, 31, 33, 33, 34, 34, 36, 37, 38, 39]],
    'LRB1': [[42, 43, 44, 45, 46, 47]],

    'LRSA': [[25, 25, 25, 25]],
    'LRSB': [[25]],
    'LWSA': [[40, 40, 40]],
    'LWSB': [[]],
    'LCC': [[47, 47]],
}

nakajee · 2026-02-07T20:42:48Z

@msujon-AMD @nakajee I was not able to get much improvement. I got 125.1 us which is about 1% more. Scheduling GRB earlier might perform worse or cause numeric issue. The following is the cms with DtlPlusLdsBuf.

Thanks for your update.
I just realized DtlPlusLdsBuf is disabled if CMS is enabled....
We need to update tensilelite to enable DtlPlusLdsBuf for CMSn

nakajee · 2026-02-07T21:37:41Z

@msujon-AMD @nakajee I was not able to get much improvement. I got 125.1 us which is about 1% more. Scheduling GRB earlier might perform worse or cause numeric issue. The following is the cms with DtlPlusLdsBuf.

Thanks for your update. I just realized DtlPlusLdsBuf is disabled if CMS is enabled.... We need to update tensilelite to enable DtlPlusLdsBuf for CMSn

I realized PGR3 and DtlPlusLdsBuf enablement change has not been merged into cms_phase2 branch yet.
I will create a PR to enable PGR3 and DPB on develop branch.

nakajee · 2026-02-07T21:43:23Z

I realized PGR3 and DtlPlusLdsBuf enablement change has not been merged into cms_phase2 branch yet. I will create a PR to enable PGR3 and DPB on develop branch.

I created a new PR.
#4395

If possible, please try it to see if we can improve perf with DtlPlusLdsBuf

jinchen62 · 2026-02-07T23:27:52Z

@nakajee Actually I did comment out the line of enforcing “use cms” == 0 with DtlPlusLdsBuf, made sure it’s using cms.

nakajee · 2026-02-07T23:37:50Z

@nakajee Actually I did comment out the line of enforcing “use cms” == 0 with DtlPlusLdsBuf, made sure it’s using cms.

OK. Thanks. That means you used develop branch, right?
By the way, since this is CMS, you need to manually schedule GRA and GRB forward.
With DtlPlusLdsBuf, you can start GRA right after GRIncA (means from 3).

nakajee · 2026-02-07T23:38:58Z

Plus, you do not need sync before you start GRA

jinchen62 · 2026-02-07T23:40:32Z

@nakajee Got it. I will be able to try tomorrow evening.

## Motivation CMS for 256x96x64 NT BF16 with DtlPlusLdsBuf Open a new PR from #4207 ## Test Result On MI350 **Tensile, no CMS vs CMS** MNK = 4096,1536,8192 - Time: 1.69% improvement - Efficiency: 60.9% --> 62.8% **Bench, Baseline vs CMS** MNK = 4096,1536,8192 - Time: 7.02% improvement - Efficiency: 60.9% --> 62.8% AIGECORE-92 --------- Co-authored-by: Eugene Mezhibovsky <emezhibo@amd.com>

) ## Motivation CMS for 256x96x64 NT BF16 with DtlPlusLdsBuf Open a new PR from ROCm#4207 ## Test Result On MI350 **Tensile, no CMS vs CMS** MNK = 4096,1536,8192 - Time: 1.69% improvement - Efficiency: 60.9% --> 62.8% **Bench, Baseline vs CMS** MNK = 4096,1536,8192 - Time: 7.02% improvement - Efficiency: 60.9% --> 62.8% AIGECORE-92 --------- Co-authored-by: Eugene Mezhibovsky <emezhibo@amd.com>

## Motivation CMS for 256x96x64 NT BF16 with DtlPlusLdsBuf Open a new PR from #4207 ## Test Result On MI350 **Tensile, no CMS vs CMS** MNK = 4096,1536,8192 - Time: 1.69% improvement - Efficiency: 60.9% --> 62.8% **Bench, Baseline vs CMS** MNK = 4096,1536,8192 - Time: 7.02% improvement - Efficiency: 60.9% --> 62.8% AIGECORE-92 --------- Co-authored-by: Eugene Mezhibovsky <emezhibo@amd.com>

jinchen62 requested a review from a team as a code owner February 3, 2026 05:04

jinchen62 requested a review from talumbau February 3, 2026 05:05

github-actions Bot added the project: hipblaslt label Feb 3, 2026

assistant-librarian Bot added the organization: ROCm label Feb 3, 2026

msujon-AMD requested changes Feb 3, 2026

View reviewed changes

Add 256x96x64 nt bf16 cms

dd01c82

jinchen62 force-pushed the bf16_256x96x64_nt branch from 82e22ab to dd01c82 Compare February 12, 2026 03:53

jinchen62 changed the base branch from hipblaslt_common_cms_phase2 to develop February 12, 2026 03:53

jinchen62 requested a review from a team as a code owner February 12, 2026 03:53

github-actions Bot added the documentation label Feb 12, 2026

jinchen62 changed the base branch from develop to hipblaslt_common_cms_phase2 February 12, 2026 03:54

jinchen62 closed this Feb 12, 2026

jinchen62 mentioned this pull request Feb 12, 2026

[hipblaslt][cms] Add BF16 256x96x64 NT CMS with DtlPlusLdsBuf #4535

Merged

jinchen62 deleted the bf16_256x96x64_nt branch February 20, 2026 07:28

Conversation

jinchen62 commented Feb 3, 2026

Motivation

Test Result

Uh oh!

msujon-AMD commented Feb 3, 2026

Uh oh!

msujon-AMD left a comment

Choose a reason for hiding this comment

Uh oh!

jinchen62 commented Feb 4, 2026

Uh oh!

msujon-AMD commented Feb 4, 2026

Uh oh!

jinchen62 commented Feb 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nakajee commented Feb 6, 2026

Uh oh!

msujon-AMD commented Feb 6, 2026

Uh oh!

nakajee commented Feb 6, 2026

Uh oh!

msujon-AMD commented Feb 6, 2026

Uh oh!

jinchen62 commented Feb 7, 2026

Uh oh!

nakajee commented Feb 7, 2026

Uh oh!

nakajee commented Feb 7, 2026

Uh oh!

nakajee commented Feb 7, 2026

Uh oh!

jinchen62 commented Feb 7, 2026

Uh oh!

nakajee commented Feb 7, 2026

Uh oh!

nakajee commented Feb 7, 2026

Uh oh!

jinchen62 commented Feb 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jinchen62 commented Feb 6, 2026 •

edited

Loading