Skip to content

[hipblaslt][cms] Add BF16 256x96x64 NT CMS#4207

Closed
jinchen62 wants to merge 1 commit intoROCm:hipblaslt_common_cms_phase2from
jinchen62:bf16_256x96x64_nt
Closed

[hipblaslt][cms] Add BF16 256x96x64 NT CMS#4207
jinchen62 wants to merge 1 commit intoROCm:hipblaslt_common_cms_phase2from
jinchen62:bf16_256x96x64_nt

Conversation

@jinchen62
Copy link
Copy Markdown
Contributor

Motivation

CMS for 256x96x64 NT BF16

Test Result

Test for 4096x1536x8192

Tensile:
Default: 131.721 us
CMS: 126.369 us
Speedup: 4.06%

@msujon-AMD
Copy link
Copy Markdown
Collaborator

Have you tried triple LDS buffer? There is a new parameter in tensilelite to support 3LDS with PGR2: DtlPlusLdsBuf
It will remove the WAR dependency on same LDS buffer at the beginning and you can start GR much early.

Copy link
Copy Markdown
Collaborator

@msujon-AMD msujon-AMD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to compare the perf with DtlPlusLdsBuf tuning parameter and may need to reschedule after applying it to start GR early.

@jinchen62
Copy link
Copy Markdown
Contributor Author

@msujon-AMD I think DtlPlusLdsBuf is integrated in develop branch not phase2 branch. How would you suggest to work with it? Do I rebase develop branch or cherry pick the DtlPlusLdsBuf integration?

@msujon-AMD
Copy link
Copy Markdown
Collaborator

@msujon-AMD I think DtlPlusLdsBuf is integrated in develop branch not phase2 branch. How would you suggest to work with it? Do I rebase develop branch or cherry pick the DtlPlusLdsBuf integration?

Let's try it from develop and see if we get better performance. We will sync phase2 branch with develop soon.

@jinchen62
Copy link
Copy Markdown
Contributor Author

jinchen62 commented Feb 6, 2026

@msujon-AMD I tried it from develop branch and got
no cms: 131us
no cms + DtlPlusLdsBuf: 127us
cms: 126us

but I have been dealing with numeric issue for cms + DtlPlusLdsBuf. Something seems suspicious to me. When I turned on DtlPlusLdsBuf, LWSB requires 0 instruction from 1, LRSA requires 4 from 1, LWSA requires 3 from 1. Not sure if I missed some understanding of DtlPlusLdsBuf, could you pls explain more and what's the actual changes we could do with it?

@nakajee
Copy link
Copy Markdown
Contributor

nakajee commented Feb 6, 2026

@msujon-AMD I tried it from develop branch and got no cms: 131us no cms + DtlPlusLdsBuf: 127us cms: 126us

but I have been dealing with numeric issue for cms + DtlPlusLdsBuf. Something seems suspicious to me. When I turned on DtlPlusLdsBuf, LWSB requires 0 instruction from 1, LRSA requires 4 from 1, LWSA requires 3 from 1. Not sure if I missed some understanding of DtlPlusLdsBuf, could you pls explain more and what's the actual changes we could do with it?

That is expected.
With DtlPlusLdsBuf, LDS buffer number is more than 2.
We are no longer able to use xor swap operation to switch LDS buffer.
Instead, we have LDS buffer inc addr for write and read separately (common for A and B).

We need 3 operations for rotating LDS buffer.
Local Read case, we need to update Vgpr offset for A and B separately (1 + 1 = 2 v operations).
In total,
LocalWrite: 3 instructions (in total for A and B)
LocalRead: 5 instructions (3 (common) + 1 (A) + 1 (B))

@msujon-AMD
Copy link
Copy Markdown
Collaborator

@msujon-AMD I tried it from develop branch and got no cms: 131us no cms + DtlPlusLdsBuf: 127us cms: 126us
but I have been dealing with numeric issue for cms + DtlPlusLdsBuf. Something seems suspicious to me. When I turned on DtlPlusLdsBuf, LWSB requires 0 instruction from 1, LRSA requires 4 from 1, LWSA requires 3 from 1. Not sure if I missed some understanding of DtlPlusLdsBuf, could you pls explain more and what's the actual changes we could do with it?

That is expected. With DtlPlusLdsBuf, LDS buffer number is more than 2. We are no longer able to use xor swap operation to switch LDS buffer. Instead, we have LDS buffer inc addr for write and read separately (common for A and B).

We need 3 operations for rotating LDS buffer. Local Read case, we need to update Vgpr offset for A and B separately (1 + 1 = 2 v operations). In total, LocalWrite: 3 instructions (in total for A and B) LocalRead: 5 instructions (3 (common) + 1 (A) + 1 (B))

So, the trade off is extra VALU insts vs better scheduling of GR!
@jinchen62, can try implementing CMS from 3LDSBuffer code and see if you can beat your previous CMS code?

@nakajee
Copy link
Copy Markdown
Contributor

nakajee commented Feb 6, 2026

So, the trade off is extra VALU insts vs better scheduling of GR! @jinchen62, can try implementing CMS from 3LDSBuffer code and see if you can beat your previous CMS code?

It is small thing, but LDS offset rotation is scalar operation, not vector.
My expectation is that most of efficiency loss comes from global read wait in mid/small MT case.

@msujon-AMD
Copy link
Copy Markdown
Collaborator

So, the trade off is extra VALU insts vs better scheduling of GR! @jinchen62, can try implementing CMS from 3LDSBuffer code and see if you can beat your previous CMS code?

It is small thing, but LDS offset rotation is scalar operation, not vector. My expectation is that most of efficiency loss comes from global read wait in mid/small MT case.

That's actually a very good news! We have extra slot right after MFMA to schedule SALU instruction. That means, we should be able to overlap the extra instructions completely and should not have any exposed cycles :)

@jinchen62
Copy link
Copy Markdown
Contributor Author

@msujon-AMD @nakajee I was not able to get much improvement. I got 125.1 us which is about 1% more. Scheduling GRB earlier might perform worse or cause numeric issue. The following is the cms with DtlPlusLdsBuf.

syncTable = [
    -1, SWaitCnt(dscnt=0, vlcnt=-1, vscnt=-1, comment="wait for prior local read local write old=0, new=0 newLW=0 newLR=0 for iteration == 0"),
    9, SWaitCnt(dscnt=0, vlcnt=-1, vscnt=-1, comment=""),
    9, SBarrier(comment=""),
    20, SWaitCnt(dscnt=0, vlcnt=-1, vscnt=-1, comment="wait for prior local read local write old=0, new=0 newLW=0 newLR=0"),
    26, SWaitCnt(dscnt=0, vlcnt=11, vscnt=-1, comment="wait for previous set of global reads"),
    26, SBarrier(comment=""),
    41, SWaitCnt(dscnt=-1, vlcnt=11, vscnt=-1, comment="wait for previous set of global reads"),
    41, SBarrier(comment=""),
]
optSchedule = {
    'SYNC': [syncTable[::2]],
    'GRIncA': [[0, 0, 0, 1, 1, 1, 2, 2, 2]],
    'GRIncB': [[3, 3, 3, 4, 4, 4, 5, 5, 5]],

    'LRA0': [[0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 7, 8]],
    'LRB0': [[9, 11, 13, 15, 17, 19]],

    'GRA': [[10, 10, 12, 12, 14, 14, 16, 16, 18, 18, 20, 20, 22, 22, 24, 24]],
    'GRB': [[29, 29, 32, 32, 35, 35]],

    'LRA1': [[27, 27, 28, 28, 30, 30, 31, 31, 33, 33, 34, 34, 36, 37, 38, 39]],
    'LRB1': [[42, 43, 44, 45, 46, 47]],

    'LRSA': [[25, 25, 25, 25]],
    'LRSB': [[25]],
    'LWSA': [[40, 40, 40]],
    'LWSB': [[]],
    'LCC': [[47, 47]],
}

@nakajee
Copy link
Copy Markdown
Contributor

nakajee commented Feb 7, 2026

@msujon-AMD @nakajee I was not able to get much improvement. I got 125.1 us which is about 1% more. Scheduling GRB earlier might perform worse or cause numeric issue. The following is the cms with DtlPlusLdsBuf.

Thanks for your update.
I just realized DtlPlusLdsBuf is disabled if CMS is enabled....
We need to update tensilelite to enable DtlPlusLdsBuf for CMSn

@nakajee
Copy link
Copy Markdown
Contributor

nakajee commented Feb 7, 2026

@msujon-AMD @nakajee I was not able to get much improvement. I got 125.1 us which is about 1% more. Scheduling GRB earlier might perform worse or cause numeric issue. The following is the cms with DtlPlusLdsBuf.

Thanks for your update. I just realized DtlPlusLdsBuf is disabled if CMS is enabled.... We need to update tensilelite to enable DtlPlusLdsBuf for CMSn

I realized PGR3 and DtlPlusLdsBuf enablement change has not been merged into cms_phase2 branch yet.
I will create a PR to enable PGR3 and DPB on develop branch.

@nakajee
Copy link
Copy Markdown
Contributor

nakajee commented Feb 7, 2026

I realized PGR3 and DtlPlusLdsBuf enablement change has not been merged into cms_phase2 branch yet. I will create a PR to enable PGR3 and DPB on develop branch.

I created a new PR.
#4395

If possible, please try it to see if we can improve perf with DtlPlusLdsBuf

@jinchen62
Copy link
Copy Markdown
Contributor Author

@nakajee Actually I did comment out the line of enforcing “use cms” == 0 with DtlPlusLdsBuf, made sure it’s using cms.

@nakajee
Copy link
Copy Markdown
Contributor

nakajee commented Feb 7, 2026

@nakajee Actually I did comment out the line of enforcing “use cms” == 0 with DtlPlusLdsBuf, made sure it’s using cms.

OK. Thanks. That means you used develop branch, right?
By the way, since this is CMS, you need to manually schedule GRA and GRB forward.
With DtlPlusLdsBuf, you can start GRA right after GRIncA (means from 3).

@nakajee
Copy link
Copy Markdown
Contributor

nakajee commented Feb 7, 2026

Plus, you do not need sync before you start GRA

@jinchen62
Copy link
Copy Markdown
Contributor Author

@nakajee Got it. I will be able to try tomorrow evening.

@jinchen62 jinchen62 changed the base branch from hipblaslt_common_cms_phase2 to develop February 12, 2026 03:53
@jinchen62 jinchen62 requested a review from a team as a code owner February 12, 2026 03:53
@jinchen62 jinchen62 changed the base branch from develop to hipblaslt_common_cms_phase2 February 12, 2026 03:54
@jinchen62 jinchen62 closed this Feb 12, 2026
@jinchen62 jinchen62 deleted the bf16_256x96x64_nt branch February 20, 2026 07:28
jinchen62 added a commit that referenced this pull request Mar 4, 2026
## Motivation

CMS for 256x96x64 NT BF16 with DtlPlusLdsBuf 
Open a new PR from #4207

## Test Result

On MI350

**Tensile, no CMS vs CMS**
MNK = 4096,1536,8192
- Time: 1.69% improvement
- Efficiency: 60.9% --> 62.8%

**Bench, Baseline vs CMS**
MNK = 4096,1536,8192
- Time: 7.02% improvement
- Efficiency: 60.9% --> 62.8%


AIGECORE-92

---------

Co-authored-by: Eugene Mezhibovsky <emezhibo@amd.com>
NaveenElumalaiAMD pushed a commit that referenced this pull request Mar 6, 2026
## Motivation

CMS for 256x96x64 NT BF16 with DtlPlusLdsBuf 
Open a new PR from #4207

## Test Result

On MI350

**Tensile, no CMS vs CMS**
MNK = 4096,1536,8192
- Time: 1.69% improvement
- Efficiency: 60.9% --> 62.8%

**Bench, Baseline vs CMS**
MNK = 4096,1536,8192
- Time: 7.02% improvement
- Efficiency: 60.9% --> 62.8%


AIGECORE-92

---------

Co-authored-by: Eugene Mezhibovsky <emezhibo@amd.com>
jovanau pushed a commit to jovanau/rocm-libraries that referenced this pull request Mar 19, 2026
)

## Motivation

CMS for 256x96x64 NT BF16 with DtlPlusLdsBuf 
Open a new PR from ROCm#4207

## Test Result

On MI350

**Tensile, no CMS vs CMS**
MNK = 4096,1536,8192
- Time: 1.69% improvement
- Efficiency: 60.9% --> 62.8%

**Bench, Baseline vs CMS**
MNK = 4096,1536,8192
- Time: 7.02% improvement
- Efficiency: 60.9% --> 62.8%


AIGECORE-92

---------

Co-authored-by: Eugene Mezhibovsky <emezhibo@amd.com>
johannes-graner pushed a commit that referenced this pull request Mar 20, 2026
## Motivation

CMS for 256x96x64 NT BF16 with DtlPlusLdsBuf 
Open a new PR from #4207

## Test Result

On MI350

**Tensile, no CMS vs CMS**
MNK = 4096,1536,8192
- Time: 1.69% improvement
- Efficiency: 60.9% --> 62.8%

**Bench, Baseline vs CMS**
MNK = 4096,1536,8192
- Time: 7.02% improvement
- Efficiency: 60.9% --> 62.8%


AIGECORE-92

---------

Co-authored-by: Eugene Mezhibovsky <emezhibo@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants