Skip to content

Add FP8/BF8 support for LDS transpose load#2210

Open
stefankoncarevic wants to merge 12 commits intodevelopfrom
lds-transpose-load-fp8
Open

Add FP8/BF8 support for LDS transpose load#2210
stefankoncarevic wants to merge 12 commits intodevelopfrom
lds-transpose-load-fp8

Conversation

@stefankoncarevic
Copy link
Copy Markdown
Contributor

@stefankoncarevic stefankoncarevic commented Jan 20, 2026

⚠️ Do not merge until #2242 and #2246 is merged

Resolves:

Implement ds_read_tr8_b64 offset formulas for FP8/BF8 MFMA (16x32, 32x16, 16x128, 32x64). Enable mixed fp8/bf8 type combinations for GEMM operations on gfx950.

Motivation

Add FP8 and BF8 data type support for LDS transpose load optimization on gfx950.
This enables efficient matrix loads using ds_read_tr8_b64 hardware instruction

Technical Details

  • LdsTransposeLoad.cpp: Implemented FP8/BF8 offset formulas in getBasePanelOffsets()
  • LdsTransposeLoad.cpp: Updated type compatibility check in makeDecision()
  • Added areBothFp8Types() check to allow mixed fp8/bf8 combinations

Test Plan

  • Add e2e tests for FP8/BF8 GEMM
  • Add e2e tests for mixed fp8/bf8 combinations

Test Result

@stefankoncarevic stefankoncarevic marked this pull request as draft January 20, 2026 12:05
@stefankoncarevic stefankoncarevic marked this pull request as ready for review January 20, 2026 14:35
@stefankoncarevic stefankoncarevic changed the title [WIP] Add FP8/BF8 support for LDS transpose load Add FP8/BF8 support for LDS transpose load Jan 20, 2026
Base automatically changed from lds-transpose-load-attention to develop January 27, 2026 15:07
@stefankoncarevic stefankoncarevic force-pushed the lds-transpose-load-fp8 branch 2 times, most recently from f3176a8 to a75ab7a Compare January 29, 2026 14:10
@stefankoncarevic stefankoncarevic changed the base branch from develop to scaled-fp8-mfma-gfx950 February 24, 2026 12:53
@stefankoncarevic stefankoncarevic force-pushed the scaled-fp8-mfma-gfx950 branch 2 times, most recently from eecc935 to 4f657da Compare March 3, 2026 08:54
@stefankoncarevic stefankoncarevic force-pushed the scaled-fp8-mfma-gfx950 branch 5 times, most recently from 185a1c7 to 1587b86 Compare March 26, 2026 12:47
@stefankoncarevic stefankoncarevic force-pushed the scaled-fp8-mfma-gfx950 branch 3 times, most recently from 196438c to 0aad9ec Compare March 26, 2026 22:13
Base automatically changed from scaled-fp8-mfma-gfx950 to develop March 27, 2026 01:45
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds FP8/BF8 enablement for Rock’s LDS transpose-load path on gfx950, including support for scaled FP8 MFMA geometries that use ds_read_tr8_b64, and expands testing coverage accordingly.

Changes:

  • Extend LDS transpose-load utility/lowering to handle FP8/BF8 element types, including quad-rate scaled FP8 geometries (16x128, 32x64).
  • Update LDS transpose MFMA-geometry validation and rock.lds_transpose_load op verification for 8-bit types (vector length = 8).
  • Add new e2e + MLIR tests covering FP8/BF8 and mixed fp8/bf8 combinations.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
mlir/lib/Dialect/Rock/utility/LdsTransposeLoad.cpp Implements FP8/BF8 lane→offset mapping, quad-rate load sequencing, and type-compatibility updates.
mlir/lib/Dialect/Rock/IR/RockDialect.cpp Extends rock.lds_transpose_load verifier to accept FP8/BF8 and enforce expected vector length.
mlir/include/mlir/Dialect/Rock/IR/RockOps.td Broadens op type constraints to allow FP8/BF8 memrefs and non-fixed vector results (checked in verifier).
mlir/lib/Dialect/Rock/IR/RockDialect.cpp Updates LDS transpose config geometry validation error message.
mlir/include/mlir/Dialect/Rock/utility/LdsTransposeLoad.h Updates public header docs to reflect new types/geometries.
mlir/include/mlir/Dialect/Rock/IR/RockAttrDefs.td Updates KDim documentation to include 64/128.
mlir/lib/Dialect/Rock/IR/RockDialect.cpp Updates LDS transpose config verification to allow (16,128) and (32,64).
mlir/test/Dialect/Rock/ops.mlir Adds FP8 rock.lds_transpose_load + new config-attr test cases.
mlir/test/Dialect/Rock/lowering_load_transpose_lds.mlir Adds lowering checks for FP8 transpose-load legalization.
mlir/test/Dialect/Rock/lds_transpose_error.mlir Updates expected error text for expanded valid geometry set.
mlir/test/e2e/PrLdsTransposeLoadFp8.toml New e2e suite for FP8/BF8 LDS transpose-load (including mixed-type cases).
mlir/test/e2e/PrLdsTransposeLoadFp8.cfg Gating for the new e2e suite based on lds_transpose_load feature.
mlir/test/e2e/CMakeLists.txt Wires the new e2e suite into CMake.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mlir/test/Dialect/Rock/ops.mlir
Comment thread mlir/test/Dialect/Rock/ops.mlir
Comment thread mlir/test/Dialect/Rock/ops.mlir Outdated
Comment thread mlir/test/Dialect/Rock/ops.mlir Outdated
Comment thread mlir/lib/Dialect/Rock/IR/RockDialect.cpp Outdated
Comment thread mlir/test/e2e/PrLdsTransposeLoadFp8.toml Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds gfx950 FP8/BF8 enablement for the LDS transpose-load optimization path (ds_read_tr8_b64), including support for scaled FP8 MFMA geometries and mixed fp8/bf8 GEMM type combinations, plus accompanying IR + e2e tests.

Changes:

  • Extend LDS transpose load utility to support FP8/BF8 (vector length 8) and scaled FP8 MFMA geometries (16x128, 32x64) with quad-rate load sequencing.
  • Update Rock dialect verification/constraints and tests to cover new supported geometries and element types.
  • Add new FP8/BF8 e2e regression suite and hook it into the e2e CMake target list.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
mlir/lib/Dialect/Rock/utility/LdsTransposeLoad.cpp Core implementation: FP8/BF8 support, new geometry handling, quad-rate offset/load logic
mlir/lib/Dialect/Rock/IR/RockDialect.cpp Verifier updates: LDS transpose config geometry list; LDS transpose load result vector length checks
mlir/include/mlir/Dialect/Rock/IR/RockOps.td Expand operand element types; loosen result type to be enforced by verifier
mlir/include/mlir/Dialect/Rock/utility/LdsTransposeLoad.h Header/docs updated for new types and geometries
mlir/include/mlir/Dialect/Rock/IR/RockAttrDefs.td Doc update: KDim now includes 64/128
mlir/test/Dialect/Rock/ops.mlir New IR tests covering fp8 lds_transpose_load and new config attrs
mlir/test/Dialect/Rock/lowering_load_transpose_lds.mlir Lowering checks for fp8 transpose_load generation
mlir/test/Dialect/Rock/lds_transpose_error.mlir Update expected error messages for expanded valid geometry list
mlir/test/e2e/PrLdsTransposeLoadFp8.toml New e2e suite covering fp8/bf8 + mixed combos
mlir/test/e2e/PrLdsTransposeLoadFp8.cfg Feature-gate for the new e2e suite
mlir/test/e2e/CMakeLists.txt Register new e2e suite in the PR test list
Comments suppressed due to low confidence (1)

mlir/lib/Dialect/Rock/utility/LdsTransposeLoad.cpp:170

  • isValidMfmaGeometry() now treats scaled FP8 geometries (16x128, 32x64) as generally valid, but the non-FP8 lane mapping path in getBasePanelOffsets() does not handle these and will hit llvm_unreachable if they ever occur with f16/bf16 types. To avoid potential compiler crashes on inconsistent inputs, add an element-type-dependent geometry check in makeDecision() (or before calling getBasePanelOffsets) so 16x128/32x64 are only accepted when the element type is FP8/BF8.
  // Validate MFMA geometry
  if (!isValidMfmaGeometry(shape.mnMfma, shape.kMfma)) {
    return dec;
  }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mlir/lib/Dialect/Rock/utility/LdsTransposeLoad.cpp Outdated
Comment thread mlir/include/mlir/Dialect/Rock/IR/RockOps.td Outdated
stefankoncarevic and others added 12 commits April 30, 2026 10:32
- Remove duplicate entries in getMfmaInsnInfoMap
- Clarify neutral scale creation comment in AccelEmitter.cpp
- Rename zeroAttr to neutralScaleAttr for clarity
Implement ds_read_tr8_b64 offset formulas for FP8/BF8 MFMA (16x32, 32x16).
Enable mixed fp8/bf8 type combinations for GEMM operations on gfx950.
Disable LDS transpose for FP8 GEMM when K >= 1280 or small square
matrices (K == N < 512) to avoid performance regressions while
preserving compile time benefits.Add FP8 GEMM heuristic to selectively disable LDS transpose
- Add (32,64) geometry support in LdsTransposeLoad.cpp:
  - New getBasePanelOffsets() branch for 32x64 quad-rate formula
  - k_block = block_id / 2, m_block = block_id % 2
  - kOffsetBase = k_local + k_block * 32
  - mOffsetBase = m_parity * 8 + m_block * 16
- Update isQuadRate detection to include 32x64
- Add validation for (32,64) in RockDialect.cpp
- Extend tuning ranges for scaled FP8 testing:
  - kPackPerBlock: added 64
  - kPack: added 32 (for k_base=32)

Co-authored-by: Cursor <cursoragent@cursor.com>
… check in LDSTransposeLoadOp::verify, update suite 4 test count comment
exact set of valid vector types and drop the now-redundant verifier
checks. Add ldsTransposeConfig structural checks to
ThreadwiseReadIntoOp::verify (rank-1 + static dest, supported element
type, (geometry, type) consistency). Replace the assert + .value() in
emitThreadwiseHWTranspose with emitOpError to avoid UB in release
builds and reject non-rank-1 / dynamic destinations up-front. Use
AmdArchInfo::hasLdsTransposeLoad for arch gating, share a single
isValidLdsTransposeMfmaGeometry helper, align the numWaves formula
with computeWaveGridLayout, drop the dead  (tuning only
emits power-of-2 wave-tile factors), and refresh doc comments. Add
four negative ODS-coverage tests for the result-type constraint.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants