Feat(shmem): implement GET (remote read) device API with blocking and non-blocking by jhchouuu · Pull Request #193 · ROCm/mori

jhchouuu · 2026-03-10T08:32:12Z

Motivation

MORI shmem currently only supports PUT (remote write) operations. To align with the OpenSHMEM standard and enable use cases like remote data fetch in MoE combine kernels, this PR adds GET (remote read) — the counterpart to PUT. GET allows a PE to read data from a remote PE's symmetric heap into its own local memory, complementing the existing PUT-based communication model.

Technical Details

Non-blocking GET (`ShmemGetMemNbi*`)

- P2P transport: direct GPU memory copy from `peerPtrs[pe]` to local memory, using ThreadCopy/WarpCopy/BlockCopy
- IBGDA/RDMA transport: posts RDMA READ WQEs via PostRead (the NIC fetches remote data into local GPU memory); the caller must call ShmemQuietThread before accessing the data
- SDMA transport: delegates to P2P (native SDMA GET is a TODO)
- Supports both SymmMemObjPtr-based and pure address-based APIs
- Thread/Warp/Block scope, with typed variants for all 11 data types
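
The completion rule for non-blocking GET (destination data is only valid after a quiet) can be illustrated with a small host-side model. This is a sketch under stated assumptions, not MORI code: `ModelEndpoint`, `GetNbi`, `Quiet`, and `Outstanding` are hypothetical names, and a plain `memcpy` stands in for the RDMA READ that the NIC would perform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical host-side model; none of these names are MORI APIs.
struct PendingRead {
  void* dst;          // local destination buffer
  const void* src;    // stand-in for the remote PE's symmetric memory
  std::size_t bytes;  // transfer size in bytes
};

class ModelEndpoint {
 public:
  // Non-blocking GET: records the read and returns immediately.
  // The destination must not be read until Quiet() has returned.
  void GetNbi(void* dst, const void* remote_src, std::size_t bytes) {
    pending_.push_back({dst, remote_src, bytes});
  }

  // Quiet: completes every read issued so far on this endpoint.
  void Quiet() {
    for (const PendingRead& r : pending_) {
      std::memcpy(r.dst, r.src, r.bytes);
    }
    pending_.clear();
  }

  // Number of reads that were issued but not yet completed.
  std::size_t Outstanding() const { return pending_.size(); }

 private:
  std::vector<PendingRead> pending_;
};
```

In this model the local buffer keeps its old contents between `GetNbi` and `Quiet`, which mirrors why the IBGDA path requires a quiet before the data is consumed.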

Blocking GET (`ShmemGetMem*`)

- Wraps GetMemNbi + ShmemQuietThread, so the data is ready when the call returns
- Follows OpenSHMEM `shmem_getmem` / `shmem_TYPE_get` semantics
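
As a rough illustration of how a blocking GET can be layered on a non-blocking one, and how a typed variant relates to the byte-wise one, here is a minimal host-side sketch. All names (`ModelGetMemNbi`, `ModelQuiet`, `ModelGetMem`, `ModelGet`) are hypothetical stand-ins rather than MORI functions, and the "remote" memory is just another host buffer. The typed wrapper counts elements while the byte-wise call counts bytes, as in OpenSHMEM's `shmem_TYPE_get` and `shmem_getmem`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical host-side stand-ins; these are not MORI functions.
struct Copy {
  void* dst;
  const void* src;
  std::size_t bytes;
};
static std::vector<Copy> g_pending;  // reads issued but not yet completed

// Stand-in for a non-blocking GET: only queues the transfer.
inline void ModelGetMemNbi(void* dst, const void* src, std::size_t bytes) {
  g_pending.push_back({dst, src, bytes});
}

// Stand-in for quiet: completes all queued transfers.
inline void ModelQuiet() {
  for (const Copy& c : g_pending) {
    std::memcpy(c.dst, c.src, c.bytes);
  }
  g_pending.clear();
}

// Blocking byte-wise GET: issue the read, then wait for completion.
inline void ModelGetMem(void* dst, const void* src, std::size_t bytes) {
  ModelGetMemNbi(dst, src, bytes);
  ModelQuiet();
}

// Typed GET: takes an element count and converts it to a byte count.
template <typename T>
inline void ModelGet(T* dst, const T* src, std::size_t nelems) {
  ModelGetMem(dst, src, nelems * sizeof(T));
}
```

Because the wrapper always ends with a quiet, the caller trades overlap of communication and compute for a simpler contract: the destination is valid on return.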

PostReadWrite template refactor

- Unified PostWrite/PostRead into a `PostReadWrite<PrvdType, IsRead>` compile-time template across the MLX5, BNXT, and Ionic providers
- Opcode selection is now `constexpr` instead of a runtime branch
- PostWrite/PostRead are retained as inline wrappers for backward compatibility
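
The shape of this refactor can be sketched as follows. This is a simplified illustration, not the code in this PR: it drops the `PrvdType` parameter, uses a hypothetical `ModelWqe` struct instead of a real provider WQE layout, and the opcode constants are placeholder values chosen for the example.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder opcode values for illustration; each real provider defines its own.
constexpr std::uint8_t kOpcodeRdmaWrite = 0x08;
constexpr std::uint8_t kOpcodeRdmaRead = 0x10;

// Hypothetical, heavily simplified work-queue entry.
struct ModelWqe {
  std::uint8_t opcode;
  std::uint64_t local_addr;
  std::uint64_t remote_addr;
  std::uint32_t bytes;
};

// One builder for both directions. IsRead is a template parameter, so the
// opcode is chosen at compile time and no runtime branch is emitted.
template <bool IsRead>
constexpr ModelWqe ModelPostReadWrite(std::uint64_t local_addr,
                                      std::uint64_t remote_addr,
                                      std::uint32_t bytes) {
  constexpr std::uint8_t opcode = IsRead ? kOpcodeRdmaRead : kOpcodeRdmaWrite;
  return ModelWqe{opcode, local_addr, remote_addr, bytes};
}

// Thin wrappers keep existing call sites unchanged.
constexpr ModelWqe ModelPostWrite(std::uint64_t local_addr,
                                  std::uint64_t remote_addr,
                                  std::uint32_t bytes) {
  return ModelPostReadWrite<false>(local_addr, remote_addr, bytes);
}

constexpr ModelWqe ModelPostRead(std::uint64_t local_addr,
                                 std::uint64_t remote_addr,
                                 std::uint32_t bytes) {
  return ModelPostReadWrite<true>(local_addr, remote_addr, bytes);
}

// The opcode choice is checkable at compile time.
static_assert(ModelPostRead(0, 0, 0).opcode == kOpcodeRdmaRead, "read opcode");
static_assert(ModelPostWrite(0, 0, 0).opcode == kOpcodeRdmaWrite, "write opcode");
```

The two instantiations share every line except the opcode, which is the duplication the unified template is meant to remove.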

Triton IR integration

- C wrapper functions (`mori_shmem_getmem_nbi_*`, `mori_shmem_getmem_*`) compiled to device bitcode
- Python `mori.ir.ops` metadata for `@core.extern` Triton kernel integration

Test Plan

C++ tests (`concurrent_get_thread`, via `mpirun -np 2`)

| Test | Description |
| --- | --- |
| Test 1 | Nbi GET Legacy API (SymmMemObjPtr, Thread scope) |
| Test 1B | Nbi GET Legacy API (Block scope) |
| Test 2 | Nbi GET Pure Address API (Thread scope) |
| Test 2B | Nbi GET Pure Address API (Block scope) |
| Test 3 | Nbi GET Large Multi-Chunk (>200MB, VMM chunk boundary crossing) |
| Test 4 | Blocking GET Legacy API (Thread scope) |
| Test 5 | Blocking GET Pure Address API (Thread scope) |

Triton bitcode tests (`torchrun --nproc_per_node=2`)

| Test | Description |
| --- | --- |
| `shmem_get_nbi_kernel` | Non-blocking GET via `getmem_nbi_thread` + `quiet_thread` |
| `shmem_get_blocking_kernel` | Blocking GET via `getmem_thread` (no separate quiet) |

Test matrix (all PASSED)

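| Test suite | MLX5 P2P | MLX5 IBGDA | BNXT P2P | BNXT IBGDA | AINIC P2P | AINIC IBGDA |
| --- | --- | --- | --- | --- | --- | --- |
| C++ Nbi GET (5 tests) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| C++ Blocking GET (2 tests) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Triton get_nbi | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Triton get_blocking | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |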

Hardware: MLX5 + MI300X, BNXT Thor2 + MI325X, AINIC Pollara + MI350X
Modes tested: Static Heap, VMM Heap (MORI_SHMEM_MODE=VMM_HEAP)

Implement shmem GET interfaces mirroring the existing PUT API pattern:
- P2P transport: direct GPU memory copy from remote peer
- IBGDA/RDMA transport: RDMA READ via PostRead
  - MLX5 (Mellanox): implemented and tested
  - BNXT (Broadcom) / PSD (Pensando): compile-time dispatch added but untested
- SDMA transport: delegated to P2P (TODO: native SDMA GET)
- Both SymmMemObjPtr and pure address-based APIs
- Thread/Warp/Block scope for all transports
- Typed variants for all 11 data types
- C wrapper (bitcode) for Triton IR integration
- Triton test and C++ mpirun test (concurrent_get_thread)

Tested: P2P, IBGDA (MLX5 only), Static Heap, VMM Heap, Triton bitcode

… IsRead> template

Replace runtime bool dispatch with a compile-time template parameter for RDMA read/write WQE construction across all three providers:
- MLX5: Mlx5PostReadWriteImpl<IsRead>, opcode now constexpr
- BNXT: BnxtPostReadWriteImpl<IsRead>, opcode now constexpr
- Ionic/PSD: IonicPostReadWriteImpl<IsRead>, opcode now constexpr

PostWrite/PostRead retained as inline wrappers for backward compatibility.

Tested: PUT + GET, P2P + IBGDA, on MLX5, BNXT, and AINIC (Ionic)

Made-with: Cursor

Add blocking (synchronous) GET that internally calls Nbi + quiet, following the OpenSHMEM shmem_getmem / shmem_TYPE_get semantics:
- SymmMemObjPtr-based and pure address-based APIs
- Thread/Warp/Block scope for all variants
- Typed variants for all 11 data types
- C wrapper (bitcode) for Triton IR integration
- Python IR ops metadata for Triton @core.extern

Tests added:
- C++ mpirun: blocking GET legacy + address API tests
- Triton: shmem_get_blocking_kernel (getmem_thread)

Tested: P2P + IBGDA on MLX5, BNXT, AINIC (all PASSED)

Note: shmem_g (single-value GET returning T) is not yet implemented; it requires an ibuf mechanism for the IBGDA RDMA READ destination buffer.

Made-with: Cursor

jhchouuu added 3 commits March 10, 2026 02:47

jhchouuu merged commit 8a4c3a1 into main Mar 20, 2026
3 of 4 checks passed

jhchouuu merged 3 commits into main from jiahzhou/shmem_get_develop
