Feat(shmem): implement GET (remote read) device API with blocking and non-blocking #193

Merged
jhchouuu merged 3 commits into main from jiahzhou/shmem_get_develop
Mar 20, 2026
@jhchouuu
Collaborator

Motivation

MORI shmem currently only supports PUT (remote write) operations. To align with the OpenSHMEM standard and enable use cases like remote data fetch in MoE combine kernels, this PR adds GET (remote read) — the counterpart to PUT. GET allows a PE to read data from a remote PE's symmetric heap into its own local memory, complementing the existing PUT-based communication model.

Technical Details

Non-blocking GET (ShmemGetMemNbi*)

  • P2P transport: direct GPU memory copy from peerPtrs[pe] to local, using ThreadCopy/WarpCopy/BlockCopy
  • IBGDA/RDMA transport: posts RDMA READ WQEs via PostRead (NIC fetches remote data to local GPU memory); user must call ShmemQuietThread before accessing data
  • SDMA transport: delegates to P2P (native SDMA GET is TODO)
  • Supports both SymmMemObjPtr-based and pure address-based APIs
  • Thread/Warp/Block scope, typed variants for all 11 data types
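
The completion rule for the non-blocking path can be illustrated with a small host-side model. This is only a sketch of the semantics described above, not MORI code: `FakeEndpoint`, `GetMemNbiModel`, and `QuietModel` are hypothetical names, and `memcpy` stands in for both the P2P copy and the NIC-driven RDMA READ.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Host-side model only; none of these names are MORI APIs.
enum class Transport { P2P, RDMA };

struct PendingRead {
  const void* src;    // remote buffer (modelled as ordinary host memory)
  void* dst;          // local destination
  std::size_t bytes;
};

struct FakeEndpoint {
  Transport transport;
  std::vector<PendingRead> inflight;  // reads posted but not yet completed
};

// Non-blocking GET: the P2P path copies right away, the RDMA path only
// records the request (the analogue of posting a READ WQE).
inline void GetMemNbiModel(FakeEndpoint& ep, void* dst, const void* src,
                           std::size_t bytes) {
  if (ep.transport == Transport::P2P) {
    std::memcpy(dst, src, bytes);
    return;
  }
  ep.inflight.push_back({src, dst, bytes});
}

// Quiet: complete every outstanding read; only after this is dst valid
// on the RDMA path.
inline void QuietModel(FakeEndpoint& ep) {
  for (const PendingRead& r : ep.inflight) std::memcpy(r.dst, r.src, r.bytes);
  ep.inflight.clear();
}

// Returns true when the RDMA destination is still stale before quiet and
// holds the remote data afterwards.
inline bool DemoRdmaNeedsQuiet() {
  int remote[4] = {1, 2, 3, 4};
  int local[4] = {0, 0, 0, 0};
  FakeEndpoint ep{Transport::RDMA, {}};
  GetMemNbiModel(ep, local, remote, sizeof(remote));
  const bool staleBeforeQuiet = (local[0] == 0);
  QuietModel(ep);
  return staleBeforeQuiet && local[0] == 1 && local[3] == 4;
}
```

In this model the caller may read `local` immediately on the P2P path, while on the RDMA path the data only becomes valid after the quiet step, which mirrors the requirement to call ShmemQuietThread before accessing the data.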

Blocking GET (ShmemGetMem*)

  • Wraps GetMemNbi + ShmemQuietThread — data is ready when the call returns
  • Follows OpenSHMEM shmem_getmem / shmem_TYPE_get semantics
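
As a rough illustration of "non-blocking post followed by quiet", the following host-side sketch composes the two steps into one call. `ReadQueue` and `GetMemBlockingModel` are hypothetical stand-ins rather than MORI's device functions, and the queued lambda models an RDMA READ that has been posted but not yet completed.

```cpp
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical host-side stand-ins; not MORI's device functions.
struct ReadQueue {
  std::vector<std::function<void()>> pending;

  // Models the non-blocking post: the copy is deferred until Quiet().
  void PostRead(void* dst, const void* src, std::size_t n) {
    pending.push_back([=] { std::memcpy(dst, src, n); });
  }

  // Models quiet: complete everything that was posted so far.
  void Quiet() {
    for (auto& op : pending) op();
    pending.clear();
  }
};

// Blocking GET = non-blocking post + quiet, so dst is valid on return.
inline void GetMemBlockingModel(ReadQueue& q, void* dst, const void* src,
                                std::size_t n) {
  q.PostRead(dst, src, n);
  q.Quiet();
}

inline int DemoBlockingGet() {
  const int remote = 42;
  int local = 0;
  ReadQueue q;
  GetMemBlockingModel(q, &local, &remote, sizeof(remote));
  return local;
}
```

One consequence visible in this model is that the quiet step drains every pending operation on the queue, not only the read just posted, so a blocking call issued after several non-blocking ones also waits for those earlier ones. Whether the real implementation behaves the same way depends on the scope of ShmemQuietThread.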

PostReadWrite template refactor

  • Unified PostWrite/PostRead into PostReadWrite<PrvdType, IsRead> compile-time template across MLX5, BNXT, and Ionic providers
  • Opcode selection is now constexpr instead of runtime branch
  • PostWrite/PostRead retained as inline wrappers for backward compatibility
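
The shape of this refactor can be sketched in plain C++. Everything below is illustrative: the `Provider` tags, the `Wqe` layout, and the opcode values are placeholders and not the real MLX5, BNXT, or Ionic encodings; only the pattern (a `bool IsRead` template parameter selecting the opcode at compile time, plus thin wrappers) is taken from the description above.

```cpp
#include <cstdint>

// Illustrative only: provider tags and opcode values are placeholders.
enum class Provider { A, B };

template <Provider P> struct Opcodes;
template <> struct Opcodes<Provider::A> {
  enum : std::uint8_t { kWrite = 0x01, kRead = 0x02 };
};
template <> struct Opcodes<Provider::B> {
  enum : std::uint8_t { kWrite = 0x11, kRead = 0x12 };
};

// Simplified work-queue entry; real WQEs carry many more fields.
struct Wqe {
  std::uint8_t opcode;
  std::uint64_t laddr;
  std::uint64_t raddr;
  std::uint32_t bytes;
};

// The opcode depends only on template parameters, so it is a constant
// expression and no runtime branch is needed.
template <Provider P, bool IsRead>
constexpr std::uint8_t SelectOpcode() {
  return static_cast<std::uint8_t>(IsRead ? Opcodes<P>::kRead
                                          : Opcodes<P>::kWrite);
}

template <Provider P, bool IsRead>
constexpr Wqe PostReadWriteModel(std::uint64_t laddr, std::uint64_t raddr,
                                 std::uint32_t bytes) {
  return Wqe{SelectOpcode<P, IsRead>(), laddr, raddr, bytes};
}

// Thin wrappers keep the old call sites source-compatible.
template <Provider P>
constexpr Wqe PostWriteModel(std::uint64_t laddr, std::uint64_t raddr,
                             std::uint32_t bytes) {
  return PostReadWriteModel<P, false>(laddr, raddr, bytes);
}

template <Provider P>
constexpr Wqe PostReadModel(std::uint64_t laddr, std::uint64_t raddr,
                            std::uint32_t bytes) {
  return PostReadWriteModel<P, true>(laddr, raddr, bytes);
}

// The selection is checkable at compile time.
static_assert(PostReadModel<Provider::A>(0, 0, 8).opcode == 0x02,
              "read opcode is selected at compile time");
static_assert(PostWriteModel<Provider::B>(0, 0, 8).opcode == 0x11,
              "write opcode is selected at compile time");
```

Because the read and write paths share a single body, any field that differs between them can be handled the same way, keyed on `IsRead`, without duplicating the WQE construction code per provider.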

Triton IR integration

  • C wrapper functions (mori_shmem_getmem_nbi_*, mori_shmem_getmem_*) compiled to device bitcode
  • Python mori.ir.ops metadata for @core.extern Triton kernel integration

Test Plan

C++ tests (concurrent_get_thread, via mpirun -np 2)

| Test | Description |
| --- | --- |
| Test 1 | Nbi GET Legacy API (SymmMemObjPtr, Thread scope) |
| Test 1B | Nbi GET Legacy API (Block scope) |
| Test 2 | Nbi GET Pure Address API (Thread scope) |
| Test 2B | Nbi GET Pure Address API (Block scope) |
| Test 3 | Nbi GET Large Multi-Chunk (>200MB, VMM chunk boundary crossing) |
| Test 4 | Blocking GET Legacy API (Thread scope) |
| Test 5 | Blocking GET Pure Address API (Thread scope) |
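
Test 3 exercises a GET whose range spans more than one VMM chunk. The PR text does not describe how such a transfer is split, so the following is only one plausible approach, shown as a host-side sketch: `SplitAtChunkBoundaries` and the `chunkBytes` parameter are hypothetical and do not reflect MORI's actual chunk size or code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Segment {
  std::size_t offset;  // start offset within the heap
  std::size_t bytes;   // length of this piece
};

// Split [offset, offset + bytes) so that no segment crosses a multiple of
// chunkBytes. Each segment could then be issued as its own read.
inline std::vector<Segment> SplitAtChunkBoundaries(std::size_t offset,
                                                   std::size_t bytes,
                                                   std::size_t chunkBytes) {
  assert(chunkBytes > 0);
  std::vector<Segment> out;
  while (bytes > 0) {
    // Bytes left before the next chunk boundary.
    const std::size_t room = chunkBytes - (offset % chunkBytes);
    const std::size_t n = std::min(bytes, room);
    out.push_back({offset, n});
    offset += n;
    bytes -= n;
  }
  return out;
}
```

For example, with a 64-byte chunk, a 100-byte transfer starting at offset 48 splits into three segments of 16, 64, and 20 bytes, starting at offsets 48, 64, and 128.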

Triton bitcode tests (torchrun --nproc_per_node=2)

| Test | Description |
| --- | --- |
| shmem_get_nbi_kernel | Non-blocking GET via getmem_nbi_thread + quiet_thread |
| shmem_get_blocking_kernel | Blocking GET via getmem_thread (no separate quiet) |

Test matrix (all PASSED)

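| | MLX5 P2P | MLX5 IBGDA | BNXT P2P | BNXT IBGDA | AINIC P2P | AINIC IBGDA |
| --- | --- | --- | --- | --- | --- | --- |
| C++ Nbi GET (5 tests) | PASSED | PASSED | PASSED | PASSED | PASSED | PASSED |
| C++ Blocking GET (2 tests) | PASSED | PASSED | PASSED | PASSED | PASSED | PASSED |
| Triton get_nbi | PASSED | PASSED | PASSED | PASSED | PASSED | PASSED |
| Triton get_blocking | PASSED | PASSED | PASSED | PASSED | PASSED | PASSED |
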
  • Hardware: MLX5 + MI300X, BNXT Thor2 + MI325X, AINIC Pollara + MI350X
  • Modes tested: Static Heap, VMM Heap (MORI_SHMEM_MODE=VMM_HEAP)

Implement shmem GET interfaces mirroring the existing PUT API pattern:
- P2P transport: direct GPU memory copy from remote peer
- IBGDA/RDMA transport: RDMA READ via PostRead
  - MLX5 (Mellanox): implemented and tested
  - BNXT (Broadcom) / PSD (Pensando): compile-time dispatch added but untested
- SDMA transport: delegated to P2P (TODO: native SDMA GET)
- Both SymmMemObjPtr and pure address-based APIs
- Thread/Warp/Block scope for all transports
- Typed variants for all 11 data types
- C wrapper (bitcode) for Triton IR integration
- Triton test and C++ mpirun test (concurrent_get_thread)

Tested: P2P, IBGDA (MLX5 only), Static Heap, VMM Heap, Triton bitcode
… IsRead> template

Replace runtime bool dispatch with compile-time template parameter for
RDMA read/write WQE construction across all three providers:
- MLX5: Mlx5PostReadWriteImpl<IsRead>, opcode now constexpr
- BNXT: BnxtPostReadWriteImpl<IsRead>, opcode now constexpr
- Ionic/PSD: IonicPostReadWriteImpl<IsRead>, opcode now constexpr

PostWrite/PostRead retained as inline wrappers for backward compatibility.

Tested: PUT + GET, P2P + IBGDA, on MLX5, BNXT, and AINIC (Ionic)
Made-with: Cursor
Add blocking (synchronous) GET that internally calls Nbi + quiet,
following the OpenSHMEM shmem_getmem / shmem_TYPE_get semantics:
- SymmMemObjPtr-based and pure address-based APIs
- Thread/Warp/Block scope for all variants
- Typed variants for all 11 data types
- C wrapper (bitcode) for Triton IR integration
- Python IR ops metadata for Triton @core.extern

Tests added:
- C++ mpirun: blocking GET legacy + address API tests
- Triton: shmem_get_blocking_kernel (getmem_thread)

Tested: P2P + IBGDA on MLX5, BNXT, AINIC (all PASSED)

Note: shmem_g (single-value GET returning T) not yet implemented —
requires ibuf mechanism for IBGDA RDMA READ destination buffer.

Made-with: Cursor
@jhchouuu merged commit 8a4c3a1 into main on Mar 20, 2026
3 of 4 checks passed