Feat(shmem): implement GET (remote read) device API with blocking and non-blocking #193
Merged
Implement shmem GET interfaces mirroring the existing PUT API pattern:
- P2P transport: direct GPU memory copy from remote peer
- IBGDA/RDMA transport: RDMA READ via PostRead
  - MLX5 (Mellanox): implemented and tested
  - BNXT (Broadcom) / PSD (Pensando): compile-time dispatch added but untested
- SDMA transport: delegated to P2P (TODO: native SDMA GET)
- Both SymmMemObjPtr and pure address-based APIs
- Thread/Warp/Block scope for all transports
- Typed variants for all 11 data types
- C wrapper (bitcode) for Triton IR integration
- Triton test and C++ mpirun test (concurrent_get_thread)

Tested: P2P, IBGDA (MLX5 only), Static Heap, VMM Heap, Triton bitcode
… IsRead> template

Replace runtime bool dispatch with compile-time template parameter for RDMA read/write WQE construction across all three providers:
- MLX5: Mlx5PostReadWriteImpl<IsRead>, opcode now constexpr
- BNXT: BnxtPostReadWriteImpl<IsRead>, opcode now constexpr
- Ionic/PSD: IonicPostReadWriteImpl<IsRead>, opcode now constexpr

PostWrite/PostRead retained as inline wrappers for backward compatibility.

Tested: PUT + GET, P2P + IBGDA, on MLX5, BNXT, and AINIC (Ionic)

Made-with: Cursor
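The compile-time dispatch described in this commit can be illustrated with a minimal host-side sketch. Every name below (`Opcode`, `Wqe`, `BuildReadWriteWqe`, `BuildWriteWqe`, `BuildReadWqe`) and the opcode values are hypothetical placeholders, not the MORI or provider definitions; the sketch only shows the general pattern of a `bool` template parameter selecting a `constexpr` opcode, with thin wrappers preserving the old two-function interface.

```cpp
#include <cstdint>

// Placeholder opcode values for illustration; real provider opcodes differ.
enum class Opcode : std::uint8_t { kWrite = 0, kRead = 1 };

// Hypothetical, heavily simplified work-queue entry.
struct Wqe {
  Opcode opcode;
  std::uint64_t localAddr;
  std::uint64_t remoteAddr;
  std::uint32_t bytes;
};

// Single implementation for both directions. IsRead is a template
// parameter, so the opcode is a compile-time constant and no runtime
// branch is needed to pick it.
template <bool IsRead>
constexpr Wqe BuildReadWriteWqe(std::uint64_t localAddr,
                                std::uint64_t remoteAddr,
                                std::uint32_t bytes) {
  constexpr Opcode kOpcode = IsRead ? Opcode::kRead : Opcode::kWrite;
  return Wqe{kOpcode, localAddr, remoteAddr, bytes};
}

// Thin wrappers so existing call sites keep their old names.
constexpr Wqe BuildWriteWqe(std::uint64_t localAddr,
                            std::uint64_t remoteAddr,
                            std::uint32_t bytes) {
  return BuildReadWriteWqe<false>(localAddr, remoteAddr, bytes);
}

constexpr Wqe BuildReadWqe(std::uint64_t localAddr,
                           std::uint64_t remoteAddr,
                           std::uint32_t bytes) {
  return BuildReadWriteWqe<true>(localAddr, remoteAddr, bytes);
}
```

In this sketch, `BuildReadWqe(0x1000, 0x2000, 64)` and `BuildWriteWqe(0x1000, 0x2000, 64)` produce entries that differ only in the `opcode` field, which is the property the wrappers are meant to preserve.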
Add blocking (synchronous) GET that internally calls Nbi + quiet, following the OpenSHMEM shmem_getmem / shmem_TYPE_get semantics:
- SymmMemObjPtr-based and pure address-based APIs
- Thread/Warp/Block scope for all variants
- Typed variants for all 11 data types
- C wrapper (bitcode) for Triton IR integration
- Python IR ops metadata for Triton @core.extern

Tests added:
- C++ mpirun: blocking GET legacy + address API tests
- Triton: shmem_get_blocking_kernel (getmem_thread)

Tested: P2P + IBGDA on MLX5, BNXT, AINIC (all PASSED)

Note: shmem_g (single-value GET returning T) not yet implemented; it requires an ibuf mechanism for the IBGDA RDMA READ destination buffer.

Made-with: Cursor
Motivation
MORI shmem currently only supports PUT (remote write) operations. To align with the OpenSHMEM standard and enable use cases like remote data fetch in MoE combine kernels, this PR adds GET (remote read) — the counterpart to PUT. GET allows a PE to read data from a remote PE's symmetric heap into its own local memory, complementing the existing PUT-based communication model.
Technical Details
Non-blocking GET (`ShmemGetMemNbi*`)
- P2P: direct copy from `peerPtrs[pe]` to local, using `ThreadCopy`/`WarpCopy`/`BlockCopy`
- IBGDA: RDMA READ via `PostRead` (NIC fetches remote data to local GPU memory); user must call `ShmemQuietThread` before accessing data
- `SymmMemObjPtr`-based and pure address-based APIs

Blocking GET (`ShmemGetMem*`)
- `GetMemNbi` + `ShmemQuietThread`: data is ready when the call returns
- Follows OpenSHMEM `shmem_getmem` / `shmem_TYPE_get` semantics

PostReadWrite template refactor
- `PostWrite`/`PostRead` merged into `PostReadWrite<PrvdType, IsRead>` compile-time template across MLX5, BNXT, and Ionic providers
- Opcode is `constexpr` instead of runtime branch
- `PostWrite`/`PostRead` retained as inline wrappers for backward compatibility

Triton IR integration
- C wrappers (`mori_shmem_getmem_nbi_*`, `mori_shmem_getmem_*`) compiled to device bitcode
- `mori.ir.ops` metadata for `@core.extern` Triton kernel integration

Test Plan
C++ tests (`concurrent_get_thread`, via `mpirun -np 2`)

Triton bitcode tests (`torchrun --nproc_per_node=2`)
- `shmem_get_nbi_kernel`: `getmem_nbi_thread` + `quiet_thread`
- `shmem_get_blocking_kernel`: `getmem_thread` (no separate quiet)

Test matrix (all PASSED)
- VMM Heap (`MORI_SHMEM_MODE=VMM_HEAP`)
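
The relationship between the two GET flavors (blocking GET behaving like non-blocking GET followed by quiet) can be sketched with a small host-side model. This is not MORI code: `MockGetQueue`, `PendingRead`, `GetNbi`, `Quiet`, `Get`, and `LocalValueAfterGet` are hypothetical names, and the "remote" buffer is just a local array. The model defers all data movement until quiet, which is stricter than real hardware (a NIC may deliver data earlier), but it captures the guarantee that matters: the destination is only safe to read after quiet. Functions are `constexpr` purely so the behavior can be checked at compile time.

```cpp
#include <cstddef>

// Hypothetical record of one outstanding remote read.
struct PendingRead {
  int* dst;
  const int* src;
  std::size_t count;
};

// Hypothetical stand-in for a transport queue; models ordering only.
struct MockGetQueue {
  static constexpr std::size_t kMaxPending = 8;
  PendingRead pending[kMaxPending] = {};
  std::size_t numPending = 0;

  // Non-blocking GET: records the request; dst is not written yet.
  constexpr bool GetNbi(int* dst, const int* src, std::size_t count) {
    if (numPending == kMaxPending) return false;  // queue full
    pending[numPending++] = PendingRead{dst, src, count};
    return true;
  }

  // Quiet: completes every outstanding read, then empties the queue.
  constexpr void Quiet() {
    for (std::size_t i = 0; i < numPending; ++i) {
      for (std::size_t j = 0; j < pending[i].count; ++j) {
        pending[i].dst[j] = pending[i].src[j];
      }
    }
    numPending = 0;
  }

  // Blocking GET: non-blocking GET followed immediately by quiet,
  // so the data is present when the call returns.
  constexpr bool Get(int* dst, const int* src, std::size_t count) {
    if (!GetNbi(dst, src, count)) return false;
    Quiet();
    return true;
  }
};

// Returns what the local buffer holds after issuing one GET.
// `blocking` picks the GET flavor; `callQuiet` adds an explicit quiet.
constexpr int LocalValueAfterGet(bool blocking, bool callQuiet) {
  int remote[1] = {42};  // stands in for the remote PE's symmetric memory
  int local[1] = {0};
  MockGetQueue q{};
  if (blocking) {
    q.Get(local, remote, 1);
  } else {
    q.GetNbi(local, remote, 1);
  }
  if (callQuiet) q.Quiet();
  return local[0];
}
```

In this model, `LocalValueAfterGet(false, false)` still sees the initial local value because no quiet was issued, while both `LocalValueAfterGet(false, true)` and `LocalValueAfterGet(true, false)` see the remote value. This mirrors the difference between the `getmem_nbi_thread` + `quiet_thread` kernel and the `getmem_thread` kernel in the test plan.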