
Pytorch Symmetric memory backend#6022

Open
saivishal1999 wants to merge 3 commits into NVIDIA:main from saivishal1999:symmetric-memory-pytorch-backends

Conversation

@saivishal1999
Collaborator

No description provided.

@greptile-apps
Contributor

greptile-apps bot commented Mar 2, 2026

Greptile Summary

This PR adds PyTorch symmetric memory backend support as an alternative to nvfuser's native CUDA VMM implementation. The implementation allows users to choose between Native (default), PyTorchNccl, PyTorchNvshmem, or PyTorchCuda backends via NVFUSER_ENABLE=symmetric_memory_backend(...).
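The backend selection described above amounts to mapping an `NVFUSER_ENABLE=symmetric_memory_backend(...)` argument onto an enum. A minimal sketch of that mapping follows; `parseSymmetricMemoryBackend` and the exact option spellings are illustrative assumptions, not nvfuser's actual option-parsing code:

```cpp
#include <string>
#include <unordered_map>

// Hypothetical mirror of the SymmetricMemoryBackend enum added by this PR.
enum class SymmetricMemoryBackend { Native, PyTorchNccl, PyTorchNvshmem, PyTorchCuda };

// Sketch: map an option argument such as "pytorch_nccl" to the enum,
// defaulting to Native for unrecognized input. The real parsing lives in
// nvfuser's options machinery; names and spellings here are assumptions.
inline SymmetricMemoryBackend parseSymmetricMemoryBackend(const std::string& arg) {
  static const std::unordered_map<std::string, SymmetricMemoryBackend> table = {
      {"native", SymmetricMemoryBackend::Native},
      {"pytorch_nccl", SymmetricMemoryBackend::PyTorchNccl},
      {"pytorch_nvshmem", SymmetricMemoryBackend::PyTorchNvshmem},
      {"pytorch_cuda", SymmetricMemoryBackend::PyTorchCuda}};
  auto it = table.find(arg);
  return it == table.end() ? SymmetricMemoryBackend::Native : it->second;
}
```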

Key changes:

  • Added SymmetricMemoryBackend enum and getSymmetricMemoryBackend() to query the selected backend
  • Extended Communicator with getStore() and getWorldBackendIntrusivePtr() to provide PyTorch integration points
  • Implemented PyTorch backend path in SymmetricTensor::allocate() using c10d::symmetric_memory::empty_strided_p2p
  • Added handle caching mechanism to bridge PyTorch allocation and SymmetricTensor construction
  • Methods such as remoteTensor() and multicastPtr() delegate to PyTorch when py_symm_handle_ is set
  • Contiguous view operations explicitly unsupported for PyTorch backend (with clear error messages)
  • Added validation tests and backend selection tests

Issues found:

  • fbuild.sh is a personal build script with hardcoded paths and settings that should not be committed
  • Large block of commented-out test code should be removed or enabled

Confidence Score: 3/5

  • Safe to merge after removing fbuild.sh and addressing commented test code
  • The core implementation is well-structured with proper thread safety (mutex for cache), conditional compilation guards, and clear error messages. However, fbuild.sh must be removed as it's a personal build script, and the commented-out test should be cleaned up. The PyTorch integration logic appears sound, but confidence is reduced slightly by the accidental inclusion of development artifacts.
  • fbuild.sh must be removed; tests/cpp/test_multidevice_symmetric_tensor.cpp needs commented code cleaned up

Important Files Changed

| Filename | Overview |
| --- | --- |
| fbuild.sh | Personal build script with hardcoded paths and settings - should not be committed |
| csrc/multidevice/communicator.cpp | Implements getStore() and getWorldBackendIntrusivePtr() for PyTorch symmetric memory |
| csrc/multidevice/ipc_utils.cpp | Implements getSymmetricMemoryBackend() to query backend from options |
| csrc/multidevice/symmetric_tensor.cpp | Implements PyTorch backend path with handle caching, allocation, and delegation logic |
| tests/cpp/test_multidevice_symmetric_tensor.cpp | Adds backend tests and validation tests; contains commented-out PyTorch backend test |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Start[SymmetricTensor::allocate] --> CheckBackend{getSymmetricMemoryBackend}

    CheckBackend -->|Native| CheckVMM[Check VMM Support]
    CheckBackend -->|PyTorch*| PyTorchPath[PyTorch Backend Path]

    PyTorchPath --> InitBackend[ensurePyTorchSymmMemBackend]
    InitBackend --> SetBackend[set_backend NCCL/NVSHMEM/CUDA]
    SetBackend --> SetGroupInfo[set_group_info with Communicator]
    SetGroupInfo --> ComputeStrides[Compute row-major strides]
    ComputeStrides --> CheckNCCL{Backend == PyTorchNccl?}
    CheckNCCL -->|Yes| AllocNoGroup[empty_strided_p2p with nullopt group]
    CheckNCCL -->|No| AllocWithGroup[empty_strided_p2p with nvfuser_symm group]
    AllocNoGroup --> Rendezvous[c10d::symmetric_memory::rendezvous]
    AllocWithGroup --> Rendezvous
    Rendezvous --> CacheHandle[Cache handle by data_ptr]
    CacheHandle --> ReturnTensor[Return tensor]

    CheckVMM --> NativeAlloc[Native CUDA VMM allocation]
    NativeAlloc --> ReturnTensor

    ReturnTensor --> UserCode[User constructs SymmetricTensor]
    UserCode --> Constructor[SymmetricTensor constructor]
    Constructor --> LookupCache{Check handle cache}
    LookupCache -->|Found| MovePyHandle[Move py_symm_handle_ from cache]
    LookupCache -->|Not found| NativeSetup[Native IPC setup]
    MovePyHandle --> RemoteAccess[Remote tensor access via PyTorch]
    NativeSetup --> RemoteAccess
```
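The "Compute row-major strides" step in the allocation flow is plain contiguous-stride arithmetic: for sizes {256, 512}, a contiguous layout has strides {512, 1}, which is what a strided allocator like empty_strided_p2p would be handed for a contiguous tensor. A minimal sketch, with `rowMajorStrides` as an illustrative name rather than the PR's actual function:

```cpp
#include <cstdint>
#include <vector>

// Sketch: compute contiguous (row-major) strides from tensor sizes.
// The innermost dimension has stride 1; each outer stride is the product
// of all inner sizes.
inline std::vector<int64_t> rowMajorStrides(const std::vector<int64_t>& sizes) {
  std::vector<int64_t> strides(sizes.size(), 1);
  for (int64_t i = static_cast<int64_t>(sizes.size()) - 2; i >= 0; --i) {
    strides[i] = strides[i + 1] * sizes[i + 1];
  }
  return strides;
}
```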

Last reviewed commit: 14816aa

@greptile-apps bot left a comment

10 files reviewed, 2 comments


Comment on lines +1 to +24
```bash
#!/bin/bash

export CC=clang-20
export CXX=clang++-20
export LDFLAGS="-fuse-ld=mold"

export NVFUSER_BUILD_ENABLE_PCH

export UCC_HOME="/opt/hpcx/ucc"
export UCC_DIR="/opt/hpcx/ucc/lib/cmake/ucc"
export UCX_HOME="/opt/hpcx/ucx"
export UCX_DIR="/opt/hpcx/ucx/lib/cmake/ucx"

# export TORCH_CUDA_ARCH_LIST="9.0"

export NVFUSER_BUILD_WITH_UCC=1
export NVFUSER_BUILD_INSTALL_DIR=$BUILD_DIRECTORY/nvfuser
export NVFUSER_BUILD_DIR=$BUILD_DIRECTORY

# Enable debug mode, leave empty for non-debug compilation
export NVFUSER_BUILD_BUILD_TYPE=Debug
export RUN_CMAKE=""

pip install -v -e ./python --no-build-isolation
```

This appears to be a personal build script with hardcoded paths (/opt/hpcx/), specific compiler versions (clang-20), and debug settings. It should not be committed to the repository.

Suggested change (delete the file's contents):

```diff
-#!/bin/bash
-export CC=clang-20
-export CXX=clang++-20
-export LDFLAGS="-fuse-ld=mold"
-export NVFUSER_BUILD_ENABLE_PCH
-export UCC_HOME="/opt/hpcx/ucc"
-export UCC_DIR="/opt/hpcx/ucc/lib/cmake/ucc"
-export UCX_HOME="/opt/hpcx/ucx"
-export UCX_DIR="/opt/hpcx/ucx/lib/cmake/ucx"
-# export TORCH_CUDA_ARCH_LIST="9.0"
-export NVFUSER_BUILD_WITH_UCC=1
-export NVFUSER_BUILD_INSTALL_DIR=$BUILD_DIRECTORY/nvfuser
-export NVFUSER_BUILD_DIR=$BUILD_DIRECTORY
-# Enable debug mode, leave empty for non-debug compilation
-export NVFUSER_BUILD_BUILD_TYPE=Debug
-export RUN_CMAKE=""
-pip install -v -e ./python --no-build-isolation
+# Remove this file - it should not be part of the PR
```

Comment on lines +125 to +163
```cpp
// TEST_F(SymmetricTensorTest, PyTorchBackend_RemoteAccessCorrectness) {
//   if (communicator_->size() == 1) {
//     GTEST_SKIP() << "Skipping test for single device";
//   }
//   SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
//   if (backend == SymmetricMemoryBackend::Native) {
//     GTEST_SKIP()
//         << "PyTorch backend not selected; set NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nccl) to run";
//   }
//
//   const int64_t rank = communicator_->deviceId();
//   const int64_t world_size = communicator_->size();
//
//   at::Tensor local_tensor = SymmetricTensor::allocate(
//       {256, 512}, at::ScalarType::Float, communicator_->device());
//   SymmetricTensor sym_tensor(local_tensor);
//
//   EXPECT_TRUE(local_tensor.is_cuda());
//   EXPECT_EQ(local_tensor.numel(), 256 * 512);
//
//   float local_value = static_cast<float>(rank + 200);
//   local_tensor.fill_(local_value);
//
//   sym_tensor.setupRemoteHandles();
//
//   for (int64_t peer_rank = 0; peer_rank < world_size; ++peer_rank) {
//     void* peer_ptr = sym_tensor.remoteTensor(peer_rank).data_ptr();
//     EXPECT_NE(peer_ptr, nullptr);
//
//     float peer_value;
//     NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
//         &peer_value, peer_ptr, sizeof(float), cudaMemcpyDeviceToHost));
//
//     float expected_value = static_cast<float>(peer_rank + 200);
//     EXPECT_FLOAT_EQ(peer_value, expected_value)
//         << "Rank " << rank << " reading from rank " << peer_rank
//         << " (PyTorch backend)";
//   }
// }
```

Large block of commented-out test code. Either remove it or uncomment and enable it for testing the PyTorch backend path.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
