
Symmetric memory pytorch backends #6023

Open

saivishal1999 wants to merge 4 commits into main from symmetric-memory-pytorch-backends

Conversation

@saivishal1999
Collaborator

No description provided.

@github-actions

github-actions bot commented Mar 2, 2026

Review updated until commit 6996d05

Description

  • Add PyTorch symmetric memory backends (NCCL, NVSHMEM, CUDA) as alternatives to native VMM

  • Implement getSymmetricMemoryBackend() to select backend via NVFUSER_ENABLE=symmetric_memory_backend option

  • Integrate PyTorch's c10d::symmetric_memory for allocation, rendezvous, and remote tensor access

  • Add Communicator methods to expose Store and Backend for PyTorch symmetric memory integration

Changes walkthrough

Relevant files

Enhancement (6 files)
  ipc_utils.h: Add SymmetricMemoryBackend enum and getter (+13/-0)
  ipc_utils.cpp: Implement getSymmetricMemoryBackend option parsing (+18/-0)
  symmetric_tensor.h: Add PyTorch symmetric memory handle member (+15/-6)
  symmetric_tensor.cpp: Implement PyTorch backend allocation and remote access (+162/-1)
  communicator.h: Declare getStore and getWorldBackendIntrusivePtr (+13/-0)
  communicator.cpp: Implement getStore and getWorldBackendIntrusivePtr (+16/-0)

Configuration changes (2 files)
  options.h: Add SymmetricMemoryBackend to EnableOption enum (+2/-0)
  options.cpp: Register symmetric_memory_backend enable option (+1/-0)

Tests (1 file)
  test_multidevice_symmetric_tensor.cpp: Add tests for symmetric memory backend selection (+108/-0)

Miscellaneous (1 file)
  fbuild.sh: Add build script for development (+24/-0)

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Silent fallback to Native backend

When an invalid argument is passed to symmetric_memory_backend option (e.g., "pytorch_invalid"),
getSymmetricMemoryBackend() silently falls back to Native instead of reporting an error.
This could mask user configuration mistakes. Consider adding validation to warn or error
on unknown backend arguments.

SymmetricMemoryBackend getSymmetricMemoryBackend() {
  if (isOptionEnabled(EnableOption::SymmetricMemoryBackend)) {
    if (hasEnableOptionArgument(
            EnableOption::SymmetricMemoryBackend, "pytorch_nccl")) {
      return SymmetricMemoryBackend::PyTorchNccl;
    }
    if (hasEnableOptionArgument(
            EnableOption::SymmetricMemoryBackend, "pytorch_nvshmem")) {
      return SymmetricMemoryBackend::PyTorchNvshmem;
    }
    if (hasEnableOptionArgument(
            EnableOption::SymmetricMemoryBackend, "pytorch_cuda")) {
      return SymmetricMemoryBackend::PyTorchCuda;
    }
  }
  return SymmetricMemoryBackend::Native;
}
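One way to implement the suggested validation is to map the known backend arguments explicitly and reject anything else. The following is a minimal standalone sketch; `parseBackendArgument` is a hypothetical helper (the real code uses `hasEnableOptionArgument`, which is not reproduced here):

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Standalone sketch: enum mirrors the PR's SymmetricMemoryBackend.
enum class SymmetricMemoryBackend { Native, PyTorchNccl, PyTorchNvshmem, PyTorchCuda };

// Hypothetical helper: resolve a backend argument string, throwing on
// unknown values instead of silently falling back to Native.
SymmetricMemoryBackend parseBackendArgument(const std::string& arg) {
  static const std::map<std::string, SymmetricMemoryBackend> known = {
      {"pytorch_nccl", SymmetricMemoryBackend::PyTorchNccl},
      {"pytorch_nvshmem", SymmetricMemoryBackend::PyTorchNvshmem},
      {"pytorch_cuda", SymmetricMemoryBackend::PyTorchCuda},
  };
  auto it = known.find(arg);
  if (it == known.end()) {
    // Surface the configuration mistake instead of masking it.
    throw std::invalid_argument(
        "Unknown symmetric_memory_backend argument: " + arg);
  }
  return it->second;
}
```

With this shape, a typo like `pytorch_invalid` fails loudly at option-parsing time rather than silently selecting the native path.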
PyTorch backend tests commented out

The test PyTorchBackend_RemoteAccessCorrectness (lines 125-163) is commented out. Since this
PR introduces PyTorch symmetric memory backends, having at least one active test for the
non-native paths would be valuable to ensure correctness. Consider enabling or adding an
alternative test for the PyTorch backend path.

// TEST_F(SymmetricTensorTest, PyTorchBackend_RemoteAccessCorrectness) {
//   if (communicator_->size() == 1) {
//     GTEST_SKIP() << "Skipping test for single device";
//   }
//   SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
//   if (backend == SymmetricMemoryBackend::Native) {
//     GTEST_SKIP()
//         << "PyTorch backend not selected; set NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nccl) to run";
//   }

//   const int64_t rank = communicator_->deviceId();
//   const int64_t world_size = communicator_->size();

//   at::Tensor local_tensor = SymmetricTensor::allocate(
//       {256, 512}, at::ScalarType::Float, communicator_->device());
//   SymmetricTensor sym_tensor(local_tensor);

//   EXPECT_TRUE(local_tensor.is_cuda());
//   EXPECT_EQ(local_tensor.numel(), 256 * 512);

//   float local_value = static_cast<float>(rank + 200);
//   local_tensor.fill_(local_value);

//   sym_tensor.setupRemoteHandles();

//   for (int64_t peer_rank = 0; peer_rank < world_size; ++peer_rank) {
//     void* peer_ptr = sym_tensor.remoteTensor(peer_rank).data_ptr();
//     EXPECT_NE(peer_ptr, nullptr);

//     float peer_value;
//     NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
//         &peer_value, peer_ptr, sizeof(float), cudaMemcpyDeviceToHost));

//     float expected_value = static_cast<float>(peer_rank + 200);
//     EXPECT_FLOAT_EQ(peer_value, expected_value)
//         << "Rank " << rank << " reading from rank " << peer_rank
//         << " (PyTorch backend)";
//   }
// }
Unnecessary build script added

A new file fbuild.sh was added which appears to be a local development/build script with
hardcoded paths (e.g., /opt/hpcx/ucc). This should likely be removed from the PR as it's
not part of the feature implementation and contains machine-specific configuration.

#!/bin/bash

export CC=clang-20
export CXX=clang++-20
export LDFLAGS="-fuse-ld=mold"

export NVFUSER_BUILD_ENABLE_PCH

export UCC_HOME="/opt/hpcx/ucc"
export UCC_DIR="/opt/hpcx/ucc/lib/cmake/ucc"
export UCX_HOME="/opt/hpcx/ucx"
export UCX_DIR="/opt/hpcx/ucx/lib/cmake/ucx"

# export TORCH_CUDA_ARCH_LIST="9.0"

export NVFUSER_BUILD_WITH_UCC=1
export NVFUSER_BUILD_INSTALL_DIR=$BUILD_DIRECTORY/nvfuser
export NVFUSER_BUILD_DIR=$BUILD_DIRECTORY

# Enable debug mode, leave empty for non-debug compilation
export NVFUSER_BUILD_BUILD_TYPE=Debug
export RUN_CMAKE=""

pip install -v -e ./python --no-build-isolation

@greptile-apps
Contributor

greptile-apps bot commented Mar 2, 2026

Greptile Summary

This PR adds three new PyTorch-backed symmetric memory allocation strategies (pytorch_nccl, pytorch_nvshmem, pytorch_cuda) alongside the existing native CUDA VMM path, selectable via NVFUSER_ENABLE=symmetric_memory_backend(...). The integration wires c10d::symmetric_memory::empty_strided_p2p and rendezvous into SymmetricTensor::allocate, stores the returned handle in a process-wide cache, and delegates remoteTensor, multicastPtr, and lifecycle management to the PyTorch handle when it is present.

Key finding:

  • call_once exception-safety (symmetric_tensor.cpp:46): set_backend() and set_group_info() share the same once_flag. If set_backend succeeds but set_group_info throws, the flag is reset and the next call retries — including set_backend again, which PyTorch may reject as "already configured", making recovery impossible.

Confidence Score: 2/5

  • Not safe to merge — the new PyTorch backend initialization has a call_once exception-safety issue that could result in permanent breakage on error recovery, and existing findings indicate additional problems with incomplete integration and zero automated end-to-end test coverage.
  • The call_once exception-safety issue is a concrete correctness bug where partial initialization failure leaves the system in an unrecoverable state. Combined with the already-reported issues (missing register_process_group call, commented-out end-to-end test, undefined behavior on 0-dim tensors), the new PyTorch backend path cannot be proven correct in CI.
  • csrc/multidevice/symmetric_tensor.cpp (initialization logic and call_once), tests/cpp/test_multidevice_symmetric_tensor.cpp (commented-out end-to-end test)

Last reviewed commit: 6996d05

Contributor

@greptile-apps greptile-apps bot left a comment

10 files reviewed, 4 comments

Comment on lines +1 to +24
#!/bin/bash

export CC=clang-20
export CXX=clang++-20
export LDFLAGS="-fuse-ld=mold"

export NVFUSER_BUILD_ENABLE_PCH

export UCC_HOME="/opt/hpcx/ucc"
export UCC_DIR="/opt/hpcx/ucc/lib/cmake/ucc"
export UCX_HOME="/opt/hpcx/ucx"
export UCX_DIR="/opt/hpcx/ucx/lib/cmake/ucx"

# export TORCH_CUDA_ARCH_LIST="9.0"

export NVFUSER_BUILD_WITH_UCC=1
export NVFUSER_BUILD_INSTALL_DIR=$BUILD_DIRECTORY/nvfuser
export NVFUSER_BUILD_DIR=$BUILD_DIRECTORY

# Enable debug mode, leave empty for non-debug compilation
export NVFUSER_BUILD_BUILD_TYPE=Debug
export RUN_CMAKE=""

pip install -v -e ./python --no-build-isolation
Personal developer build script committed to repository

This script contains machine-specific, hardcoded toolchain paths that are unlikely to work anywhere except the author's development machine:

  • clang-20 and clang++-20 — not a standard compiler version available broadly
  • -fuse-ld=mold — requires the mold linker to be installed
  • /opt/hpcx/ucc and /opt/hpcx/ucx — HPC-X installation path specific to the author's environment
  • $BUILD_DIRECTORY is used but never validated; if it is unset, NVFUSER_BUILD_INSTALL_DIR and NVFUSER_BUILD_DIR will silently be empty strings, likely breaking the build

This kind of personal convenience script should live outside version control (e.g., in a .gitignore-d directory or in the author's home directory). Committing it to the main repo risks confusing other contributors and cluttering the root directory.

Comment on lines +46 to +72
void ensurePyTorchSymmMemBackend(SymmetricMemoryBackend backend) {
  static std::once_flag once;
  std::call_once(once, [backend]() {
    const char* name = nullptr;
    switch (backend) {
      case SymmetricMemoryBackend::PyTorchNccl:
        name = "NCCL";
        break;
      case SymmetricMemoryBackend::PyTorchNvshmem:
        name = "NVSHMEM";
        break;
      case SymmetricMemoryBackend::PyTorchCuda:
        name = "CUDA";
        break;
      default:
        NVF_ERROR(false, "Unexpected PyTorch symmetric memory backend");
    }
    c10d::symmetric_memory::set_backend(name);
    Communicator& comm = Communicator::getInstance();
    NVF_CHECK(comm.is_available(), "Communicator not available for symmetric memory");
    c10d::symmetric_memory::set_group_info(
        kPyTorchSymmMemGroupName,
        static_cast<int>(comm.deviceId()),
        static_cast<int>(comm.size()),
        comm.getStore());
  });
}
NCCL backend initialization is incomplete — register_process_group is never called

ensurePyTorchSymmMemBackend calls set_group_info but never calls c10d::register_process_group. According to the comment added to communicator.h for getWorldBackendIntrusivePtr:

Returns the world backend as an intrusive_ptr so it can be registered with c10d::register_process_group (e.g. for PyTorch symmetric memory NCCL rendezvous, which resolves the group by name).

getWorldBackendIntrusivePtr was clearly introduced to supply the backend for this registration, yet the call to c10d::register_process_group is absent from ensurePyTorchSymmMemBackend. PyTorch's NCCL symmetric-memory rendezvous resolves the process group by name at the point it is called; without a prior register_process_group(kPyTorchSymmMemGroupName, ...), the NCCL backend path will fail to locate the group and throw at rendezvous time.

The missing call should be something like:

// After set_group_info, for NCCL backend:
c10d::register_process_group(
    kPyTorchSymmMemGroupName,
    comm.getWorldBackendIntrusivePtr(CommunicatorBackend::kNccl));

The fact that getWorldBackendIntrusivePtr was added in this exact PR but is never invoked strongly suggests this step was accidentally left out.

Comment on lines +125 to +163
// TEST_F(SymmetricTensorTest, PyTorchBackend_RemoteAccessCorrectness) {
//   if (communicator_->size() == 1) {
//     GTEST_SKIP() << "Skipping test for single device";
//   }
//   SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
//   if (backend == SymmetricMemoryBackend::Native) {
//     GTEST_SKIP()
//         << "PyTorch backend not selected; set NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nccl) to run";
//   }

//   const int64_t rank = communicator_->deviceId();
//   const int64_t world_size = communicator_->size();

//   at::Tensor local_tensor = SymmetricTensor::allocate(
//       {256, 512}, at::ScalarType::Float, communicator_->device());
//   SymmetricTensor sym_tensor(local_tensor);

//   EXPECT_TRUE(local_tensor.is_cuda());
//   EXPECT_EQ(local_tensor.numel(), 256 * 512);

//   float local_value = static_cast<float>(rank + 200);
//   local_tensor.fill_(local_value);

//   sym_tensor.setupRemoteHandles();

//   for (int64_t peer_rank = 0; peer_rank < world_size; ++peer_rank) {
//     void* peer_ptr = sym_tensor.remoteTensor(peer_rank).data_ptr();
//     EXPECT_NE(peer_ptr, nullptr);

//     float peer_value;
//     NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
//         &peer_value, peer_ptr, sizeof(float), cudaMemcpyDeviceToHost));

//     float expected_value = static_cast<float>(peer_rank + 200);
//     EXPECT_FLOAT_EQ(peer_value, expected_value)
//         << "Rank " << rank << " reading from rank " << peer_rank
//         << " (PyTorch backend)";
//   }
// }
Entire PyTorch backend correctness test is commented out

PyTorchBackend_RemoteAccessCorrectness is the only test that exercises the new PyTorch backend path end-to-end (allocation → rendezvous → remote access). Leaving it commented out means the three new backend variants (pytorch_nccl, pytorch_nvshmem, pytorch_cuda) have zero test coverage in CI.

The comment says it should be run manually with NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nccl), but that means regressions in the PyTorch path will go undetected in normal CI runs.

If the test can't pass yet (e.g., because the NCCL register_process_group call is missing), that's a strong signal to fix the underlying issue rather than suppress the test. If the test is intentionally deferred, consider converting it into a proper GTEST_SKIP with an explanatory message so the intent is visible to reviewers and CI.

Comment on lines +150 to +152
std::vector<int64_t> strides(sizes.size());
strides.back() = 1;
for (int64_t i = (int64_t)strides.size() - 2; i >= 0; --i) {
Undefined behavior when sizes is empty (0-dim tensor)

std::vector<int64_t> strides(sizes.size());
strides.back() = 1;   // UB if sizes is empty

std::vector::back() on an empty vector is undefined behaviour. The same guard-free pattern also exists in the native path further down in the same function (~line 225). While allocating a 0-dimensional symmetric tensor is unusual, the newly added PyTorch path introduces another callsite where callers may pass {} as sizes. A simple check is sufficient:

NVF_CHECK(!sizes.empty(), "Cannot allocate a 0-dim symmetric tensor");

or initialise strides defensively (matching the standard row-major convention for 0-dim tensors, which is an empty strides vector) and skip the loop entirely when sizes is empty.
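The defensive variant can be sketched as a standalone helper. `contiguousStrides` is a hypothetical name introduced for illustration; in the PR the computation is inlined in `SymmetricTensor::allocate`:

```cpp
#include <cstdint>
#include <vector>

// Compute contiguous row-major strides for a given shape. An empty
// `sizes` (0-dim tensor) yields an empty strides vector, avoiding the
// undefined strides.back() access on an empty vector.
std::vector<int64_t> contiguousStrides(const std::vector<int64_t>& sizes) {
  std::vector<int64_t> strides(sizes.size());
  if (!sizes.empty()) {
    strides.back() = 1;
    for (int64_t i = static_cast<int64_t>(strides.size()) - 2; i >= 0; --i) {
      strides[i] = strides[i + 1] * sizes[i + 1];
    }
  }
  return strides;
}
```

For example, a {256, 512} shape yields strides {512, 1}, while {} yields {} (the standard convention for 0-dim tensors).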

@nsarka
Member

nsarka commented Mar 3, 2026

Sorry! I accidentally hit the button to merge main into the branch. Hopefully it's ok.

Comment on lines +46 to +72
void ensurePyTorchSymmMemBackend(SymmetricMemoryBackend backend) {
static std::once_flag once;
std::call_once(once, [backend]() {
const char* name = nullptr;
switch (backend) {
case SymmetricMemoryBackend::PyTorchNccl:
name = "NCCL";
break;
case SymmetricMemoryBackend::PyTorchNvshmem:
name = "NVSHMEM";
break;
case SymmetricMemoryBackend::PyTorchCuda:
name = "CUDA";
break;
default:
NVF_ERROR(false, "Unexpected PyTorch symmetric memory backend");
}
c10d::symmetric_memory::set_backend(name);
Communicator& comm = Communicator::getInstance();
NVF_CHECK(comm.is_available(), "Communicator not available for symmetric memory");
c10d::symmetric_memory::set_group_info(
kPyTorchSymmMemGroupName,
static_cast<int>(comm.deviceId()),
static_cast<int>(comm.size()),
comm.getStore());
});
}
std::call_once exception-safety leaves set_backend in a permanently broken state on retry

std::call_once resets its once_flag if the callable exits via an exception, allowing a subsequent call to retry. However, the callable here calls set_backend(name) before set_group_info(...). If set_backend succeeds but set_group_info subsequently throws (e.g., because the store is unavailable), once_flag is reset and the next allocate() call will attempt set_backend(name) a second time. PyTorch's symmetric memory layer is likely to throw on that second set_backend call (backend already configured), making it impossible to recover without restarting the process.

A straightforward mitigation is to separate the two calls into distinct phases or to wrap set_backend in its own protection:

// Separate once-flags for each idempotent step, or catch and suppress
// the "already set" error from set_backend on retry:
try {
  c10d::symmetric_memory::set_backend(name);
} catch (const std::exception& e) {
  // If the backend is already set to the correct name, treat as success.
  // Re-throw otherwise.
}
c10d::symmetric_memory::set_group_info(
    kPyTorchSymmMemGroupName, ...);

Alternatively, split the once_flag so set_backend has its own dedicated guard that truly runs at most once, while set_group_info can retry on failure.
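The split-flag approach can be demonstrated in isolation. In this minimal standalone sketch, `setBackendOnce` and `setGroupInfoMayThrow` are stand-ins for the real c10d calls, with the transient failure simulated:

```cpp
#include <mutex>
#include <stdexcept>

static int backend_set_count = 0;    // tracks how often the non-retryable step ran
static int group_info_attempts = 0;  // tracks attempts of the retryable step

void setBackendOnce() {
  ++backend_set_count;  // must run at most once, even across retries
}

void setGroupInfoMayThrow() {
  if (++group_info_attempts == 1) {
    // Simulated transient failure (e.g., store unavailable) on first attempt.
    throw std::runtime_error("store unavailable");
  }
}

void ensureBackend() {
  static std::once_flag backend_once;
  static std::once_flag group_once;
  // std::call_once resets a flag when its callable throws, so group_once
  // can be retried on a later call, while backend_once only completes a
  // callable that returned normally and therefore never re-runs it.
  std::call_once(backend_once, setBackendOnce);
  std::call_once(group_once, setGroupInfoMayThrow);
}
```

The first call throws out of the group-info step, but a retry completes without re-entering the backend step, which is exactly the recovery behavior the single shared once_flag cannot provide.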

