Conversation

Review updated until commit 6996d05

Description

| Relevant files | |
|---|---|
| Enhancement | 6 files |
| Configuration changes | |
| Tests | 1 file |
| Miscellaneous | 1 file |
PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

Silent fallback to Native backend

getSymmetricMemoryBackend() silently falls back to Native instead of reporting an error. This could mask user configuration mistakes. Consider adding validation to warn or error on unknown backend arguments.
Greptile Summary

This PR adds three new PyTorch-backed symmetric memory allocation strategies. Key finding:
Confidence Score: 2/5

Last reviewed commit: 6996d05
```bash
#!/bin/bash

export CC=clang-20
export CXX=clang++-20
export LDFLAGS="-fuse-ld=mold"

export NVFUSER_BUILD_ENABLE_PCH

export UCC_HOME="/opt/hpcx/ucc"
export UCC_DIR="/opt/hpcx/ucc/lib/cmake/ucc"
export UCX_HOME="/opt/hpcx/ucx"
export UCX_DIR="/opt/hpcx/ucx/lib/cmake/ucx"

# export TORCH_CUDA_ARCH_LIST="9.0"

export NVFUSER_BUILD_WITH_UCC=1
export NVFUSER_BUILD_INSTALL_DIR=$BUILD_DIRECTORY/nvfuser
export NVFUSER_BUILD_DIR=$BUILD_DIRECTORY

# Enable debug mode, leave empty for non-debug compilation
export NVFUSER_BUILD_BUILD_TYPE=Debug
export RUN_CMAKE=""

pip install -v -e ./python --no-build-isolation
```
Personal developer build script committed to repository
This script contains machine-specific, hardcoded toolchain paths that are unlikely to work anywhere except the author's development machine:
- `clang-20` and `clang++-20`: not a standard compiler version available broadly
- `-fuse-ld=mold`: requires the `mold` linker to be installed
- `/opt/hpcx/ucc` and `/opt/hpcx/ucx`: HPC-X installation paths specific to the author's environment
- `$BUILD_DIRECTORY` is used but never validated; if it is unset, `NVFUSER_BUILD_INSTALL_DIR` and `NVFUSER_BUILD_DIR` will silently be empty strings, likely breaking the build

This kind of personal convenience script should live outside version control (e.g., in a `.gitignore`d directory or in the author's home directory). Committing it to the main repo risks confusing other contributors and cluttering the root directory.
```cpp
void ensurePyTorchSymmMemBackend(SymmetricMemoryBackend backend) {
  static std::once_flag once;
  std::call_once(once, [backend]() {
    const char* name = nullptr;
    switch (backend) {
      case SymmetricMemoryBackend::PyTorchNccl:
        name = "NCCL";
        break;
      case SymmetricMemoryBackend::PyTorchNvshmem:
        name = "NVSHMEM";
        break;
      case SymmetricMemoryBackend::PyTorchCuda:
        name = "CUDA";
        break;
      default:
        NVF_ERROR(false, "Unexpected PyTorch symmetric memory backend");
    }
    c10d::symmetric_memory::set_backend(name);
    Communicator& comm = Communicator::getInstance();
    NVF_CHECK(comm.is_available(), "Communicator not available for symmetric memory");
    c10d::symmetric_memory::set_group_info(
        kPyTorchSymmMemGroupName,
        static_cast<int>(comm.deviceId()),
        static_cast<int>(comm.size()),
        comm.getStore());
  });
}
```
NCCL backend initialization is incomplete: register_process_group is never called

ensurePyTorchSymmMemBackend calls set_group_info but never calls c10d::register_process_group. According to the comment added to communicator.h for getWorldBackendIntrusivePtr:

> Returns the world backend as an intrusive_ptr so it can be registered with
> c10d::register_process_group (e.g. for PyTorch symmetric memory NCCL rendezvous, which resolves the group by name).

getWorldBackendIntrusivePtr was clearly introduced to supply the backend for this registration, yet the call to c10d::register_process_group is absent from ensurePyTorchSymmMemBackend. PyTorch's NCCL symmetric-memory rendezvous resolves the process group by name at the point it is called; without a prior register_process_group(kPyTorchSymmMemGroupName, ...), the NCCL backend path will fail to locate the group and throw at rendezvous time.
The missing call should be something like:

```cpp
// After set_group_info, for NCCL backend:
c10d::register_process_group(
    kPyTorchSymmMemGroupName,
    comm.getWorldBackendIntrusivePtr(CommunicatorBackend::kNccl));
```

The fact that getWorldBackendIntrusivePtr was added in this exact PR but is never invoked strongly suggests this step was accidentally left out.
```cpp
// TEST_F(SymmetricTensorTest, PyTorchBackend_RemoteAccessCorrectness) {
//   if (communicator_->size() == 1) {
//     GTEST_SKIP() << "Skipping test for single device";
//   }
//   SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
//   if (backend == SymmetricMemoryBackend::Native) {
//     GTEST_SKIP()
//         << "PyTorch backend not selected; set NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nccl) to run";
//   }
//
//   const int64_t rank = communicator_->deviceId();
//   const int64_t world_size = communicator_->size();
//
//   at::Tensor local_tensor = SymmetricTensor::allocate(
//       {256, 512}, at::ScalarType::Float, communicator_->device());
//   SymmetricTensor sym_tensor(local_tensor);
//
//   EXPECT_TRUE(local_tensor.is_cuda());
//   EXPECT_EQ(local_tensor.numel(), 256 * 512);
//
//   float local_value = static_cast<float>(rank + 200);
//   local_tensor.fill_(local_value);
//
//   sym_tensor.setupRemoteHandles();
//
//   for (int64_t peer_rank = 0; peer_rank < world_size; ++peer_rank) {
//     void* peer_ptr = sym_tensor.remoteTensor(peer_rank).data_ptr();
//     EXPECT_NE(peer_ptr, nullptr);
//
//     float peer_value;
//     NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
//         &peer_value, peer_ptr, sizeof(float), cudaMemcpyDeviceToHost));
//
//     float expected_value = static_cast<float>(peer_rank + 200);
//     EXPECT_FLOAT_EQ(peer_value, expected_value)
//         << "Rank " << rank << " reading from rank " << peer_rank
//         << " (PyTorch backend)";
//   }
// }
```
Entire PyTorch backend correctness test is commented out
PyTorchBackend_RemoteAccessCorrectness is the only test that exercises the new PyTorch backend path end-to-end (allocation → rendezvous → remote access). Leaving it commented out means the three new backend variants (pytorch_nccl, pytorch_nvshmem, pytorch_cuda) have zero test coverage in CI.
The comment says it should be run manually with NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nccl), but that means regressions in the PyTorch path will go undetected in normal CI runs.
If the test can't pass yet (e.g., because the NCCL register_process_group call is missing), that's a strong signal to fix the underlying issue rather than suppress the test. If the test is intentionally deferred, consider converting it into a proper GTEST_SKIP with an explanatory message so the intent is visible to reviewers and CI.
```cpp
std::vector<int64_t> strides(sizes.size());
strides.back() = 1;
for (int64_t i = (int64_t)strides.size() - 2; i >= 0; --i) {
```
Undefined behavior when sizes is empty (0-dim tensor)

```cpp
std::vector<int64_t> strides(sizes.size());
strides.back() = 1; // UB if sizes is empty
```

std::vector::back() on an empty vector is undefined behaviour. The same guard-free pattern also exists in the native path further down in the same function (~line 225). While allocating a 0-dimensional symmetric tensor is unusual, the PyTorch path that was just added introduces a new callsite where callers may pass {} as sizes. A simple check is sufficient:

```cpp
NVF_CHECK(!sizes.empty(), "Cannot allocate a 0-dim symmetric tensor");
```

Or initialise strides defensively (matching the standard row-major convention for 0-dim tensors, which is an empty strides vector) and skip the loop entirely when sizes is empty.
Sorry! I accidentally hit the button to merge main into the branch. Hopefully it's ok.
```cpp
void ensurePyTorchSymmMemBackend(SymmetricMemoryBackend backend) {
  static std::once_flag once;
  std::call_once(once, [backend]() {
    const char* name = nullptr;
    switch (backend) {
      case SymmetricMemoryBackend::PyTorchNccl:
        name = "NCCL";
        break;
      case SymmetricMemoryBackend::PyTorchNvshmem:
        name = "NVSHMEM";
        break;
      case SymmetricMemoryBackend::PyTorchCuda:
        name = "CUDA";
        break;
      default:
        NVF_ERROR(false, "Unexpected PyTorch symmetric memory backend");
    }
    c10d::symmetric_memory::set_backend(name);
    Communicator& comm = Communicator::getInstance();
    NVF_CHECK(comm.is_available(), "Communicator not available for symmetric memory");
    c10d::symmetric_memory::set_group_info(
        kPyTorchSymmMemGroupName,
        static_cast<int>(comm.deviceId()),
        static_cast<int>(comm.size()),
        comm.getStore());
  });
}
```
std::call_once exception-safety leaves set_backend in a permanently broken state on retry
std::call_once resets its once_flag if the callable exits via an exception, allowing a subsequent call to retry. However, the callable here calls set_backend(name) before set_group_info(...). If set_backend succeeds but set_group_info subsequently throws (e.g., because the store is unavailable), once_flag is reset and the next allocate() call will attempt set_backend(name) a second time. PyTorch's symmetric memory layer is likely to throw on that second set_backend call (backend already configured), making it impossible to recover without restarting the process.
A straightforward mitigation is to separate the two calls into distinct phases or to wrap set_backend in its own protection:

```cpp
// Separate once-flags for each idempotent step, or catch and suppress
// the "already set" error from set_backend on retry:
try {
  c10d::symmetric_memory::set_backend(name);
} catch (const std::exception& e) {
  // If the backend is already set to the correct name, treat as success.
  // Re-throw otherwise.
}
c10d::symmetric_memory::set_group_info(
    kPyTorchSymmMemGroupName, ...);
```

Alternatively, split the once_flag so set_backend has its own dedicated guard that truly runs at most once, while set_group_info can retry on failure.
No description provided.