Support async mode for shm allreduce #484

Open

gaopengff wants to merge 8 commits into pytorch:main from gaopengff:gaopengf/support_torch_async

Conversation

@gaopengff (Contributor) commented:

This fixes a CI failure in PyTorch's bump PR pytorch/pytorch#172297.
In async mode, shmData should be owned exclusively. We added a lock for shmData to make it thread safe, and we use a unique tag for synchronization among the different ranks.
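The lock-plus-tag scheme described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `ShmData`, `beginAllreduce`, and the tag counter are hypothetical names standing in for the real `AllreduceSharedMemoryData` machinery.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for the PR's AllreduceSharedMemoryData.
struct ShmData {
  std::string name;
};

std::shared_ptr<ShmData> shmData;      // shared across async allreduce calls
std::mutex shmDataMutex;               // makes access to shmData thread safe
std::atomic<std::uint64_t> nextTag{0}; // unique tag per collective operation

// Each async allreduce takes the lock, lazily creates the shm handle,
// and draws a fresh tag so concurrent operations can be matched across ranks.
std::uint64_t beginAllreduce(const std::string& shmName) {
  std::lock_guard<std::mutex> guard(shmDataMutex);
  if (!shmData) {
    shmData = std::make_shared<ShmData>(ShmData{shmName});
  }
  return nextTag.fetch_add(1);
}
```

The mutex guarantees single-threaded initialization of the shared handle, while the monotonically increasing tag keeps overlapping async operations distinguishable.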

@meta-cla meta-cla bot added the CLA Signed label Jan 20, 2026
@gaopengff (Contributor, Author) commented:

@d4l3k Could you help review this?

@d4l3k (Member) left a comment:

LGTM


std::shared_ptr<AllreduceSharedMemoryData> shmData;

std::mutex shmDataMutex;
Member:

What happens if there are multiple gloo process groups? Does that cause issues at all?

Member:

Also, can we put this under shmData?

@gaopengff (Contributor, Author) replied:

  1. For the multiple-process-groups scenario, I used the gloo context's address to generate a unique ID for the shm name, so different groups use different shm buffers for the allreduce op. I've verified this with a PyTorch test using multiple process groups, and it passed.
  2. I don't think we can put this under shmData. On the first run, shmData is not initialized (nullptr); if multiple threads reach this point, we need to ensure the initialization work is done by only one thread.
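Point 1 might look roughly like the following sketch. This is hypothetical: the real PR derives the ID from the gloo context, and `shmNameForContext` is an invented name for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical sketch: derive a per-process-group shm name from the address
// of the gloo context object, so each group gets its own shm buffer.
std::string shmNameForContext(const void* context) {
  std::ostringstream oss;
  oss << "/gloo_shm_" << std::hex
      << reinterpret_cast<std::uintptr_t>(context);
  return oss.str();
}
```

Since each process group owns a distinct context object, distinct addresses yield distinct shm names, and groups never collide on the same buffer.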

Reply:

> I don't think we can put this under shmData. On the first run, shmData is not initialized (nullptr); if multiple threads reach this point, we need to ensure the initialization work is done by only one thread.

We could potentially move this to allreduce_shm.cc and make both static? That way we keep the global context clean of shm specifics. wdyt?

@gaopengff (Contributor, Author) replied:

Making them static only works in the multi-process scenario. In a multi-threaded scenario, like the gloo unit tests, each thread represents a rank, but the shm_data would be initialized only once, which is not as expected.
Although the thread_local keyword can make a static variable unique per thread, it may cause a performance issue. In a real workload such as PyTorch, the calling sequence is more like:

context init -> call allreduce -> call allreduce -> call allreduce

thread_local will make the shm data initialized every time allreduce is called. The initialization is very expensive, as it allocates the shm buffer.

Reply:

> thread_local will make the shm data initialized every time allreduce is called.

Is that correct? I think it will be initialized only the first time, but I see your point. Okay, I think we can at least wrap this in the macro you created.
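The reviewer's point can be checked with a small standalone experiment (hypothetical names): a thread_local static is constructed once per thread on first use, not on every call, though each new thread does pay the construction cost again.

```cpp
#include <cassert>

// Counts how many times the thread_local object below is constructed.
int initCount = 0;

struct Expensive {
  Expensive() { ++initCount; }  // stands in for allocating the shm buffer
};

// The thread_local static is initialized on the first call in each thread;
// repeated calls from the same thread reuse the existing object.
void callAllreduce() {
  thread_local Expensive shm;
  (void)shm;
}
```

Calling `callAllreduce()` several times from one thread constructs `Expensive` exactly once, which matches the reviewer's "only the first time"; the author's cost concern applies when many short-lived threads each trigger their own initialization.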

@gaopengff (Contributor, Author) replied:

OK, I've wrapped shm_data's declaration in the macro I created.

@kapilsh kapilsh self-requested a review March 7, 2026 11:09
@kapilsh left a comment:

Thanks, looks great. I left a few comments. Can we also add some tests covering the new shm all_reduce, and possibly resource-cleanup tests for shm?


#if !defined(_WIN32) && !defined(__aarch64__) && !defined(__arm__)
if (context->isIntraNode() && !context->getDevice()->hasGPUDirect()) {
algorithm = detail::AllreduceOptionsImpl::SHM;
Review comment:

I don't see a way for users to use an explicit algorithm: this will override anything the user explicitly specifies. Should we check for Algorithm::UNSPECIFIED before we override?

@gaopengff (Contributor, Author) replied:

I've modified it to only override Algorithm::UNSPECIFIED, and only when shm allreduce is applicable. I also added a unit test for shm allreduce in gloo/test/allreduce_test.cc.
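The selection logic being discussed might look roughly like this sketch. It is hypothetical: `pickAlgorithm` and the `RING` fallback are illustrative names, not gloo's actual defaults.

```cpp
#include <cassert>

// Illustrative algorithm enum; only auto-select SHM when the user
// left the algorithm unspecified.
enum class Algorithm { UNSPECIFIED, RING, SHM };

Algorithm pickAlgorithm(Algorithm requested, bool shmApplicable) {
  if (requested != Algorithm::UNSPECIFIED) {
    return requested;  // honor an explicit user choice
  }
  return shmApplicable ? Algorithm::SHM : Algorithm::RING;
}
```

With this shape, an explicit user choice always wins, and the shm path is only chosen as an automatic default on eligible platforms.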


#include <array>
#include <cstring>

#if !defined(_WIN32) && !defined(__aarch64__) && !defined(__arm__)
Review comment:

This seems to be copied in a bunch of places. Can we make it a macro?

@gaopengff (Contributor, Author) replied:

Sure, I've defined a macro for this in gloo/allreduce.h, which is also used in the unit tests:

#if !defined(_WIN32) && !defined(__aarch64__) && !defined(__arm__)
#define GLOO_SHM_ALLREDUCE_APPLICABLE 1
#else
#define GLOO_SHM_ALLREDUCE_APPLICABLE 0
#endif

@gaopengff gaopengff requested a review from kapilsh March 10, 2026 06:22
@meta-codesync

meta-codesync bot commented Mar 10, 2026

@kapilsh has imported this pull request. If you are a Meta employee, you can view this in D95938037.
