Conversation
@d4l3k Could you help review this?

    std::shared_ptr<AllreduceSharedMemoryData> shmData;
    std::mutex shmDataMutex;
What happens if there are multiple gloo process groups? Does that cause issues at all?
also can we put this under shmData?
- For the multiple-process-groups scenario, I used the gloo context's address to generate a unique ID for the shm name, so different groups use different shm buffers for the allreduce op. I've verified this with a PyTorch test using multiple process groups, and it passed.
- I don't think we can put this under shmData. On the first run shmData is not initialized (nullptr); if multiple threads reach this point, we need to ensure the initialization work is done by only one thread.
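A minimal sketch of the naming scheme described above (the helper name and the hashing step are assumptions for illustration, not the PR's actual code):

```cpp
#include <functional>
#include <string>

// Hypothetical helper: derive a per-process-group shm name from some
// unique identifier of the gloo context, so that different process
// groups open different shm buffers.
std::string makeShmName(const std::string& contextId) {
  // Hash the identifier to keep the name short enough for shm_open.
  std::size_t h = std::hash<std::string>{}(contextId);
  return "/gloo_allreduce_shm_" + std::to_string(h);
}
```

Within one run the name is deterministic per context, and distinct contexts map to distinct buffers (barring hash collisions).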
I don't think we can put this under shmData. On the first run shmData is not initialized (nullptr); if multiple threads reach this point, we need to ensure the initialization work is done by only one thread.

We could potentially move this to allreduce_shm.cc and make both static? That way we keep the global context clean of shm specifics. wdyt?
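For context, the one-time initialization pattern being discussed might look like the following sketch (the struct contents and the getter name are assumptions, not gloo's actual code):

```cpp
#include <memory>
#include <mutex>

// Simplified stand-in for the real AllreduceSharedMemoryData, whose
// construction (shm buffer allocation) is expensive.
struct AllreduceSharedMemoryData {
  int magic = 42;
};

static std::shared_ptr<AllreduceSharedMemoryData> shmData;
static std::mutex shmDataMutex;

// Hypothetical getter: the mutex guarantees that, even when multiple
// threads race here, only one of them allocates the shm data; all
// later callers reuse the same instance.
std::shared_ptr<AllreduceSharedMemoryData> getShmData() {
  std::lock_guard<std::mutex> lock(shmDataMutex);
  if (!shmData) {
    shmData = std::make_shared<AllreduceSharedMemoryData>();
  }
  return shmData;
}
```

Keeping the mutex separate from shmData is what makes this safe: the lock must already exist before anyone checks whether shmData is initialized.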
Making them static works only in the multi-process scenario. In a multi-threaded scenario like the gloo unit tests, each thread represents a rank, but shm_data would be initialized only once, which is not as expected.
Although there is the keyword thread_local to make a static variable unique per thread, it may cause a performance issue. In a real workload such as PyTorch the calling sequence is more like:
context init -> call allreduce -> call allreduce -> call allreduce
thread_local would make shm data get initialized every time allreduce is called. The initialization is very expensive, as it allocates the shm buffer.
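To illustrate the thread_local behavior under discussion, here is a small standalone sketch (none of this is gloo code): each thread gets its own instance, so the lazy initialization runs once per thread rather than once per process.

```cpp
#include <memory>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// One instance per thread: every new thread sees nullptr and pays the
// initialization cost again.
thread_local std::shared_ptr<int> tlsData;

// Spawn numThreads threads, lazily initialize tlsData in each, and
// count how many distinct instances were created.
int countDistinctInstances(int numThreads) {
  std::set<const int*> seen;
  std::vector<std::shared_ptr<int>> keepAlive;  // prevent address reuse
  std::mutex m;
  std::vector<std::thread> threads;
  for (int i = 0; i < numThreads; ++i) {
    threads.emplace_back([&] {
      if (!tlsData) {
        tlsData = std::make_shared<int>(0);  // per-thread "expensive" init
      }
      std::lock_guard<std::mutex> lock(m);
      keepAlive.push_back(tlsData);
      seen.insert(tlsData.get());
    });
  }
  for (auto& t : threads) {
    t.join();
  }
  return static_cast<int>(seen.size());
}
```

So with one long-lived thread the cost is paid once, but a workload that issues allreduce from fresh threads would pay it repeatedly.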
thread_local would make shm data get initialized every time allreduce is called

Is that correct? I think it will be initialized only the first time, but I see your point. Okay, I think we can at least wrap this in the macro you created.
OK, I've wrapped shm_data's declaration in the macro I created.
kapilsh left a comment
Thanks, looks great - I left a few comments. Can we also add some tests covering the new shm all_reduce, and possibly resource cleanup tests for shm?
    #if !defined(_WIN32) && !defined(__aarch64__) && !defined(__arm__)
    if (context->isIntraNode() && !context->getDevice()->hasGPUDirect()) {
      algorithm = detail::AllreduceOptionsImpl::SHM;
I don't see a way for users to specify an explicit algorithm - this will override anything the user explicitly specifies. Should we check for Algorithm::UNSPECIFIED before we override?
I've modified it so that it only overrides Algorithm::UNSPECIFIED, and only when shm allreduce is applicable. I also added a unit test for shm allreduce in gloo/test/allreduce_test.cc.
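A sketch of the resulting selection rule (the enum values and the helper function are illustrative, not gloo's exact API):

```cpp
// Illustrative mirror of the selection logic: only fall back to SHM
// when the user left the algorithm unspecified and SHM allreduce is
// applicable (intra-node, no GPUDirect).
enum class Algorithm { UNSPECIFIED, RING, BCUBE, SHM };

Algorithm selectAlgorithm(Algorithm requested, bool intraNode, bool hasGPUDirect) {
  if (requested == Algorithm::UNSPECIFIED && intraNode && !hasGPUDirect) {
    return Algorithm::SHM;
  }
  return requested;  // never override an explicit user choice
}
```

An explicit choice such as RING passes through untouched even when SHM would be applicable.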
gloo/allreduce.cc (Outdated)
    #include <array>
    #include <cstring>

    #if !defined(_WIN32) && !defined(__aarch64__) && !defined(__arm__)
this seems to be copied in a bunch of places - can we make this a macro?
Sure, I've defined a macro for this in gloo/allreduce.h, which will be used in the unit tests too:

    #if !defined(_WIN32) && !defined(__aarch64__) && !defined(__arm__)
    #define GLOO_SHM_ALLREDUCE_APPLICABLE 1
    #else
    #define GLOO_SHM_ALLREDUCE_APPLICABLE 0
    #endif
This is to fix a CI failure of pytorch in bump PR pytorch/pytorch#172297.
In async mode shmData must be used exclusively, so we added a lock for shmData to make it thread safe and used a unique tag to synchronize among different ranks.