Enable copies between different devices #3135
Conversation
```cpp
CHECK_CUDA(cudaMemcpy(dest.device_write_only(), src.device()+src_offset, num*sizeof(float), cudaMemcpyDeviceToDevice));
{
    if (dest.device_id() != src.device_id())
        CHECK_CUDA(cudaMemcpyPeer(dest.device_write_only(), dest.device_id(), src.device()+src_offset, src.device_id(), num*sizeof(float)));
```
The cuda docs are a little unclear about how, or if, this is different from cudaMemcpy(). What's the difference? My read of the manual is that cudaMemcpy() blocks the host thread until the copy is done but cudaMemcpyPeer doesn't, and other than that they are the same. Or at least the manual doesn't call out any other difference.
The docs in gpu_data_abstract.h for this function should be updated in any case if we do this, since this change would make them wrong with regard to the blocking behavior.
I see what you mean from that particular doc link. The latest version of the docs seems to clear this up a bit. Under "API synchronization behavior", the notes on "Synchronous" memory copies include: "For transfers from device memory to device memory, no host-side synchronization is performed."
What I understand from this is that cudaMemcpy with cudaMemcpyDeviceToDevice will not block the host thread, but will still synchronize the device work. The function descriptions for cudaMemcpy and cudaMemcpyPeer both include a note stating they exhibit "synchronous" behavior for most use cases.
I think the asynchronous note in the description of cudaMemcpyPeer is reminding you that the function falls in the same category of behavior as cudaMemcpyDeviceToDevice, and to achieve fully async behavior requires the async function.
The docs in gpu_data_abstract.h may already be inaccurate if cudaMemcpyDeviceToDevice doesn't block the host.
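To make the categories above concrete, here is a small plain-C++ model of which copy directions block the host according to the "API synchronization behavior" notes and the measurements below. The enum and function names are made up for illustration; this is not a CUDA API call, just a summary of the documented behavior as I understand it:

```cpp
#include <cassert>

// Hypothetical summary (illustration only, not part of the CUDA API) of
// whether a plain cudaMemcpy/cudaMemcpyPeer call blocks the calling host
// thread for each transfer direction.
enum class Transfer { HostToDevice, DeviceToHost, DeviceToDevice, PeerToPeer };

bool blocks_host(Transfer t)
{
    switch (t)
    {
    case Transfer::HostToDevice:  return true;  // synchronous for pageable host memory
    case Transfer::DeviceToHost:  return true;  // host waits for the copy
    case Transfer::DeviceToDevice: return false; // device-side synchronization only
    case Transfer::PeerToPeer:    return false; // same category as device->device
    }
    return true;
}
```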
Hey @davisking, let me know if you are willing to merge this.
I put together an example program (with error checking removed for brevity) that demonstrates the API synchronization behavior as described in the docs I linked in my previous comment. This program measures the function call time as well as the time to explicitly synchronize the host for the various methods.
```cpp
#include <iostream>
#include <vector>
#include <chrono>
#include <cuda_runtime.h>

using namespace std;
using namespace chrono;

typedef high_resolution_clock clk;

#define DATA_SIZE 100000000
#define ELAPSED(diff) duration_cast<microseconds>(diff).count()

#define TIMEIT(call) \
    cudaEventRecord(start); \
    t1 = clk::now(); \
    call; \
    t2 = clk::now(); \
    cudaEventRecord(stop); \
    cudaDeviceSynchronize(); \
    t3 = clk::now(); \
    cudaEventElapsedTime(&elapsed, start, stop); \
    cout << "function call: " << ELAPSED(t2 - t1) << " microseconds" << endl; \
    cout << "device work: " << int(1000 * elapsed) << " microseconds" << endl; \
    cout << "time to sync host: " << ELAPSED(t3 - t1) << " microseconds" << endl;

int main()
{
    vector<char> host_data;
    void *device_0_data = NULL;
    void *device_0_data_copy = NULL;
    void *device_1_data = NULL;
    time_point<clk> t1, t2, t3;
    cudaEvent_t start, stop;
    float elapsed;

    host_data.resize(DATA_SIZE);

    cudaSetDevice(0);
    cudaMalloc(&device_0_data, DATA_SIZE);
    cudaMalloc(&device_0_data_copy, DATA_SIZE);
    cudaSetDevice(1);
    cudaMalloc(&device_1_data, DATA_SIZE);
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cout << "host->device" << endl;
    TIMEIT(cudaMemcpy(device_0_data, host_data.data(), DATA_SIZE, cudaMemcpyHostToDevice))

    cout << endl << "device->device" << endl;
    TIMEIT(cudaMemcpy(device_0_data_copy, device_0_data, DATA_SIZE, cudaMemcpyDeviceToDevice))

    cout << endl << "device->peer device" << endl;
    TIMEIT(cudaMemcpyPeer(device_1_data, 1, device_0_data, 0, DATA_SIZE));

    cout << endl << "device->host" << endl;
    TIMEIT(cudaMemcpy(host_data.data(), device_0_data, DATA_SIZE, cudaMemcpyDeviceToHost))

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(device_0_data);
    cudaFree(device_0_data_copy);
    cudaFree(device_1_data);
    return 0;
}
```
I ran this on a machine with two RTX 4500's connected by an NVLink bridge. The output is below.
```
host->device
function call: 9606 microseconds
device work: 9626 microseconds
time to sync host: 9628 microseconds

device->device
function call: 16 microseconds
device work: 367 microseconds
time to sync host: 370 microseconds

device->peer device
function call: 10 microseconds
device work: 1930 microseconds
time to sync host: 1938 microseconds

device->host
function call: 4817 microseconds
device work: 4818 microseconds
time to sync host: 4822 microseconds
```
As you can see, the function call time and the time to sync are nearly the same for both host->device and device->host transfers. This is consistent with the expected host-blocking behavior.
However, cudaMemcpy for device->device and cudaMemcpyPeer both return immediately, and the time to sync is much longer. I believe this supports the conclusion that cudaMemcpyDeviceToDevice does not block the host and that introducing cudaMemcpyPeer would not change the current dlib behavior. This does mean that the docs for dlib::memcpy are already inaccurate with regard to host blocking.
I also want to note that I disabled peer access for the above test as well; cudaMemcpyPeer behaves the same, except that the transfer time is much longer because it does not use the NVLink connection.
Yeah let me look this over tonight or tomorrow.
Yeah, the dlib doc is a little confusing. It does say this at the top of the gpu_data_abstract.h file (but I only looked at the comment on the memcpy() function and forgot about this detail):

> All transfers to the device happen asynchronously with respect to the
> default CUDA stream so that CUDA kernel computations can overlap with data
> transfers. However, any transfers from the device to the host happen
> synchronously in the default CUDA stream. Therefore, you should perform
> all your CUDA kernel launches on the default stream so that transfers back
> to the host do not happen before the relevant computations have completed.
Although I still don't understand the purpose of this PR. I've been reading the cuda docs for a while and near as I can tell cudaMemcpy(..., cudaMemcpyDeviceToDevice) should be the same as cudaMemcpyPeer() here. What am I missing?
I'm assuming there was something you were trying to do that doesn't work with the current dlib code but this makes it work? What is that thing?
I created this PR because my code uses dlib::memcpy to copy a tensor on device 0 to device 1. This resulted in an illegal memory access error (even with peer access enabled). My CUDA version is 12.6.
After researching the issue, I found that cudaMemcpy(..., cudaMemcpyDeviceToDevice) only supports copies within a single device and that you must use cudaMemcpyPeer to copy to another device. However, it turns out you shouldn't have to use cudaMemcpyPeer when Unified Addressing is enabled (which it is in my case). I now believe the error was due to a driver bug, because I tried to reproduce the problem yesterday with an updated driver and there is no error.
The 12.6 documentation didn't mention the behavior of cudaMemcpyDeviceToDevice with Unified Addressing (it only mentions cudaMemcpyDefault). The 13.2 Programming Guide calls it out specifically in Explicit Memory Management.
So, the difference between the two functions is that cudaMemcpyPeer is required for systems (or applications) without Unified Addressing enabled. In my specific case, it was likely just a driver bug.
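The rule described above can be sketched as a tiny decision function. This is plain C++ with hypothetical names, for illustration only; it is not part of dlib or the CUDA API:

```cpp
#include <cassert>

// Illustration of the copy-path rule discussed above (hypothetical names):
// same device, or cross-device with Unified Addressing -> plain cudaMemcpy;
// cross-device without Unified Addressing -> cudaMemcpyPeer.
enum class CopyApi { Memcpy, MemcpyPeer };

CopyApi pick_copy_api(int dst_device, int src_device, bool unified_addressing)
{
    if (dst_device == src_device)
        return CopyApi::Memcpy;      // within one device: plain cudaMemcpy
    if (unified_addressing)
        return CopyApi::Memcpy;      // UVA: cudaMemcpy can cross devices
    return CopyApi::MemcpyPeer;      // no UVA: cudaMemcpyPeer is required
}
```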
Yeah it would only be for ancient setups (like 15 years old now maybe, I forget) where this would matter. Since unified addressing has been around forever at this point.
And yeah, nvidia driver bugs are the worst. I broke my nvidia drivers updating a python package this weekend while testing some other dlib thing and then everything got wonky. You cough on the drivers and something breaks :|
Yeah, definitely. I need to start assuming it's their fault first instead of my code or dlib :)
I do have cards that are 15 years old, but I'm not currently running dlib on them, so I'll close this since my immediate problem is solved.
Thanks again!
When peer access is enabled between two devices, kernel launches remain the same and pointers to the other device can be dereferenced directly. However, without Unified Addressing, memory copies between devices require cudaMemcpyPeer.