
Enable copies between different devices #3135

Closed

kSkip wants to merge 2 commits into davisking:master from kSkip:enable-peer-memcpy

Conversation

Contributor

@kSkip commented Feb 10, 2026

When peer access is enabled between two devices, kernel launches remain the same and pointers to memory on a different device can be dereferenced. However, memory copies require cudaMemcpyPeer.

-    CHECK_CUDA(cudaMemcpy(dest.device_write_only(), src.device()+src_offset, num*sizeof(float), cudaMemcpyDeviceToDevice));
+    if (dest.device_id() != src.device_id())
+        CHECK_CUDA(cudaMemcpyPeer(dest.device_write_only(), dest.device_id(), src.device()+src_offset, src.device_id(), num*sizeof(float)));
Owner

@davisking commented Feb 15, 2026

The CUDA docs are a little unclear about how, or if, this is different from cudaMemcpy(). What's the difference? My read of the manual is that cudaMemcpy() blocks the host thread until the copy is done but cudaMemcpyPeer() doesn't, but other than that they are the same. Or at least the manual doesn't call out any other difference.

The docs in gpu_data_abstract.h for this function should be updated in any case if we do this, since this change would make them wrong with regard to the blocking behavior.

Contributor Author

@kSkip commented Feb 15, 2026

I see what you mean from that particular doc link. The latest version of the docs seems to clear this up a bit. Under "API synchronization behavior," the "synchronous" memory copies include: "For transfers from device memory to device memory, no host-side synchronization is performed."

What I understand from this is that cudaMemcpy with cudaMemcpyDeviceToDevice will not block the host thread, but will still synchronize the device work. The function descriptions for cudaMemcpy and cudaMemcpyPeer both include a note stating they exhibit "synchronous" behavior for most use cases.

I think the asynchronous note in the description of cudaMemcpyPeer is reminding you that the function falls in the same category of behavior as cudaMemcpyDeviceToDevice, and to achieve fully async behavior requires the async function.

The docs in gpu_data_abstract.h may already be inaccurate if cudaMemcpyDeviceToDevice doesn't block the host.
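A minimal sketch of the "fully async" variant mentioned above, assuming a hypothetical application with a buffer on each of two devices (this is not dlib or PR code; the function and buffer names are illustrative):

```cpp
#include <cuda_runtime.h>

// Enqueue a peer copy on a non-default stream. Unlike cudaMemcpy /
// cudaMemcpyPeer, cudaMemcpyPeerAsync does not synchronize any prior
// device work; it only orders the copy within the given stream.
void async_peer_copy(void* dst_on_dev1, const void* src_on_dev0, size_t bytes)
{
    cudaStream_t stream;
    cudaSetDevice(0);
    cudaStreamCreate(&stream);

    // Returns as soon as the copy is enqueued.
    cudaMemcpyPeerAsync(dst_on_dev1, 1, src_on_dev0, 0, bytes, stream);

    // Block the host only when the result is actually needed.
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```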

Contributor Author

Hey @davisking, let me know if you are willing to merge this.

I put together an example program (with error checking removed for brevity) that demonstrates the API synchronization behavior as described in the docs I linked in my previous comment. This program measures the function call time as well as the time to explicitly synchronize the host for the various methods.

#include <iostream>
#include <vector>
#include <chrono>
#include <cuda_runtime.h>

using namespace std;
using namespace chrono;

typedef high_resolution_clock clk;

#define DATA_SIZE 100000000

#define ELAPSED(diff) duration_cast<microseconds>(diff).count()

// Times the function call itself, the device-side work (via CUDA events),
// and the time until the host is fully synchronized.
#define TIMEIT(call) \
    cudaEventRecord(start); \
    t1 = clk::now(); \
    call; \
    t2 = clk::now(); \
    cudaEventRecord(stop); \
    cudaDeviceSynchronize(); \
    t3 = clk::now(); \
    cudaEventElapsedTime(&elapsed, start, stop); \
    cout << "function call: " << ELAPSED(t2 - t1) << " microseconds" << endl; \
    cout << "device work: " << int(1000 * elapsed) << " microseconds" << endl; \
    cout << "time to sync host: " << ELAPSED(t3 - t1) << " microseconds" << endl;


int main()
{
    vector<char> host_data;
    void *device_0_data = NULL;
    void *device_0_data_copy = NULL;
    void *device_1_data = NULL;
    time_point<clk> t1, t2, t3;
    cudaEvent_t start, stop;
    float elapsed;

    host_data.resize(DATA_SIZE);

    cudaSetDevice(0);
    cudaMalloc(&device_0_data, DATA_SIZE);
    cudaMalloc(&device_0_data_copy, DATA_SIZE);

    cudaSetDevice(1);
    cudaMalloc(&device_1_data, DATA_SIZE);

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cout << "host->device" << endl;
    TIMEIT(cudaMemcpy(device_0_data, host_data.data(), DATA_SIZE, cudaMemcpyHostToDevice))

    cout << endl << "device->device" << endl;
    TIMEIT(cudaMemcpy(device_0_data_copy, device_0_data, DATA_SIZE, cudaMemcpyDeviceToDevice))

    cout << endl << "device->peer device" << endl;
    TIMEIT(cudaMemcpyPeer(device_1_data, 1, device_0_data, 0, DATA_SIZE))

    cout << endl << "device->host" << endl;
    TIMEIT(cudaMemcpy(host_data.data(), device_0_data, DATA_SIZE, cudaMemcpyDeviceToHost))

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(device_0_data);
    cudaFree(device_0_data_copy);
    cudaFree(device_1_data);
    return 0;
}

I ran this on a machine with two RTX 4500s connected by an NVLink bridge. The output is below.

host->device
function call: 9606 microseconds
device work: 9626 microseconds
time to sync host: 9628 microseconds

device->device
function call: 16 microseconds
device work: 367 microseconds
time to sync host: 370 microseconds

device->peer device
function call: 10 microseconds
device work: 1930 microseconds
time to sync host: 1938 microseconds

device->host
function call: 4817 microseconds
device work: 4818 microseconds
time to sync host: 4822 microseconds

As you can see, the function call time and the time to sync are nearly the same for both host->device and device->host transfers. This is consistent with the expected host-blocking behavior.

However, cudaMemcpy for device->device and cudaMemcpyPeer both return immediately, and the time to sync is much longer. I believe this supports the conclusion that cudaMemcpyDeviceToDevice does not block the host and that introducing cudaMemcpyPeer would not affect the current dlib behavior. It does mean that the docs for dlib::memcpy are already inaccurate with regard to host blocking.

I also want to note that I disabled peer access for the above test; cudaMemcpyPeer behaves the same, except the transfer takes much longer because it does not use the NVLink connection.

Owner

Yeah let me look this over tonight or tomorrow.

Owner

Yeah the dlib doc is a little confusing. It does say this at the top of the gpu_data_abstract.h file (but I only looked at the comment on the memcpy() function and forgot about this detail):

                All transfers to the device happen asynchronously with respect to the
                default CUDA stream so that CUDA kernel computations can overlap with data
                transfers.  However, any transfers from the device to the host happen
                synchronously in the default CUDA stream.  Therefore, you should perform
                all your CUDA kernel launches on the default stream so that transfers back
                to the host do not happen before the relevant computations have completed.

Although I still don't understand the purpose of this PR. I've been reading the cuda docs for a while and near as I can tell cudaMemcpy(..., cudaMemcpyDeviceToDevice) should be the same as cudaMemcpyPeer() here. What am I missing?

I'm assuming there was something you were trying to do that doesn't work with the current dlib code but this makes it work? What is that thing?

Contributor Author

I created this PR because my code uses dlib::memcpy to copy a tensor on device 0 to device 1. This resulted in an illegal memory access error (even with peer access enabled). My CUDA version is 12.6.

After researching the issue, I found that cudaMemcpy(..., cudaMemcpyDeviceToDevice) only supports copies within a device and that you must use cudaMemcpyPeer to copy to another device. However, it turns out that you shouldn't have to use cudaMemcpyPeer with Unified Addressing enabled (which I have). I now believe the error was due to a driver bug, because I tried to reproduce the problem yesterday with an updated driver and there is no error.

The 12.6 documentation didn't mention the behavior of cudaMemcpyDeviceToDevice with Unified Addressing (it only mentions cudaMemcpyDefault). The 13.2 Programming Guide calls it out specifically in Explicit Memory Management.

So, the difference between the two functions is that cudaMemcpyPeer is required for systems (or applications) without Unified Addressing enabled. In my specific case, it was likely just a driver bug.
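To make the distinction concrete, here is a hedged sketch (not dlib code; function and parameter names are made up for illustration) of how an application could decide at runtime which copy call it needs:

```cpp
#include <cuda_runtime.h>

// Query whether a device participates in the unified virtual address space.
bool unified_addressing(int device)
{
    int ua = 0;
    cudaDeviceGetAttribute(&ua, cudaDevAttrUnifiedAddressing, device);
    return ua != 0;
}

void copy_across_devices(void* dst, int dst_dev,
                         const void* src, int src_dev, size_t bytes)
{
    if (unified_addressing(dst_dev) && unified_addressing(src_dev))
        // With unified addressing the runtime can infer each pointer's
        // device, so a plain cudaMemcpy works across devices.
        cudaMemcpy(dst, src, bytes, cudaMemcpyDefault);
    else
        // Without it, the source and destination devices must be
        // named explicitly.
        cudaMemcpyPeer(dst, dst_dev, src, src_dev, bytes);
}
```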

Owner

Yeah, it would only matter for ancient setups (like 15 years old now maybe, I forget), since unified addressing has been around forever at this point.

And yeah, nvidia driver bugs are the worst. I broke my nvidia drivers updating a python package this weekend while testing some other dlib thing and then everything got wonky. You cough on the drivers and something breaks :|

Contributor Author

Yeah, definitely. I need to start assuming it's their fault first instead of my code or dlib :)

I do have cards that are 15 years old, but I'm not currently running dlib on them, so I'll close this since my immediate problem is solved.

Thanks again!

Owner

No problem :D

Repository owner deleted a comment from dlib-issue-bot Mar 27, 2026
Repository owner deleted a comment from dlib-issue-bot Mar 27, 2026
@kSkip closed this Apr 1, 2026