Enable copies between different devices #3135
Conversation
```cpp
CHECK_CUDA(cudaMemcpy(dest.device_write_only(), src.device()+src_offset, num*sizeof(float), cudaMemcpyDeviceToDevice));
{
    if (dest.device_id() != src.device_id())
        CHECK_CUDA(cudaMemcpyPeer(dest.device_write_only(), dest.device_id(), src.device()+src_offset, src.device_id(), num*sizeof(float)));
```
The cuda docs are a little unclear about how, or if, this is different from cudaMemcpy(). What's the difference? My read of the manual is that cudaMemcpy() blocks the host thread until the copy is done but cudaMemcpyPeer doesn't, and other than that they are the same. Or at least the manual doesn't call out any other difference.
The docs in gpu_data_abstract.h for this function should be updated in any case if we do this, since this change would make them wrong with regard to the blocking behavior.
I see what you mean from that particular doc link. The latest version of the docs seems to clear this up a bit. Under "API synchronization behavior", the notes on "Synchronous" memory copies include: "For transfers from device memory to device memory, no host-side synchronization is performed."
What I understand from this is that cudaMemcpy with cudaMemcpyDeviceToDevice will not block the host thread, but will still synchronize the device work. The function descriptions for cudaMemcpy and cudaMemcpyPeer both include a note stating they exhibit "synchronous" behavior for most use cases.
I think the asynchronous note in the description of cudaMemcpyPeer is reminding you that the function falls in the same category of behavior as cudaMemcpyDeviceToDevice, and to achieve fully async behavior requires the async function.
The docs in gpu_data_abstract.h may already be inaccurate if cudaMemcpyDeviceToDevice doesn't block the host.
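To make the categories above concrete, here is a small plain-C++ model of which copy directions block the host according to the "API synchronization behavior" notes and the measurements below. The enum and function names are made up for illustration; this is not a CUDA API call, just a summary of the documented behavior as I understand it:

```cpp
#include <cassert>

// Hypothetical summary (illustration only, not part of the CUDA API) of
// whether a plain cudaMemcpy/cudaMemcpyPeer call blocks the calling host
// thread for each transfer direction.
enum class Transfer { HostToDevice, DeviceToHost, DeviceToDevice, PeerToPeer };

bool blocks_host(Transfer t)
{
    switch (t)
    {
    case Transfer::HostToDevice:  return true;  // synchronous for pageable host memory
    case Transfer::DeviceToHost:  return true;  // host waits for the copy
    case Transfer::DeviceToDevice: return false; // device-side synchronization only
    case Transfer::PeerToPeer:    return false; // same category as device->device
    }
    return true;
}
```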
Hey @davisking, let me know if you are willing to merge this.
I put together an example program (with error checking removed for brevity) that demonstrates the API synchronization behavior as described in the docs I linked in my previous comment. This program measures the function call time as well as the time to explicitly synchronize the host for the various methods.
```cpp
#include <iostream>
#include <vector>
#include <chrono>
#include <cuda_runtime.h>

using namespace std;
using namespace chrono;

typedef high_resolution_clock clk;

#define DATA_SIZE 100000000
#define ELAPSED(diff) duration_cast<microseconds>(diff).count()

#define TIMEIT(call) \
    cudaEventRecord(start); \
    t1 = clk::now(); \
    call; \
    t2 = clk::now(); \
    cudaEventRecord(stop); \
    cudaDeviceSynchronize(); \
    t3 = clk::now(); \
    cudaEventElapsedTime(&elapsed, start, stop); \
    cout << "function call: " << ELAPSED(t2 - t1) << " microseconds" << endl; \
    cout << "device work: " << int(1000 * elapsed) << " microseconds" << endl; \
    cout << "time to sync host: " << ELAPSED(t3 - t1) << " microseconds" << endl;

int main()
{
    vector<char> host_data;
    void *device_0_data = NULL;
    void *device_0_data_copy = NULL;
    void *device_1_data = NULL;
    time_point<clk> t1, t2, t3;
    cudaEvent_t start, stop;
    float elapsed;

    host_data.resize(DATA_SIZE);

    cudaSetDevice(0);
    cudaMalloc(&device_0_data, DATA_SIZE);
    cudaMalloc(&device_0_data_copy, DATA_SIZE);
    cudaSetDevice(1);
    cudaMalloc(&device_1_data, DATA_SIZE);
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cout << "host->device" << endl;
    TIMEIT(cudaMemcpy(device_0_data, host_data.data(), DATA_SIZE, cudaMemcpyHostToDevice))

    cout << endl << "device->device" << endl;
    TIMEIT(cudaMemcpy(device_0_data_copy, device_0_data, DATA_SIZE, cudaMemcpyDeviceToDevice))

    cout << endl << "device->peer device" << endl;
    TIMEIT(cudaMemcpyPeer(device_1_data, 1, device_0_data, 0, DATA_SIZE));

    cout << endl << "device->host" << endl;
    TIMEIT(cudaMemcpy(host_data.data(), device_0_data, DATA_SIZE, cudaMemcpyDeviceToHost))

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(device_0_data);
    cudaFree(device_0_data_copy);
    cudaFree(device_1_data);
    return 0;
}
```
I ran this on a machine with two RTX 4500's connected by an NVLink bridge. The output is below.
```
host->device
function call: 9606 microseconds
device work: 9626 microseconds
time to sync host: 9628 microseconds

device->device
function call: 16 microseconds
device work: 367 microseconds
time to sync host: 370 microseconds

device->peer device
function call: 10 microseconds
device work: 1930 microseconds
time to sync host: 1938 microseconds

device->host
function call: 4817 microseconds
device work: 4818 microseconds
time to sync host: 4822 microseconds
```
As you can see, the function call time and the time to sync are nearly the same for both host->device and device->host transfers. This is consistent with the expected host-blocking behavior.
However, cudaMemcpy for device->device and cudaMemcpyPeer both return immediately, and the time to sync is much longer. I believe this supports the conclusion that cudaMemcpyDeviceToDevice does not block the host and that introducing cudaMemcpyPeer would not change the current dlib behavior. This does mean that the docs for dlib::memcpy are already inaccurate with regard to host blocking.
I also want to note that I disabled peer access for the above test as well; cudaMemcpyPeer behaves the same, except that the transfer time is much longer because it does not use the NVLink connection.
Yeah let me look this over tonight or tomorrow.
Yeah, the dlib doc is a little confusing. It does say this at the top of the gpu_data_abstract.h file (but I only looked at the comment on the memcpy() function and forgot about this detail):

> All transfers to the device happen asynchronously with respect to the
> default CUDA stream so that CUDA kernel computations can overlap with data
> transfers. However, any transfers from the device to the host happen
> synchronously in the default CUDA stream. Therefore, you should perform
> all your CUDA kernel launches on the default stream so that transfers back
> to the host do not happen before the relevant computations have completed.
Although I still don't understand the purpose of this PR. I've been reading the cuda docs for a while and near as I can tell cudaMemcpy(..., cudaMemcpyDeviceToDevice) should be the same as cudaMemcpyPeer() here. What am I missing?
I'm assuming there was something you were trying to do that doesn't work with the current dlib code but this makes it work? What is that thing?
I created this PR because my code uses dlib::memcpy to copy a tensor on device 0 to device 1. This resulted in an illegal memory access error (even with peer access enabled). My CUDA version is 12.6.
After researching the issue, I found that cudaMemcpy(..., cudaMemcpyDeviceToDevice) only supports copies within a single device and that you must use cudaMemcpyPeer to copy to another device. However, it turns out you shouldn't have to use cudaMemcpyPeer when Unified Addressing is enabled (which it is in my case). I now believe the error was due to a driver bug, because I tried to reproduce the problem yesterday with an updated driver and there is no error.
The 12.6 documentation didn't mention the behavior of cudaMemcpyDeviceToDevice with Unified Addressing (it only mentions cudaMemcpyDefault). The 13.2 Programming Guide calls it out specifically in Explicit Memory Management.
So, the difference between the two functions is that cudaMemcpyPeer is required for systems (or applications) without Unified Addressing enabled. In my specific case, it was likely just a driver bug.
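The rule described above can be sketched as a tiny decision function. This is plain C++ with hypothetical names, for illustration only; it is not part of dlib or the CUDA API:

```cpp
#include <cassert>

// Illustration of the copy-path rule discussed above (hypothetical names):
// same device, or cross-device with Unified Addressing -> plain cudaMemcpy;
// cross-device without Unified Addressing -> cudaMemcpyPeer.
enum class CopyApi { Memcpy, MemcpyPeer };

CopyApi pick_copy_api(int dst_device, int src_device, bool unified_addressing)
{
    if (dst_device == src_device)
        return CopyApi::Memcpy;      // within one device: plain cudaMemcpy
    if (unified_addressing)
        return CopyApi::Memcpy;      // UVA: cudaMemcpy can cross devices
    return CopyApi::MemcpyPeer;      // no UVA: cudaMemcpyPeer is required
}
```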
Yeah it would only be for ancient setups (like 15 years old now maybe, I forget) where this would matter. Since unified addressing has been around forever at this point.
And yeah, nvidia driver bugs are the worst. I broke my nvidia drivers updating a python package this weekend while testing some other dlib thing and then everything got wonky. You cough on the drivers and something breaks :|
Yeah, definitely. I need to start assuming it's their fault first instead of my code or dlib :)
I do have cards that are 15 years old, but I'm not currently running dlib on them, so I'll close this since my immediate problem is solved.
Thanks again!
When peer access is enabled between two devices, kernel launches remain the same and pointers to the other device can be dereferenced directly. However, without Unified Addressing, memory copies between devices require cudaMemcpyPeer.