Device synchronization during an asynchronous memcpy in CUDA



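Suppose I want to perform an asynchronous host-to-device memcpy in CUDA and then immediately launch a kernel. How can I test, from inside the kernel, whether the asynchronous transfer has completed?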

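Sequence the asynchronous copy and the kernel launch on the same CUDA "stream". Operations issued to one stream run in issue order, so the kernel executes only after the asynchronous transfer has completed. The code example below demonstrates: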

#include <stdio.h>
__global__ void kernel(const int *ptr)
{
  printf("Hello, %d\n", *ptr);
}
int main()
{
  int *h_ptr = 0;
  // allocate pinned host memory with cudaMallocHost
  // pinned memory is required for asynchronous copy
  cudaMallocHost(&h_ptr, sizeof(int));
  // look for thirteen in the output
  *h_ptr = 13;
  // allocate device memory
  int *d_ptr = 0;
  cudaMalloc(&d_ptr, sizeof(int));
  // create a stream
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  // sequence the asynchronous copy on our stream
  cudaMemcpyAsync(d_ptr, h_ptr, sizeof(int), cudaMemcpyHostToDevice, stream);
  // sequence the kernel on our stream after the copy
  // the kernel will execute after the copy has completed
  kernel<<<1,1,0,stream>>>(d_ptr);
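  // wait for the copy and the kernel to finish so the device printf output is flushed
  cudaStreamSynchronize(stream);
  // clean up after ourselves
  cudaStreamDestroy(stream);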
  cudaFree(d_ptr);
  cudaFreeHost(h_ptr);
}

Output:

$ nvcc -arch=sm_20 async.cu -run
Hello, 13

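I don't believe there is any supported way to test from inside a kernel whether some asynchronous condition (such as the completion of an asynchronous transfer) has been met. CUDA thread blocks are assumed to execute completely independently of other threads of execution.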
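If the goal is only to find out whether the transfer has finished, the check can be made from the host rather than from the kernel. The following is a minimal sketch, not a definitive implementation: the kernel `add_one`, the helper `copy_then_add_one`, and the event name `copy_done` are hypothetical names introduced for illustration. It records an event on the stream right after the copy and polls it with `cudaEventQuery`, which returns `cudaErrorNotReady` while the work preceding the event is still in flight. Error checking is omitted for brevity.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: reads the copied value and stores value + 1.
__global__ void add_one(const int *in, int *out)
{
  *out = *in + 1;
}

// Hypothetical helper: copies `value` to the device asynchronously, runs the
// kernel on the same stream, and checks completion of the copy from the host.
int copy_then_add_one(int value)
{
  int *h_in = 0, *h_out = 0, *d_in = 0, *d_out = 0;
  // pinned host memory, as required for truly asynchronous copies
  cudaMallocHost(&h_in, sizeof(int));
  cudaMallocHost(&h_out, sizeof(int));
  cudaMalloc(&d_in, sizeof(int));
  cudaMalloc(&d_out, sizeof(int));
  *h_in = value;

  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cudaEvent_t copy_done;
  cudaEventCreate(&copy_done);

  // issue the copy, mark its completion with an event, then issue the kernel;
  // stream ordering guarantees the kernel starts only after the copy is done
  cudaMemcpyAsync(d_in, h_in, sizeof(int), cudaMemcpyHostToDevice, stream);
  cudaEventRecord(copy_done, stream);
  add_one<<<1,1,0,stream>>>(d_in, d_out);
  cudaMemcpyAsync(h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost, stream);

  // host-side test: poll until the host-to-device copy has completed
  while (cudaEventQuery(copy_done) == cudaErrorNotReady) {
    // the host is free to do unrelated work here
  }

  // wait for the kernel and the copy back to the host
  cudaStreamSynchronize(stream);
  int result = *h_out;

  // clean up after ourselves
  cudaEventDestroy(copy_done);
  cudaStreamDestroy(stream);
  cudaFree(d_in);
  cudaFree(d_out);
  cudaFreeHost(h_in);
  cudaFreeHost(h_out);
  return result;
}

int main()
{
  printf("result = %d\n", copy_then_add_one(13));
  return 0;
}
```

The polling loop is there only to show the query. The kernel itself never needs it, because issuing the copy and the kernel to the same stream already orders them.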
