Recording CUDA kernel run time with cudaEventRecord() in a multi-GPU program

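I have a sparse triangular solver that runs on 4 Tesla V100 GPUs. The implementation is finished and the results are accurate. However, I am using a CPU timer to measure the elapsed time. I know a CPU timer is not the ideal way to measure run time, since I could use CUDA events instead.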

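The problem is that I don't know how to use CUDA events with multiple GPUs. In the NVIDIA tutorials I have seen, events are used for inter-GPU synchronization, i.e. waiting for other GPUs to finish their dependencies. Anyway, this is how I define the events: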

cudaEvent_t start_events[num_gpus];
cudaEvent_t end_events[num_gpus];

I can also initialize these events in a loop, setting the current GPU on each iteration.

My kernel execution looks like this:

for (int i = 0; i < num_gpus; i++)
{
    CUDA_FUNC_CALL(cudaSetDevice(i));
    kernel<<<>>>();  // launch configuration and arguments elided
}
for (int i = 0; i < num_gpus; i++)
{
    CUDA_FUNC_CALL(cudaSetDevice(i));
    CUDA_FUNC_CALL(cudaDeviceSynchronize());
}

My question is: how should I use these events to record the run time of each GPU separately?

You need to create two events per GPU, and on each GPU record one event before the kernel call and one after it.

It could look like this:

cudaEvent_t start_events[num_gpus];
cudaEvent_t end_events[num_gpus];
for (int i = 0; i < num_gpus; i++)
{
    CUDA_FUNC_CALL(cudaSetDevice(i));
    CUDA_FUNC_CALL(cudaEventCreate(&start_events[i]));
    CUDA_FUNC_CALL(cudaEventCreate(&end_events[i]));
}
for (int i = 0; i < num_gpus; i++)
{
    CUDA_FUNC_CALL(cudaSetDevice(i));
    // In cudaEventRecord, omit the stream argument or set it to 0 to record
    // in the default stream. It must be the same stream as the one
    // in which the kernel is launched.
    CUDA_FUNC_CALL(cudaEventRecord(start_events[i], stream));
    kernel<<<>>>();  // launch configuration and arguments elided
    CUDA_FUNC_CALL(cudaEventRecord(end_events[i], stream));
}
for (int i = 0; i < num_gpus; i++)
{
    CUDA_FUNC_CALL(cudaSetDevice(i));
    CUDA_FUNC_CALL(cudaDeviceSynchronize());
}
for (int i = 0; i < num_gpus; i++)
{
    // The end event must have completed to get a valid duration.
    // In this example that holds because of the device synchronization above.
    float time_in_ms;
    CUDA_FUNC_CALL(cudaEventElapsedTime(&time_in_ms, start_events[i], end_events[i]));
    printf("Elapsed time on device %d: %f ms\n", i, time_in_ms);
}
for (int i = 0; i < num_gpus; i++)
{
    CUDA_FUNC_CALL(cudaSetDevice(i));
    CUDA_FUNC_CALL(cudaEventDestroy(start_events[i]));
    CUDA_FUNC_CALL(cudaEventDestroy(end_events[i]));
}
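
For reference, the same pattern can be written as a self-contained program. This is only a sketch under a few assumptions: the CUDA_FUNC_CALL macro below is a hypothetical stand-in for the one used in the question (its real definition is not shown), dummy_kernel is a placeholder rather than the actual solver, and everything runs in the default stream. It also waits with cudaEventSynchronize() on each end event instead of cudaDeviceSynchronize(), which blocks only until that event has completed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the CUDA_FUNC_CALL macro from the question:
// aborts with a message if a CUDA runtime call fails.
#define CUDA_FUNC_CALL(call)                                          \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel (not the sparse triangular solver).
__global__ void dummy_kernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = 2.0f * data[idx] + 1.0f;
}

int main()
{
    int num_gpus = 0;
    CUDA_FUNC_CALL(cudaGetDeviceCount(&num_gpus));

    const int n = 1 << 20;
    std::vector<cudaEvent_t> start_events(num_gpus), end_events(num_gpus);
    std::vector<float *> buffers(num_gpus, nullptr);

    // Create the events and a work buffer on each device.
    for (int i = 0; i < num_gpus; i++) {
        CUDA_FUNC_CALL(cudaSetDevice(i));
        CUDA_FUNC_CALL(cudaEventCreate(&start_events[i]));
        CUDA_FUNC_CALL(cudaEventCreate(&end_events[i]));
        CUDA_FUNC_CALL(cudaMalloc((void **)&buffers[i], n * sizeof(float)));
        CUDA_FUNC_CALL(cudaMemset(buffers[i], 0, n * sizeof(float)));
    }

    // Launch on every device without waiting, so the GPUs run concurrently.
    // Events and kernel all go to the default stream (stream 0).
    for (int i = 0; i < num_gpus; i++) {
        CUDA_FUNC_CALL(cudaSetDevice(i));
        CUDA_FUNC_CALL(cudaEventRecord(start_events[i], 0));
        dummy_kernel<<<(n + 255) / 256, 256>>>(buffers[i], n);
        CUDA_FUNC_CALL(cudaGetLastError());  // catch launch errors
        CUDA_FUNC_CALL(cudaEventRecord(end_events[i], 0));
    }

    // Wait for each end event, then read the per-device elapsed time.
    for (int i = 0; i < num_gpus; i++) {
        CUDA_FUNC_CALL(cudaSetDevice(i));
        CUDA_FUNC_CALL(cudaEventSynchronize(end_events[i]));
        float time_in_ms = 0.0f;
        CUDA_FUNC_CALL(cudaEventElapsedTime(&time_in_ms, start_events[i], end_events[i]));
        printf("Elapsed time on device %d: %f ms\n", i, time_in_ms);
    }

    // Release per-device resources.
    for (int i = 0; i < num_gpus; i++) {
        CUDA_FUNC_CALL(cudaSetDevice(i));
        CUDA_FUNC_CALL(cudaEventDestroy(start_events[i]));
        CUDA_FUNC_CALL(cudaEventDestroy(end_events[i]));
        CUDA_FUNC_CALL(cudaFree(buffers[i]));
    }
    return 0;
}
```

Each elapsed time is measured between two events recorded on the same device, so the values are per-GPU durations and say nothing about how the GPUs overlap with one another.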
