I ran nvprof.exe on a function that initializes the data, calls three kernels, and frees the data. Everything worked as expected, and this is the result I got:
==7956== Profiling application: .a.exe
==7956== Profiling result:
GPU activities: 52.34% 25.375us 1 25.375us 25.375us 25.375us th_single_row_add(float*, float*, float*)
43.57% 21.120us 1 21.120us 21.120us 21.120us th_single_col_add(float*, float*, float*)
4.09% 1.9840us 1 1.9840us 1.9840us 1.9840us th_single_elem_add(float*, float*, float*)
API calls: 86.77% 238.31ms 9 26.479ms 14.600us 210.39ms cudaMallocManaged
12.24% 33.621ms 1 33.621ms 33.621ms 33.621ms cuDevicePrimaryCtxRelease
0.27% 730.80us 3 243.60us 242.10us 245.60us cudaLaunchKernel
0.15% 406.90us 3 135.63us 65.400us 170.80us cudaDeviceSynchronize
0.08% 229.70us 97 2.3680us 100ns 112.10us cuDeviceGetAttribute
0.08% 206.60us 1 206.60us 206.60us 206.60us cuModuleUnload
0.01% 19.700us 1 19.700us 19.700us 19.700us cuDeviceTotalMem
0.00% 6.8000us 1 6.8000us 6.8000us 6.8000us cuDeviceGetPCIBusId
0.00% 1.9000us 2 950ns 400ns 1.5000us cuDeviceGet
0.00% 1.8000us 3 600ns 400ns 800ns cuDeviceGetCount
0.00% 700ns 1 700ns 700ns 700ns cuDeviceGetName
0.00% 200ns 1 200ns 200ns 200ns cuDeviceGetUuid
0.00% 200ns 1 200ns 200ns 200ns cuDeviceGetLuid
==7956== Unified Memory profiling result:
Device "GeForce RTX 2060 SUPER (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
18 20.000KB 8.0000KB 32.000KB 360.0000KB 300.7000us Host To Device
24 20.000KB 8.0000KB 32.000KB 480.0000KB 2.647400ms Device To Host
As you can see, there are three kernels under GPU activities. Here is the source code:
void add_elem(int n) {
    float *a, *b, *c1, *c2, *c3;
    cudaMallocManaged(&a, n * n * sizeof(float));
    cudaMallocManaged(&b, n * n * sizeof(float));
    cudaMallocManaged(&c1, n * n * sizeof(float));
    cudaMallocManaged(&c2, n * n * sizeof(float));
    cudaMallocManaged(&c3, n * n * sizeof(float));
    for (int i = 0; i < n*n; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
        c1[i] = 0.0f;
        c2[i] = 0.0f;
        c3[i] = 0.0f;
    }
    int blockSize = 256;
    int numBlocks = (n*n + blockSize - 1) / blockSize;
    th_single_elem_add<<<numBlocks, blockSize>>>(a, b, c1);
    th_single_row_add<<<numBlocks, blockSize>>>(a, b, c2);
    th_single_col_add<<<numBlocks, blockSize>>>(a, b, c3);
    cudaDeviceSynchronize();
    cudaFree(a);
    cudaFree(b);
    cudaFree(c1);
    cudaFree(c2);
    cudaFree(c3);
}
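The kernel definitions are not part of the listing above. For reference, here is a minimal sketch of what the element-wise kernel might look like in this first version; since the launch passes only the three pointers, the sketch assumes the total element count is available to the kernel through a hypothetical compile-time constant TOTAL:

#define TOTAL (1000 * 1000)  // hypothetical: n*n, matching the host-side allocation

__global__ void th_single_elem_add(float *a, float *b, float *c) {
    // One thread per element; the grid is rounded up, so guard against overshoot.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < TOTAL)
        c[i] = a[i] + b[i];
}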
After that, I extracted the data initialization, the kernel call, and the data freeing into separate host functions and ran nvprof again. This time I only got information about the API calls, like this:
==18460== Profiling application: .a.exe
==18460== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
API calls: 81.86% 158.78ms 9 17.643ms 1.4000us 158.76ms cudaMallocManaged
0.17% 322.80us 97 3.3270us 100ns 158.00us cuDeviceGetAttribute
0.11% 214.50us 1 214.50us 214.50us 214.50us cuModuleUnload
0.04% 68.600us 3 22.866us 7.3000us 39.400us cudaDeviceSynchronize
0.01% 12.100us 9 1.3440us 400ns 7.9000us cudaFree
0.00% 7.7000us 1 7.7000us 7.7000us 7.7000us cuDeviceGetPCIBusId
0.00% 2.1000us 3 700ns 300ns 1.0000us cuDeviceGetCount
0.00% 2.0000us 2 1.0000us 300ns 1.7000us cuDeviceGet
0.00% 1.2000us 3 400ns 300ns 500ns cudaLaunchKernel
0.00% 700ns 1 700ns 700ns 700ns cuDeviceGetName
0.00% 300ns 1 300ns 300ns 300ns cuDeviceGetUuid
0.00% 300ns 1 300ns 300ns 300ns cuDeviceGetLuid
As you can see, there is no Unified Memory profiling result section either, so I tried running nvprof as nvprof.exe --unified-memory-profiling off .a.exe, but got the same result. Source code:
void add_elem(int n) {
    float *a, *b, *c1;
    cudaMallocManaged(&a, n * n * sizeof(float));
    cudaMallocManaged(&b, n * n * sizeof(float));
    cudaMallocManaged(&c1, n * n * sizeof(float));
    for (int i = 0; i < n*n; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
        c1[i] = 0.0f;
    }
    int blockSize = 256;
    int numBlocks = (n*n + blockSize - 1) / blockSize;
    th_single_elem_add<<<numBlocks, blockSize>>>(a, b, c1);
    cudaDeviceSynchronize();
    cudaFree(a);
    cudaFree(b);
    cudaFree(c1);
}

void add_row(int n) {
    float *a, *b, *c1;
    cudaMallocManaged(&a, n * n * sizeof(float));
    cudaMallocManaged(&b, n * n * sizeof(float));
    cudaMallocManaged(&c1, n * n * sizeof(float));
    for (int i = 0; i < n*n; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
        c1[i] = 0.0f;
    }
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;
    th_single_row_add<<<numBlocks, blockSize>>>(a, b, c1, n);
    cudaDeviceSynchronize();
    cudaFree(a);
    cudaFree(b);
    cudaFree(c1);
}

void add_col(int n) {
    float *a, *b, *c1;
    cudaMallocManaged(&a, n * n * sizeof(float));
    cudaMallocManaged(&b, n * n * sizeof(float));
    cudaMallocManaged(&c1, n * n * sizeof(float));
    for (int i = 0; i < n*n; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
        c1[i] = 0.0f;
    }
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;
    th_single_col_add<<<numBlocks, blockSize>>>(a, b, c1, n);
    cudaDeviceSynchronize();
    cudaFree(a);
    cudaFree(b);
    cudaFree(c1);
}
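The row and column kernels are not shown either. Given that this version launches them with (a, b, c1, n) and sizes the grid as (n + blockSize - 1) / blockSize, a plausible sketch is one thread per row (or per column) looping over the other dimension, assuming row-major storage:

__global__ void th_single_row_add(float *a, float *b, float *c, int n) {
    // One thread per row; each thread walks across its row.
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n)
        for (int col = 0; col < n; col++) {
            int idx = row * n + col;  // row-major indexing assumed
            c[idx] = a[idx] + b[idx];
        }
}

__global__ void th_single_col_add(float *a, float *b, float *c, int n) {
    // One thread per column; each thread walks down its column.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < n)
        for (int row = 0; row < n; row++) {
            int idx = row * n + col;
            c[idx] = a[idx] + b[idx];
        }
}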
UPDATE: I found the problem. I was running the code with 10000000000 elements in the array, and it looks like the kernels were never even launched: with 10000000 (10^8) elements they took almost 3 seconds to finish, but with 10000000000 (10^10) they finished instantly, yet no error was reported.
How should I handle cases like this?
The reason was that the kernels were being launched with an unsupported <<<numBlocks, blockSize>>> configuration. After adding gpuErrchk( cudaPeekAtLastError() ); after each kernel call, I got GPUassert: invalid configuration argument, which means that my numBlocks or blockSize arguments are not supported by my GPU. Without the error check, the program simply exited quietly. As Robert Crovella suggested in the comments, here is the link for proper error handling:
proper CUDA error checking
Also, running cuda-memcheck helps.
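For reference, a minimal sketch of the kind of gpuErrchk macro used above, in the style of the linked answer (the exact definition there may differ slightly):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Error-checking macro: print the CUDA error string with file/line and abort.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true) {
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

// Usage after a launch: cudaPeekAtLastError() catches launch-configuration
// errors immediately; cudaDeviceSynchronize() surfaces errors that occur
// while the kernel is running.
// th_single_elem_add<<<numBlocks, blockSize>>>(a, b, c1);
// gpuErrchk(cudaPeekAtLastError());
// gpuErrchk(cudaDeviceSynchronize());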