I am learning CUDA, and at the moment I can only run a simple nvprof command.
I am testing a simple program with CUDA in both C++ and Fortran. The CUDA kernels try two different ways of performing a simple task, with the aim of showing why the branch divergence problem matters.
When I run nvprof --metrics branch_efficiency ./codeCpp.x (i.e., on the C++ code), the command works, but when I try the same thing on the corresponding Fortran code it does not. Curiously, a plain nvprof ./codeFortran.x on the Fortran executable does show output, but anything with the --metrics flag does not. Below I paste some information (note that both codes compile fine and produce no runtime errors). I am using Ubuntu 20.
Can someone help me understand this problem? Thanks.
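(For reference, the build commands are assumed here rather than taken from the original post: something like nvcc -o codeCpp.x codeCpp.cu for the C++ version and nvfortran -o codeFortran.x codeFortran.cuf for the CUDA Fortran version; exact file names and flags may differ.)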
========================= C++ code
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include "device_launch_parameters.h"
#include "cuda_common.cuh"
// kernel without divergence
__global__ void code_without_divergence(){
// compute unique grid index
int gid = blockIdx.x * blockDim.x + threadIdx.x;
// define some local variables
float a, b;
a = b = 0.0;
// compute the warp index
int warp_id = gid/32;
// conditional statement based on the warp id
if (warp_id % 2 == 0)
{
a = 150.0;
b = 50.0;
}
else
{
a = 200.0;
b = 75.0;
};
}
// kernel with divergence
__global__ void code_with_divergence(){
// compute unique grid index
int gid = blockIdx.x * blockDim.x + threadIdx.x;
// define some local variables
float a, b;
a = b = 0.0;
// conditional statement based on the gid. This will force difference
// code branches within the same warp.
if (gid % 2 == 0)
{
a = 150.0;
b = 50.0;
}
else
{
a = 200.0;
b = 75.0;
};
}
int main (int argc, char** argv){
// set the block size
int size = 1 << 22;
dim3 block_size(128);
dim3 grid_size((size + block_size.x-1)/block_size.x);
code_without_divergence <<< grid_size, block_size>>>();
cudaDeviceSynchronize();
code_with_divergence <<<grid_size, block_size>>>();
cudaDeviceSynchronize();
cudaDeviceReset();
return EXIT_SUCCESS;
};
==================== Fortran code
MODULE CUDAUtils
    USE cudafor
    IMPLICIT NONE

CONTAINS

    ! code without divergence routine
    ATTRIBUTES(GLOBAL) SUBROUTINE code_without_divergence()
        IMPLICIT NONE

        !> local variables
        INTEGER :: threadId, warpIdx
        REAL(KIND=8) :: a, b

        ! get the unique threadId
        threadId = (blockIdx%y-1) * gridDim%x * blockDim%x + &
                   (blockIdx%x-1) * blockDim%x + (threadIdx%x-1)

        ! adjust so that the threadId starts from 1
        threadId = threadId + 1

        ! warp index
        warpIdx = threadIdx%x / 32

        ! perform the conditional statement
        IF (MOD(warpIdx, 2) == 0) THEN
            a = 150.0D0
            b = 50.0D0
        ELSE
            a = 200.0D0
            b = 75.0D0
        END IF
    END SUBROUTINE code_without_divergence

    ! code with divergence routine
    ATTRIBUTES(GLOBAL) SUBROUTINE code_with_divergence()
        IMPLICIT NONE

        !> local variables
        INTEGER :: threadId
        REAL(KIND=8) :: a, b

        ! get the unique threadId
        threadId = (blockIdx%y-1) * gridDim%x * blockDim%x + &
                   (blockIdx%x-1) * blockDim%x + (threadIdx%x-1)

        ! adjust so that the threadId starts from 1
        threadId = threadId + 1

        ! perform the conditional statement
        IF (MOD(threadId, 2) == 0) THEN
            a = 150.0D0
            b = 50.0D0
        ELSE
            a = 200.0D0
            b = 75.0D0
        END IF
    END SUBROUTINE code_with_divergence

END MODULE CUDAUtils

PROGRAM main
    USE CUDAUtils
    IMPLICIT NONE

    ! define the variables
    INTEGER :: size1 = 1e20
    INTEGER :: istat
    TYPE(DIM3) :: grid, tBlock

    ! block size is 128 along the 1st dimension only; the grid is 1D
    tBlock = DIM3(128,1,1)
    grid = DIM3((size1 + tBlock%x)/tBlock%x, 1, 1)

    ! launch the kernel without divergence
    CALL code_without_divergence<<<grid,tBlock>>>()
    istat = cudaDeviceSynchronize()

    ! launch the kernel with divergence
    CALL code_with_divergence<<<grid,tBlock>>>()
    istat = cudaDeviceSynchronize()

    STOP
END PROGRAM main
Output of nvprof --metrics branch_efficiency ./codeCpp.x:
==6944== NVPROF is profiling process 6944, command: ./codeCpp.x
==6944== Profiling application: ./codeCpp.x
==6944== Profiling result:
==6944== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "NVIDIA GeForce MX330 (0)"
Kernel: code_without_divergence(void)
1 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
Kernel: code_with_divergence(void)
1 branch_efficiency Branch Efficiency 85.71% 85.71% 85.71%
Output of nvprof --metrics branch_efficiency ./codeFortran.x:
==6983== NVPROF is profiling process 6983, command: ./codeFortran.x
==6983== Profiling application: ./codeFortran.x
==6983== Profiling result:
No events/metrics were profiled.
Output of nvprof ./codeFortran.x:
==7002== NVPROF is profiling process 7002, command: ./codeFortran.x
==7002== Profiling application: ./codeFortran.x
==7002== Profiling result:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 99.82% 153.45ms 2 76.726ms 516ns 153.45ms cudaLaunchKernel
0.15% 231.24us 101 2.2890us 95ns 172.81us cuDeviceGetAttribute
0.01% 22.522us 1 22.522us 22.522us 22.522us cuDeviceGetName
0.01% 9.1310us 1 9.1310us 9.1310us 9.1310us cuDeviceGetPCIBusId
0.00% 5.4500us 2 2.7250us 876ns 4.5740us cudaDeviceSynchronize
0.00% 1.3480us 3 449ns 195ns 903ns cuDeviceGetCount
0.00% 611ns 1 611ns 611ns 611ns cuModuleGetLoadingMode
0.00% 520ns 2 260ns 117ns 403ns cuDeviceGet
0.00% 245ns 1 245ns 245ns 245ns cuDeviceTotalMem
0.00% 187ns 1 187ns 187ns 187ns cuDeviceGetUuid
Both the C++ and Fortran executables test the same CUDA concept. Both compile fine and show no runtime errors on the terminal when executed. When I try the nvprof command on the C++ program, everything works as expected, but when I try it on the corresponding Fortran program there is no output (with the --metrics flag). I would expect to get the same behavior with the Fortran code as with the C++ code.
In some other discussions I found that nvprof is no longer supported for GPU architectures from version 7 onward and that NVIDIA Nsight should be used instead, but I don't think that is the issue here, since I do get the expected output with the C++ program.
The reason the code is not being profiled as expected is that, in this case, the kernels are not actually running correctly. The likely culprit is the declaration INTEGER :: size1 = 1e20 in the host program: 1e20 is far too large to be represented in a default (4-byte) INTEGER, so the computed grid dimension is invalid and both kernel launches fail without ever executing, which is why nvprof reports "No kernels were profiled".
Before attempting any profiling, it is always good practice to make sure the code runs without runtime errors. Proper CUDA error checking and compute-sanitizer are two ways to help accomplish this.
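As a minimal sketch of what that error checking could look like in the host program (an illustration assuming the nvfortran/cudafor toolchain, not code from the original post), each launch can be followed by a cudaGetLastError check and a checked cudaDeviceSynchronize:

PROGRAM main
    USE cudafor
    USE CUDAUtils
    IMPLICIT NONE

    INTEGER :: size1 = 2**22          ! kept representable in a default INTEGER
    INTEGER :: istat
    TYPE(DIM3) :: grid, tBlock

    tBlock = DIM3(128,1,1)
    grid = DIM3((size1 + tBlock%x - 1)/tBlock%x, 1, 1)

    CALL code_without_divergence<<<grid,tBlock>>>()
    istat = cudaGetLastError()        ! catches an invalid launch configuration
    IF (istat /= cudaSuccess) THEN
        WRITE(*,*) 'launch failed: ', cudaGetErrorString(istat)
        STOP 1
    END IF
    istat = cudaDeviceSynchronize()   ! catches errors raised while the kernel runs
    IF (istat /= cudaSuccess) THEN
        WRITE(*,*) 'kernel failed: ', cudaGetErrorString(istat)
        STOP 1
    END IF

    ! (repeat the same pattern for code_with_divergence)
END PROGRAM main

With the original size1 = 1e20, the first check should immediately report an invalid launch configuration. Running the unmodified executable under compute-sanitizer (compute-sanitizer ./codeFortran.x) should surface the same failure before any profiling is attempted.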