I am learning CUDA, and at the moment I can only run a simple nvprof command.
I am testing a simple program with CUDA in both C++ and Fortran. The CUDA kernels try two different ways of performing a simple task, with the aim of showing why the branch divergence problem matters.
When I run nvprof --metrics branch_efficiency ./codeCpp.x (i.e., on the C++ code), the command works, but when I try the same thing on the corresponding Fortran code it does not. Curiously, a plain nvprof ./codeFortran.x on the Fortran executable does show output, but anything with the --metrics flag does not. Below I paste some information (note that both codes compile fine and produce no runtime errors). I am using Ubuntu 20.
Can someone help me understand this problem? Thanks.
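(For reference, the build commands are assumed here rather than taken from the original post: something like nvcc -o codeCpp.x codeCpp.cu for the C++ version and nvfortran -o codeFortran.x codeFortran.cuf for the CUDA Fortran version; exact file names and flags may differ.)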
========================= C++ code
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include "device_launch_parameters.h"
#include "cuda_common.cuh"
// kernel without divergence
__global__ void code_without_divergence(){
// compute unique grid index
int gid = blockIdx.x * blockDim.x + threadIdx.x;
// define some local variables
float a, b;
a = b = 0.0;
// compute the warp index
int warp_id = gid/32;
// conditional statement based on the warp id
if (warp_id % 2 == 0)
{
a = 150.0;
b = 50.0;
}
else
{
a = 200.0;
b = 75.0;
};
}
// kernel with divergence
__global__ void code_with_divergence(){
// compute unique grid index
int gid = blockIdx.x * blockDim.x + threadIdx.x;
// define some local variables
float a, b;
a = b = 0.0;
// conditional statement based on the gid. This will force difference
// code branches within the same warp.
if (gid % 2 == 0)
{
a = 150.0;
b = 50.0;
}
else
{
a = 200.0;
b = 75.0;
};
}
int main (int argc, char** argv){
// set the block size
int size = 1 << 22;
dim3 block_size(128);
dim3 grid_size((size + block_size.x-1)/block_size.x);
code_without_divergence <<< grid_size, block_size>>>();
cudaDeviceSynchronize();
code_with_divergence <<<grid_size, block_size>>>();
cudaDeviceSynchronize();
cudaDeviceReset();
return EXIT_SUCCESS;
};
==================== Fortran code
MODULE CUDAUtils
    USE cudafor
    IMPLICIT NONE

CONTAINS

    ! code without divergence routine
    ATTRIBUTES(GLOBAL) SUBROUTINE code_without_divergence()
        IMPLICIT NONE

        !> local variables
        INTEGER :: threadId, warpIdx
        REAL(KIND=8) :: a, b

        ! get the unique threadId
        threadId = (blockIdx%y-1) * gridDim%x * blockDim%x + &
                   (blockIdx%x-1) * blockDim%x + (threadIdx%x-1)

        ! adjust so that the threadId starts from 1
        threadId = threadId + 1

        ! warp index
        warpIdx = threadIdx%x / 32

        ! perform the conditional statement
        IF (MOD(warpIdx, 2) == 0) THEN
            a = 150.0D0
            b = 50.0D0
        ELSE
            a = 200.0D0
            b = 75.0D0
        END IF
    END SUBROUTINE code_without_divergence

    ! code with divergence routine
    ATTRIBUTES(GLOBAL) SUBROUTINE code_with_divergence()
        IMPLICIT NONE

        !> local variables
        INTEGER :: threadId
        REAL(KIND=8) :: a, b

        ! get the unique threadId
        threadId = (blockIdx%y-1) * gridDim%x * blockDim%x + &
                   (blockIdx%x-1) * blockDim%x + (threadIdx%x-1)

        ! adjust so that the threadId starts from 1
        threadId = threadId + 1

        ! perform the conditional statement
        IF (MOD(threadId, 2) == 0) THEN
            a = 150.0D0
            b = 50.0D0
        ELSE
            a = 200.0D0
            b = 75.0D0
        END IF
    END SUBROUTINE code_with_divergence

END MODULE CUDAUtils

PROGRAM main
    USE CUDAUtils
    IMPLICIT NONE

    ! define the variables
    INTEGER :: size1 = 1e20
    INTEGER :: istat
    TYPE(DIM3) :: grid, tBlock

    ! block size is 128 along the 1st dimension only; the grid is 1D
    tBlock = DIM3(128,1,1)
    grid = DIM3((size1 + tBlock%x)/tBlock%x, 1, 1)

    ! launch the kernel without divergence
    CALL code_without_divergence<<<grid,tBlock>>>()
    istat = cudaDeviceSynchronize()

    ! launch the kernel with divergence
    CALL code_with_divergence<<<grid,tBlock>>>()
    istat = cudaDeviceSynchronize()

    STOP
END PROGRAM main
Output of nvprof --metrics branch_efficiency ./codeCpp.x:
==6944== NVPROF is profiling process 6944, command: ./codeCpp.x
==6944== Profiling application: ./codeCpp.x
==6944== Profiling result:
==6944== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "NVIDIA GeForce MX330 (0)"
Kernel: code_without_divergence(void)
1 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
Kernel: code_with_divergence(void)
1 branch_efficiency Branch Efficiency 85.71% 85.71% 85.71%
Output of nvprof --metrics branch_efficiency ./codeFortran.x:
==6983== NVPROF is profiling process 6983, command: ./codeFortran.x
==6983== Profiling application: ./codeFortran.x
==6983== Profiling result:
No events/metrics were profiled.
Output of nvprof ./codeFortran.x:
==7002== NVPROF is profiling process 7002, command: ./codeFortran.x
==7002== Profiling application: ./codeFortran.x
==7002== Profiling result:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 99.82% 153.45ms 2 76.726ms 516ns 153.45ms cudaLaunchKernel
0.15% 231.24us 101 2.2890us 95ns 172.81us cuDeviceGetAttribute
0.01% 22.522us 1 22.522us 22.522us 22.522us cuDeviceGetName
0.01% 9.1310us 1 9.1310us 9.1310us 9.1310us cuDeviceGetPCIBusId
0.00% 5.4500us 2 2.7250us 876ns 4.5740us cudaDeviceSynchronize
0.00% 1.3480us 3 449ns 195ns 903ns cuDeviceGetCount
0.00% 611ns 1 611ns 611ns 611ns cuModuleGetLoadingMode
0.00% 520ns 2 260ns 117ns 403ns cuDeviceGet
0.00% 245ns 1 245ns 245ns 245ns cuDeviceTotalMem
0.00% 187ns 1 187ns 187ns 187ns cuDeviceGetUuid
Both the C++ and Fortran executables test the same CUDA concept. Both compile fine and show no runtime errors on the terminal when executed. When I try the nvprof command on the C++ program, everything works as expected, but when I try it on the corresponding Fortran program there is no output (with the --metrics flag). I would expect to get the same behavior with the Fortran code as with the C++ code.
In some other discussions I found that nvprof is no longer supported for GPU architectures from version 7 onward and that NVIDIA Nsight should be used instead, but I don't think that is the issue here, since I do get the expected output with the C++ program.
The reason the code is not being profiled as expected is that, in this case, the kernels are not actually running correctly. The likely culprit is the declaration INTEGER :: size1 = 1e20 in the host program: 1e20 is far too large to be represented in a default (4-byte) INTEGER, so the computed grid dimension is invalid and both kernel launches fail without ever executing, which is why nvprof reports "No kernels were profiled".
Before attempting any profiling, it is always good practice to make sure the code runs without runtime errors. Proper CUDA error checking and compute-sanitizer are two ways to help accomplish this.
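As a minimal sketch of what that error checking could look like in the host program (an illustration assuming the nvfortran/cudafor toolchain, not code from the original post), each launch can be followed by a cudaGetLastError check and a checked cudaDeviceSynchronize:

PROGRAM main
    USE cudafor
    USE CUDAUtils
    IMPLICIT NONE

    INTEGER :: size1 = 2**22          ! kept representable in a default INTEGER
    INTEGER :: istat
    TYPE(DIM3) :: grid, tBlock

    tBlock = DIM3(128,1,1)
    grid = DIM3((size1 + tBlock%x - 1)/tBlock%x, 1, 1)

    CALL code_without_divergence<<<grid,tBlock>>>()
    istat = cudaGetLastError()        ! catches an invalid launch configuration
    IF (istat /= cudaSuccess) THEN
        WRITE(*,*) 'launch failed: ', cudaGetErrorString(istat)
        STOP 1
    END IF
    istat = cudaDeviceSynchronize()   ! catches errors raised while the kernel runs
    IF (istat /= cudaSuccess) THEN
        WRITE(*,*) 'kernel failed: ', cudaGetErrorString(istat)
        STOP 1
    END IF

    ! (repeat the same pattern for code_with_divergence)
END PROGRAM main

With the original size1 = 1e20, the first check should immediately report an invalid launch configuration. Running the unmodified executable under compute-sanitizer (compute-sanitizer ./codeFortran.x) should surface the same failure before any profiling is attempted.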