Why does nvprof have no metric for floating-point division operations?



I used nvprof to measure the floating-point operations of my sample kernels. There seems to be no flop_count_dp_div metric, and the actual double-precision division operations show up as double-precision add/mul/fma counts and even some single-precision fma counts.

I would like to know why that is, and how I can infer the dynamic number of division operations of a kernel from the nvprof report if I don't have the source code.

My simple test kernels:

#include <iostream>

__global__ void mul(double a, double* x, double* y) {
  y[threadIdx.x] = a * x[threadIdx.x];
}

__global__ void div(double a, double* x, double* y) {
  y[threadIdx.x] = a / x[threadIdx.x];
}

int main(int argc, char* argv[]) {
  const int kDataLen = 4;

  double a = 2.0f;
  double host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
  double host_y[kDataLen];

  // Copy input data to device.
  double* device_x;
  double* device_y;
  cudaMalloc(&device_x, kDataLen * sizeof(double));
  cudaMalloc(&device_y, kDataLen * sizeof(double));
  cudaMemcpy(device_x, host_x, kDataLen * sizeof(double),
             cudaMemcpyHostToDevice);

  // Launch the kernels.
  mul<<<1, kDataLen>>>(a, device_x, device_y);
  div<<<1, kDataLen>>>(a, device_x, device_y);

  // Copy output data to host.
  cudaDeviceSynchronize();
  cudaMemcpy(host_y, device_y, kDataLen * sizeof(double),
             cudaMemcpyDeviceToHost);

  // Print the results.
  for (int i = 0; i < kDataLen; ++i) {
    std::cout << "y[" << i << "] = " << host_y[i] << "\n";
  }

  cudaDeviceReset();
  return 0;
}

nvprof output for the two kernels:

nvprof --metrics flop_count_sp          \
       --metrics flop_count_sp_add      \
       --metrics flop_count_sp_mul      \
       --metrics flop_count_sp_fma      \
       --metrics flop_count_sp_special  \
       --metrics flop_count_dp          \
       --metrics flop_count_dp_add      \
       --metrics flop_count_dp_mul      \
       --metrics flop_count_dp_fma      \
       ./a.out
==14380== NVPROF is profiling process 14380, command: ./a.out
==14380== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "mul(double, double*, double*)" (done)
Replaying kernel "div(double, double*, double*)" (done)
y[0] = 2
y[1] = 1
y[2] = 0.666667
y[3] = 0.5
==14380== Profiling application: ./a.out
==14380== Profiling result:
==14380== Metric result:
Invocations                               Metric Name                                    Metric Description         Min         Max         Avg
Device "GeForce GTX 1080 Ti (0)"
Kernel: mul(double, double*, double*)
1                             flop_count_sp           Floating Point Operations(Single Precision)           0           0           0
1                         flop_count_sp_add       Floating Point Operations(Single Precision Add)           0           0           0
1                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)           0           0           0
1                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)           0           0           0
1                     flop_count_sp_special   Floating Point Operations(Single Precision Special)           0           0           0
1                             flop_count_dp           Floating Point Operations(Double Precision)           4           4           4
1                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
1                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           4           4           4
1                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
Kernel: div(double, double*, double*)
1                             flop_count_sp           Floating Point Operations(Single Precision)           8           8           8
1                         flop_count_sp_add       Floating Point Operations(Single Precision Add)           0           0           0
1                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)           0           0           0
1                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)           4           4           4
1                     flop_count_sp_special   Floating Point Operations(Single Precision Special)           4           4           4
1                             flop_count_dp           Floating Point Operations(Double Precision)          44          44          44
1                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
1                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           4           4           4
1                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)          20          20          20

"It seems that there is no flop_count_dp_div metric"

That is because there is no floating-point division instruction in the CUDA hardware.

"and the actual double-precision division operations show up as double-precision add/mul/fma counts and even some single-precision fma counts"

That is because floating-point division is implemented with a Newton-Raphson iterative method, which uses multiply-add and multiply operations, possibly in mixed precision (hence the single-precision operations). A sketch of the idea follows below.
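
To illustrate the idea only (this is not the exact instruction sequence nvcc emits, and divide_newton_raphson is a made-up helper name): an approximate reciprocal is refined using nothing but multiplies and fused multiply-adds, and the quotient is then one more multiply. The low-precision float seed below stands in for the GPU's reciprocal-approximation instruction.

#include <cmath>
#include <cstdio>

// Sketch: compute a / b with only mul and fma by Newton-Raphson refinement of
// an approximate reciprocal x ~= 1/b:  x_{n+1} = x_n + x_n * (1 - b * x_n).
double divide_newton_raphson(double a, double b) {
  // Low-precision seed; stands in for the hardware's reciprocal approximation.
  double x = 1.0f / static_cast<float>(b);
  for (int i = 0; i < 3; ++i) {
    double e = std::fma(-b, x, 1.0);  // e = 1 - b*x   (one FMA)
    x = std::fma(x, e, x);            // x = x + x*e   (one FMA)
  }
  return a * x;                       // final multiply gives a/b
}

int main() {
  // Prints roughly 0.6666666666666666.
  std::printf("%.17g\n", divide_newton_raphson(2.0, 3.0));
  return 0;
}

Each iteration roughly doubles the number of correct bits, which is why the emulation sequence is dominated by FMAs rather than any dedicated divide instruction.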

"How can I infer the dynamic number of division operations of a kernel from the nvprof report if I don't have the source code?"

You really can't.
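
At best you can check that the composite counters are self-consistent: nvprof's metrics reference documents flop_count_dp as counting each add and mul once and each FMA twice, which matches the report above. A trivial check, with the values hard-coded from the div kernel's rows (my own sanity check, not something nvprof reports):

#include <cassert>
#include <cstdio>

int main() {
  // Counts reported above for the div kernel (4 threads, one division each).
  const int dp_add = 0, dp_mul = 4, dp_fma = 20;
  const int sp_fma = 4, sp_special = 4;

  // flop_count_dp = add + mul + 2*fma (FMA counted as two operations).
  const int flop_count_dp = dp_add + dp_mul + 2 * dp_fma;
  assert(flop_count_dp == 44);  // matches the reported value

  // Here we happen to know each thread does exactly one division, so the
  // per-division cost can be read off by dividing by the thread count.
  std::printf("per division: %d DMUL, %d DFMA, %d SP FMA, %d SP special\n",
              dp_mul / 4, dp_fma / 4, sp_fma / 4, sp_special / 4);
  return 0;
}

Without the source, though, the same mix of add/mul/fma counts could just as well come from ordinary arithmetic, which is why the report alone cannot tell you how many divisions were executed.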
