Matrix multiplication benchmark on Titan RTX, double vs. single precision



I am trying to understand the performance difference between single and double precision on a GPU workstation.

Our workstation has two TITAN RTX GPUs, but I am running the benchmark on a single TITAN RTX. I am measuring performance with a cuBLAS matrix-matrix multiplication: I multiply two 8192x8192 matrices filled with random floats or doubles. To make sure I am not making a mistake, I also repeated the test in Python with the cupy library, and the results are very similar.

The test gives roughly 75 ms per multiplication for float and roughly 2000 ms for double.

If I had an older GPU this would make perfect sense, since 75*32 = 2400 ≈ 2000, i.e. my double-precision performance would be 32x worse, as expected from the table at https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions.

However, my GPU has compute capability 7.5, so I expected only a 2x slowdown for double precision, i.e. roughly 150 ms per multiplication rather than 2000 ms.

Other information: Ubuntu 18.04 LTS, nvcc 10.2, driver 440.82.

Here is the CUDA code:

#include <iostream>
#include <chrono>
#include <string>
#include <cuda_runtime.h>
#include "cublas_v2.h"
#include <math.h>
#include <stdio.h>
#include <cuda.h>
#include <device_functions.h>
#include <sstream>
#include <time.h>
unsigned long mix(unsigned long a, unsigned long b, unsigned long c)
{
a=a-b;  a=a-c;  a=a^(c >> 13);
b=b-c;  b=b-a;  b=b^(a << 8);
c=c-a;  c=c-b;  c=c^(b >> 13);
a=a-b;  a=a-c;  a=a^(c >> 12);
b=b-c;  b=b-a;  b=b^(a << 16);
c=c-a;  c=c-b;  c=c^(b >> 5);
a=a-b;  a=a-c;  a=a^(c >> 3);
b=b-c;  b=b-a;  b=b^(a << 10);
c=c-a;  c=c-b;  c=c^(b >> 15);
return c;
}

using namespace std;
int main()
{
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    cudaDeviceProp deviceProp;
    cublasStatus_t err;
    cudaGetDeviceProperties(&deviceProp, 0);
    printf("Detected %d devices\n", deviceCount);
    printf("Device %d has compute capability %d.%d:\n\t maxshmem %d.\n\t maxthreads per block %d.\n\t max threads dim %d. %d. %d.\n", 0,
        deviceProp.major, deviceProp.minor, deviceProp.sharedMemPerBlock, deviceProp.maxThreadsPerBlock, deviceProp.maxThreadsDim[0],
        deviceProp.maxThreadsDim[1], deviceProp.maxThreadsDim[2]);
    cudaEvent_t start_d, stop_d;
    cudaEventCreate(&start_d);
    cudaEventCreate(&stop_d);

    // RNG initialization
    unsigned long seed = mix(clock(), time(NULL), 0);
    srand(seed);

    // Fill two NxN host matrices with uniform random numbers
    int N = 8192;
    int Nloops = 2;
    int memsize = N * N * sizeof(double);
    double *a = (double *)malloc(memsize);
    double *b = (double *)malloc(memsize);
    double *c = (double *)malloc(memsize);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i*N+j] = ((double)rand() / RAND_MAX);
            b[i*N+j] = ((double)rand() / RAND_MAX);
        }

    // Copy the inputs to the device
    double *a_d, *b_d, *c_d;
    cudaMalloc((void **)&a_d, memsize);
    cudaMalloc((void **)&b_d, memsize);
    cudaMalloc((void **)&c_d, memsize);
    cudaMemcpy(a_d, a, memsize, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b, memsize, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    double alpha = 1.0;
    double beta = 0.0;

    // Time Nloops back-to-back DGEMMs with CUDA events, clock() and steady_clock
    auto start = chrono::steady_clock::now();
    clock_t start1;
    start1 = clock();
    cudaEventRecord(start_d);
    cudaError_t cuerr = cudaGetLastError();
    if (cuerr != cudaSuccess)
        printf("%s\n", cudaGetErrorString(cuerr));
    for (int i = 0; i < Nloops; i++)
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N, &alpha, a_d, N, b_d, N, &beta, c_d, N);
    cudaEventRecord(stop_d);
    cudaDeviceSynchronize();
    auto end = chrono::steady_clock::now();
    start1 = clock() - start1;
    cudaEventSynchronize(stop_d);
    cublasDestroy(handle);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start_d, stop_d);
    std::cout << "Cuda event " << milliseconds / Nloops << " ms" << endl;
    std::cout << " time elapsed " << start1 / (double)CLOCKS_PER_SEC / Nloops << '\n';
    cout << "time elapsed for 1 multiplication: " << ((double)chrono::duration_cast<chrono::microseconds>(end-start).count()) / (Nloops*1000.0) << " milliseconds" << endl;

    free(a); free(b); free(c);
    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
}
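
The float run differs only in the element type and in calling cublasSgemm instead of cublasDgemm. For completeness, a stripped-down sketch of that single-precision variant (not the exact code I ran, but the same measurement idea) would look like this:

// Stripped-down single-precision variant: same shapes, cublasSgemm instead of cublasDgemm.
#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>
#include "cublas_v2.h"

int main()
{
    const int N = 8192, Nloops = 2;
    const size_t memsize = (size_t)N * N * sizeof(float);

    // Random single-precision input matrices on the host
    float *a = (float *)malloc(memsize);
    float *b = (float *)malloc(memsize);
    for (size_t i = 0; i < (size_t)N * N; i++) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
    }

    // Device buffers
    float *a_d, *b_d, *c_d;
    cudaMalloc((void **)&a_d, memsize);
    cudaMalloc((void **)&b_d, memsize);
    cudaMalloc((void **)&c_d, memsize);
    cudaMemcpy(a_d, a, memsize, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b, memsize, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    float alpha = 1.0f, beta = 0.0f;

    // Time Nloops SGEMMs with CUDA events
    cudaEvent_t start_d, stop_d;
    cudaEventCreate(&start_d);
    cudaEventCreate(&stop_d);
    cudaEventRecord(start_d);
    for (int i = 0; i < Nloops; i++)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N, &alpha, a_d, N, b_d, N, &beta, c_d, N);
    cudaEventRecord(stop_d);
    cudaEventSynchronize(stop_d);

    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start_d, stop_d);
    std::cout << "Cuda event " << milliseconds / Nloops << " ms" << std::endl;

    cublasDestroy(handle);
    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    free(a); free(b);
}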

And this is the Python code that produces consistent results:

import cupy as cp
import time

iterations = 2
a = cp.random.rand(8192, 8192).astype(cp.float64)
b = cp.random.rand(8192, 8192).astype(cp.float64)

def ab(a, b, iterations):
    for i in range(iterations):
        cp.matmul(a, b, out=None)

ab(a, b, 1)  # warm up
cp.cuda.Device(0).synchronize()
t1 = time.time()
ab(a, b, iterations)
cp.cuda.Device(0).synchronize()
t2 = time.time()
total = (t2 - t1) / iterations
print(total)

OK, I found the answer. The table I linked in the question has a footnote saying that for compute capability 7.5 (which is the case here) the double-precision throughput is 2 results per clock cycle per SM, not 32, while for float it is 64, which means that double-precision multiply-add is 32 times slower than float.

If both the float and the double problem were purely arithmetic-bound, I would expect a slowdown of about 32x. In practice the slowdown is slightly smaller (2000/75 ≈ 27), which may be because the float problem is bandwidth-limited, or it may be due to something else.
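
As a sanity check, here is a rough back-of-the-envelope comparison of the measured times against peak throughput. The assumed specs (4608 FP32 cores at roughly a 1.77 GHz boost clock) are the nominal TITAN RTX numbers, not something measured here:

// Rough sanity check: achieved vs. peak throughput, using assumed nominal TITAN RTX specs.
#include <cstdio>

int main()
{
    const double N = 8192.0;
    const double flop = 2.0 * N * N * N;          // one NxN GEMM costs about 2*N^3 FLOP
    const double fp32_peak = 4608 * 2 * 1.77e9;   // ~16.3 TFLOP/s (assumed boost clock)
    const double fp64_peak = fp32_peak / 32.0;    // ~0.51 TFLOP/s on compute capability 7.5

    // Measured times from above: ~75 ms per float GEMM, ~2000 ms per double GEMM.
    printf("FP32: %.1f TFLOP/s achieved vs %.1f peak\n", flop / 0.075 / 1e12, fp32_peak / 1e12);
    printf("FP64: %.2f TFLOP/s achieved vs %.2f peak\n", flop / 2.0 / 1e12, fp64_peak / 1e12);
    // With these (approximate) timings the double run sits essentially at its peak while
    // the float run reaches roughly 90% of its peak, which is consistent with observing
    // a ratio of ~27x instead of the full 32x.
}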
