OpenACC CPU vs GPU optimization



I'm new to OpenACC and I'm trying to optimize some code. For the CPU I have:

Time = Time + omp_get_wtime();
{
    #pragma acc parallel loop
    for (int i = 1; i < k-1; i++)
    {
        jcount[i] = ((int)(MLT[i]/dt)) + 1;
    }
    jcount[0] = 0;
    jcount[k-1] = N;
    #pragma acc parallel loop collapse(2)
    for (int i = 0; i < k - 1; i++)
    {
        for (int j = jcount[i]; j < jcount[i+1]; j++)
        {
            w[j] = (j*dt - MLT[i])/(MLT[i+1]-MLT[i]);
            X[j] = MLX[i]*(1-w[j]) + MLX[i+1]*w[j];
            Y[j] = MLY[i]*(1-w[j]) + MLY[i+1]*w[j];
        }
    }
}
Time = omp_get_wtime() - Time;

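On my 6-core Intel i7 (with Hyper-Threading turned off) the parallel scaling is poor: the difference between 6 cores and 1 core is only 30% (which suggests that 70% of the code runs sequentially, but I can't see where).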

For the GPU:

...
acc_init( acc_device_nvidia );
...
TimeGPU = TimeGPU + omp_get_wtime();
{
    #pragma acc kernels loop independent copyout(jcount[0:k]) copyin(MLT[0:k],dt)
    for (int i = 1; i < k-1; i++)
    {
        jcount[i] = ((int)(MLT[i]/dt)) + 1;
    }
    jcount[0] = 0;
    jcount[k-1] = N;
    #pragma acc kernels loop independent copyout(X[0:N+1],Y[0:N+1]) copyin(MLT[0:k],MLX[0:k],MLY[0:k],dt) copy(w[0:N])
    for (int i = 0; i < k - 1; i++)
    {
        for (int j = jcount[i]; j < jcount[i+1]; j++)
        {
            w[j] = (j*dt - MLT[i])/(MLT[i+1]-MLT[i]);
            X[j] = MLX[i]*(1-w[j]) + MLX[i+1]*w[j];
            Y[j] = MLY[i]*(1-w[j]) + MLY[i+1]*w[j];
        }
    }
}
TimeGPU = omp_get_wtime() - TimeGPU;

The GPU (GTX 1070) is 3 times slower than the 6-core CPU!

Launch parameters:
GPU: pgc++ -ta=tesla:cuda9.0 -Minfo=accel -O4
CPU: pgc++ -ta=multicore -Minfo=accel -O4

k = 20000, N = 2 million

Update:

Changed GPU code:

TimeGPU = TimeGPU + omp_get_wtime();
#pragma acc data create(jcount[0:k],w[0:N]) copyout(X[0:N+1],Y[0:N+1]) copyin(MLT[0:k],MLX[0:k],MLY[0:k],dt)
{
    #pragma acc parallel loop
    for (int i = 1; i < k-1; i++)
    {
        jcount[i] = ((int)(MLT[i]/dt)) + 1;
    }
    jcount[0] = 0;
    jcount[k-1] = N;
    #pragma acc parallel loop
    for (int i = 0; i < k - 1; i++)
    {
        for (int j = jcount[i]; j < jcount[i+1]; j++)
        {
            w[j] = (j*dt - MLT[i])/(MLT[i+1]-MLT[i]);
            X[j] = MLX[i]*(1-w[j]) + MLX[i+1]*w[j];
            Y[j] = MLY[i]*(1-w[j]) + MLY[i+1]*w[j];
        }
    }
}
TimeGPU = omp_get_wtime() - TimeGPU;
Launch parameters:
pgc++ -ta=tesla:managed:cuda9.0 -Minfo=accel -O4

Now the GPU is 2 times slower than the CPU.

Output:

139: compute region reached 1 time
139: kernel launched 1 time
grid: [157]  block: [128]
device time(us): total=425 max=425 min=425 avg=425
elapsed time(us): total=509 max=509 min=509 avg=509
139: data region reached 2 times
139: data copyin transfers: 1
device time(us): total=13 max=13 min=13 avg=13
146: compute region reached 1 time
146: kernel launched 1 time
grid: [157]  block: [128]
device time(us): total=13,173 max=13,173 min=13,173 avg=13,173
elapsed time(us): total=13,212 max=13,212 min=13,212 avg=13,212

Why is TimeGPU about 2 times larger than the time reported in this output when I use PGI_ACC_TIME=1? (30 ms vs 14 ms)

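I think much of the GPU time is due to poor memory access in the kernel. Ideally, you want the vector threads to access contiguous data.

How many iterations does the "j" loop have? If it is longer than 32, you could try adding "#pragma acc loop vector" to it, so that it is parallelized across the vector lanes and gives you better data access.

Also, you have a lot of redundant memory fetches. Consider loading the values from the arrays indexed by "i" into temporary variables, so that each value is fetched from memory only once.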
