I have narrowed down the segmentation fault my code throws to this line:
#pragma acc data copyout(result_mat[0:MAT1_X][0:MAT2_Y]), copyin(mat1[0:MAT1_X][0:MAT1_Y],mat2[0:MAT2_X][0:MAT2_Y])
in the following code:
// https://github.com/wrembish/MatMul_Parallel.git
#include <iostream>
#include <omp.h>
#include <cstdlib>
#include <ctime>
#include <chrono>
using namespace std;
using namespace std::chrono;
// constant variables for the desired size of matrix 1
const size_t MAT1_X = 835;
const size_t MAT1_Y = 835;
// constant variables for the desired size of matrix 2
const size_t MAT2_X = 835;
const size_t MAT2_Y = 835;
int main()
{
    // take start time of whole program
    auto prog_start = high_resolution_clock::now();
    // seed rand for randomly filling the matrices
    srand(time(NULL));
    // define the matrices to the variables mat1 and mat2
    int mat1[MAT1_X][MAT1_Y];
    int mat2[MAT2_X][MAT2_Y];
    // define the result matrix
    int result_mat[MAT1_X][MAT2_Y];
    // zero result matrix
    #pragma acc loop
    for(int unsigned i = 0; i < MAT1_X; i++)
    {
        for(int unsigned j = 0; j < MAT2_Y; j++)
        {
            result_mat[i][j] = 0;
        }
    }
    // fill in mat1 with random positive integers <= 100
    #pragma acc loop
    for(int unsigned i = 0; i < MAT1_X; i++)
    {
        for(int unsigned j = 0; j < MAT1_Y; j++)
        {
            mat1[i][j] = (rand() % 100) + 1;
        }
    }
    // fill in mat2 with random positive integers <= 100
    #pragma acc loop
    for(int unsigned i = 0; i < MAT2_X; i++)
    {
        for(int unsigned j = 0; j < MAT2_Y; j++)
        {
            mat2[i][j] = (rand() % 100) + 1;
        }
    }
    // if the matrices can be multiplied, do it
    if(MAT1_Y == MAT2_X)
    {
        //#pragma omp parallel for ordered schedule(auto) collapse(3)
        #pragma acc data copyout(result_mat[0:MAT1_X][0:MAT2_Y]), copyin(mat1[0:MAT1_X][0:MAT1_Y],mat2[0:MAT2_X][0:MAT2_Y])
        #pragma kernels
        for(int unsigned i = 0; i < MAT1_X; i++)
        {
            //#pragma acc loop
            for(int unsigned j = 0; j < MAT2_Y; j++)
            {
                //#pragma acc loop seq
                for(int unsigned k = 0; k < MAT1_Y; k++)
                {
                    result_mat[i][j] += mat1[i][k] * mat2[k][j];
                }
            }
        }
    } else
    {
        cout << "the dimensions of the two matrices don't allow multiplication" << endl;
    }
    // take end time of whole program
    auto prog_stop = high_resolution_clock::now();
    // get the difference in time between program start and finish
    auto prog_duration = duration_cast<microseconds>(prog_stop - prog_start);
    cout << "time taken(program): " << prog_duration.count() << " microseconds." << endl;
}
I am using pgi/19.4 on my class's virtual machine, and I compile and run the code with
pgc++ -ta=tesla -Minfo=accel matmul_acc.cpp
srun -p cisc372 --gres=gpu:1 ./a.out
and receive the following message
srun: error: beowulf: task 0: Segmentation fault (core dumped)
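I am new to OpenACC and PGI, and I have spent the last 3 hours searching the internet for a fix. If anyone knows what is wrong with my code, I would greatly appreciate any suggestions or fixes. Sorry if a similar question already exists; I could not find one that matched my problem.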
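It looks to me like your segfault is caused by stack variables that are too large. Reducing their size made the segfault go away for me (I tried 256), but the real solution is to allocate them dynamically. The indexing becomes a bit more complicated, but you can then run much larger matrices. There are a few other OpenACC issues in the code that you will need to fix next:
1) The #pragma acc loop directives on the initialization loops do nothing. They would need to be #pragma acc parallel loop or #pragma acc kernels loop before the loops are parallelized. Since your data region sits below these loops, you probably just want to remove the loop pragmas entirely.
2) #pragma kernels does nothing either; it needs to be #pragma acc kernels. When I make that change, it builds for the GPU.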
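To make the suggestions above concrete, here is a minimal sketch rather than a definitive fix. For context on the stack problem: three 835 x 835 int arrays need 3 * 835 * 835 * 4 = 8,366,700 bytes, which is just under the 8 MiB (8,388,608 bytes) stack limit that is a common Linux default, so there is almost no room left for anything else. I am assuming that default applies on your cluster. The function name matmul, its parameters n, m and p, and the use of std::vector with flat row-major indexing are my own choices for illustration and do not come from your code. The sketch also sums into a local variable and then assigns, because copyout does not transfer the zeroed host values to the device, so += on the device array would start from uninitialized memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): multiplies an n x m matrix
// by an m x p matrix. Both inputs are flat, row-major, heap-allocated buffers,
// so element (i, j) of a matrix with `cols` columns lives at index i * cols + j.
std::vector<int> matmul(const std::vector<int>& a, const std::vector<int>& b,
                        std::size_t n, std::size_t m, std::size_t p)
{
    assert(a.size() == n * m && b.size() == m * p);

    std::vector<int> c(n * p, 0);  // heap storage instead of a large stack array

    // Raw pointers so the data clauses can describe each buffer's extent.
    const int* pa = a.data();
    const int* pb = b.data();
    int* pc = c.data();

    // Inputs are copied to the device; the result is only copied back.
    #pragma acc data copyin(pa[0:n*m], pb[0:m*p]) copyout(pc[0:n*p])
    {
        // Note the "acc": a plain "#pragma kernels" is ignored.
        #pragma acc kernels
        {
            #pragma acc loop independent
            for (std::size_t i = 0; i < n; i++) {
                #pragma acc loop independent
                for (std::size_t j = 0; j < p; j++) {
                    int sum = 0;  // local accumulator, written once below
                    #pragma acc loop seq
                    for (std::size_t k = 0; k < m; k++) {
                        sum += pa[i * m + k] * pb[k * p + j];
                    }
                    pc[i * p + j] = sum;
                }
            }
        }
    }
    return c;
}
```

In your program you would fill two std::vector<int> buffers of size 835 * 835 on the host (with no OpenACC pragma on those loops, since rand() has to run on the CPU anyway), call matmul(mat1, mat2, 835, 835, 835), and compile with the same pgc++ -ta=tesla -Minfo=accel command. Without OpenACC enabled the pragmas are ignored and the function runs serially on the CPU, which is handy for checking results.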