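I recently started learning MPI and, as you might have guessed, I have already hit an error that I can't solve on my own!
I want to write a program that multiplies two matrices. I haven't got that far yet, though; in fact, I'm stuck at the very first step, broadcasting the matrices.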
#define MASTER 0
if (rank == MASTER) {
    A = (double *) malloc(N * N * sizeof(double));
    B = (double *) malloc(N * N * sizeof(double));
    matFillRand(N, A);
    matFillRand(N, B);
}
if (rank == MASTER) {
    P = (double *) malloc(N * N * sizeof(double));
}
matMulMPI(N, A, B, P);
if (rank == MASTER) {
    printMatrix(N, P);
}
The function that (in theory) does the math looks like this:
void matMulMPI(long N, double *a, double *b, double *c) {
    long i, j, k;
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Bcast(&N, 1, MPI_LONG, MASTER, MPI_COMM_WORLD);
    MPI_Bcast(b, N*N, MPI_DOUBLE, MASTER, MPI_COMM_WORLD);
    printMatrix(N, b);
    //TO-DO: Broadcast A
    //TO-DO: Do Math
}
The broadcast doesn't work. I get the following message:
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Invalid permissions (2)
Failing at address: 0x401560
Signal: Segmentation fault (11)
Signal code: Invalid permissions (2)
Failing at address: 0x401560
[ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7fc3ede6b340]
[ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x981c0)[0x7fc3edb2e1c0]
[ 2] /usr/lib/libmpi.so.1(opal_convertor_unpack+0x105)[0x7fc3ee1788d5]
[ 3] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_match+0x460)[0x7fc3e6587630]
[ 4] /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x487)[0x7fc3e572a137]
[ 5] /usr/lib/libmpi.so.1(opal_progress+0x5a)[0x7fc3ee1849ea]
[ 6] /usr/lib/libmpi.so.1(ompi_request_default_wait+0x16d)[0x7fc3ee0d1c0d]
[ 7] /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_generic+0x49e)[0x7fc3e486da9e]
[ 8] /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_binomial+0xb7)[0x7fc3e486df27]
[ 9] /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0xcc)[0x7fc3e486573c]
[10] /usr/lib/openmpi/lib/openmpi/mca_coll_sync.so(mca_coll_sync_bcast+0x64)[0x7fc3e4a7d6a4]
[11] /usr/lib/libmpi.so.1(MPI_Bcast+0x13d)[0x7fc3ee0df78d]
[12] /matMul()[0x401a9]
[13] /matMul()[0x401458]
[14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fc3edab7ec5]
[15] /matMul()[0x400b49]
*** End of error message ***
[ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7fa4a1fe340]
[ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x981c0)[0x7fa4a1ca81c0]
[ 2] /usr/lib/libmpi.so.1(opal_convertor_unpack+0x105)[0x7fa4a22f28d5]
[ 3] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_match+0x460)[0x7fa49a701630]
[ 4] /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x487)[0x7fa4998a4137]
[ 5] /usr/lib/libmpi.so.1(opal_progress+0x5a)[0x7fa4a22fe9ea]
[ 6] /usr/lib/libmpi.so.1(ompi_request_default_wait+0x16d)[0x7fa4a224bc0d]
[ 7] /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_generic+0x4e0)[0x7fa4989e7ae0]
[ 8] /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_binomial+0xb7)[0x7fa4989e7f27]
[ 9] /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0xcc)[0x7fa4989df73c]
[10] /usr/lib/openmpi/lib/openmpi/mca_coll_sync.so(mca_coll_sync_bcast+0x64)[0x7fa498bf76a4]
[11] /usr/lib/libmpi.so.1(MPI_Bcast+0x13d)[0x7fa4a252978d]
[12] /matMul()[0x401a9]
[13] /matMul()[0x401458]
[14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fa4a1c1c1ec5]
[15] /matMul()[0x400b49]
*** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 12466 on node rtidev5.etf.bg.ac.rs exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
2 total processes killed (some possibly by mpirun during cleanup)
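I've figured it out. Every process, not just the master, has to allocate memory for the matrix before the broadcast, because MPI_Bcast writes the received data into the buffer that each rank passes in.
So the missing line is the malloc between the two broadcasts: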
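void matMulMPI(long N, double *a, double *b, double *c) {
    ...
    MPI_Bcast(&N, 1, MPI_LONG, MASTER, MPI_COMM_WORLD);
    /* The master already holds its filled copy of b; allocating again
       there would broadcast uninitialized memory and leak the original. */
    if (rank != MASTER)
        b = (double *) malloc(N * N * sizeof(double));
    MPI_Bcast(b, N*N, MPI_DOUBLE, MASTER, MPI_COMM_WORLD);
    ...
}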