mpi4py goes into deadlock after using Spawn

I have the following code arrangement:

parent.py:

import sys
from mpi4py import MPI
... some code ...
for i in range(10):
    ... some code ...
    # Launch 9 child processes and get the intercommunicator to them
    child_comm = MPI.COMM_SELF.Spawn(sys.executable, args=["runscript_airfoil.py"], maxprocs=9)
    child_comm.Barrier()
    child_comm.Disconnect()
    ... some code ...

child.py:

from mpi4py import MPI
... some code ...
comm = MPI.COMM_WORLD
comm.Barrier()

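The main goal here is to run child.py on multiple processors again and again in a loop. I use the Barrier() method because I want the program to wait until child.py has finished executing.

However, the program simply stops after the first iteration. I think it is going into deadlock. Also, all processors used by child.py should be released so that I can use them in the next iteration.

I am new to MPI and mpi4py, so I don't know which function to use where. Any help in achieving this would be much appreciated.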

Edit 1:

Based on the comments, I changed the contents of child.py.

from mpi4py import MPI
... some code ...
comm = MPI.COMM_WORLD
parent_comm = comm.Get_parent()
comm.Barrier()
parent_comm.Disconnect()

The program still gets stuck after the first iteration.

Edit 2:

Based on the comments, I further modified the contents of child.py:

from mpi4py import MPI
... some code ...
comm = MPI.COMM_WORLD
parent_comm = comm.Get_parent()
parent_comm.Barrier()
parent_comm.Disconnect()

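The program no longer deadlocks, but when it tries to spawn in the second iteration it gives the following error: there are not enough slots available in the system to satisfy the 9 slots requested by the application. My laptop has 10 processors in total; in the first iteration, 1 runs parent.py and the other 9 run child.py. When parent.py tries to spawn child.py on 9 processors a second time, it does not reuse the 9 processors used previously but tries to find 9 new ones (which are not available). I think the previous spawn did not exit completely. To test this theory, I ran the original parent.py and child.py (from the second edit) with maxprocs set to 3 and looped three times.

What command should be used to release the processors completely?

Edit 3:

My assessment at the end of Edit 2 is not correct. I found that when I keep maxprocs at 4 or less, it works regardless of the number of loop iterations. Only when I set maxprocs to 5 or more does it start giving the "not enough slots" error. I don't know what is going wrong here.

The following MWE works without going into deadlock (thanks to @Giles for the discussion in the comments!):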

parent.py:

import sys
from mpi4py import MPI
comm = MPI.COMM_WORLD
for i in range(10):
    print("Start {}".format(i))
    child_comm = MPI.COMM_WORLD.Spawn(sys.executable, "child.py", maxprocs=9)
    child_comm.Disconnect()
    print("End {}".format(i))

child.py:

import time
from mpi4py import MPI
comm = MPI.COMM_WORLD
time.sleep(comm.rank)
print(comm.rank)
parent_comm = comm.Get_parent()
parent_comm.Disconnect()

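This is also described in the mpi4py tutorial (I should have seen it earlier).

After this, I ran into another error. When I run parent.py as python parent.py, I typically get the following output: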

Start 0
0
3
4
7
1
6
2
8
5
End 0
Start 1
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 9
slots that were requested by the application:
/home/pavan/miniconda3/envs/codelab/bin/python
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
Traceback (most recent call last):
File "parent.py", line 30, in <module>
child_comm = MPI.COMM_WORLD.Spawn(sys.executable, "child.py", maxprocs=9)
File "mpi4py/MPI/Comm.pyx", line 1931, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_SPAWN: could not spawn processes

My laptop has 10 cores (verified with lscpu, and I have run scripts on 10 processors with the mpirun command). I also ran the script as mpirun -n 1 python parent.py, but I still get the same error.
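The error message itself names two knobs: the --oversubscribe option and a hostfile with "slots=N" clauses. A minimal sketch of how I could build the corresponding mpirun command line is below. build_mpirun_command and the hosts.txt file name are hypothetical, and whether either option actually avoids the error in the second iteration is an assumption I have not verified.

```python
import sys


def build_mpirun_command(script="parent.py", slots=None, hostfile="hosts.txt"):
    """Return an mpirun argument list that relaxes or raises the slot limit.

    slots=None -> pass --oversubscribe (ignore the number of available slots)
    slots=N    -> write a hostfile declaring N slots on localhost
    """
    cmd = ["mpirun"]
    if slots is None:
        cmd.append("--oversubscribe")
    else:
        # Hostfile syntax taken from the Open MPI message: "slots=N" clauses
        with open(hostfile, "w") as fh:
            fh.write("localhost slots={}\n".format(slots))
        cmd += ["--hostfile", hostfile]
    # Start a single parent process; the parent spawns the children itself
    return cmd + ["-n", "1", sys.executable, script]
```

The returned list can be passed to subprocess.run, or joined with spaces and typed into a shell.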

After some tinkering, I found that adding a small pause after the Disconnect call in parent.py works:

parent.py:

import sys
import time
from mpi4py import MPI
comm = MPI.COMM_WORLD
for i in range(10):
    print("Start {}".format(i))
    child_comm = MPI.COMM_WORLD.Spawn(sys.executable, "child.py", maxprocs=9)
    child_comm.Disconnect()
    time.sleep(0.25)  # short pause before the next Spawn
    print("End {}".format(i))

Start 0
7
6
8
4
5
1
0
3
2
End 0
Start 1
7
8
1
6
5
3
2
4
0
End 1

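I don't know why this works, but my guess is that the for loop tries to spawn the next set of processes before the Disconnect method has finished. Adding a small pause therefore gives Disconnect some time to complete. I am not sure whether this is the usual way of doing things with MPI (or mpi4py), but if there is an elegant way to overcome this, please let me know.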
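One variant I am considering, instead of a fixed sleep, is to retry the spawn with a growing delay. This is only a sketch: retry_with_backoff and spawn_and_wait are hypothetical helper names, and it assumes that a failed Spawn raises mpi4py's MPI.Exception (as in the traceback above) and leaves MPI in a state where another attempt is possible, which I have not verified.

```python
import sys
import time


def retry_with_backoff(action, retry_on, attempts=5, first_delay=0.05):
    """Call action() until it succeeds, sleeping longer after each failure."""
    delay = first_delay
    for attempt in range(attempts):
        try:
            return action()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(delay)
            delay *= 2  # double the wait before the next try


def spawn_and_wait(script="child.py", nprocs=9):
    """Spawn the children, retrying if the slots are not free yet."""
    # Imported lazily so the helper above can be used without MPI installed
    from mpi4py import MPI

    # Assumption: a failed spawn raises MPI.Exception and can be retried
    child_comm = retry_with_backoff(
        lambda: MPI.COMM_SELF.Spawn(sys.executable, args=[script], maxprocs=nprocs),
        retry_on=MPI.Exception,
    )
    child_comm.Disconnect()
```

In parent.py, the body of the for loop would then reduce to a call to spawn_and_wait() between the two print statements. This still only works around the timing issue; it does not explain why the slots are not free immediately after Disconnect returns.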
