Optimization hangs when using a nonlinear solver while running under MPI



I am trying to use one of the gradient-free algorithms in OpenMDAO (e.g. the simple genetic algorithm) to solve an optimization problem, using MPI for parallel function evaluations. When my problem has no cycles, I don't run into any issues. However, as soon as I have to use a nonlinear solver to converge a cycle, the process hangs indefinitely after the nl_solver on one of the ranks has finished.
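
For reference, the cycle in question comes from the SellarMDA group, which (judging by the NLBGS messages further down) wraps the two coupled disciplines in a 'cycle' subgroup and assigns it a NonlinearBlockGS solver. A minimal sketch of that pattern, assuming the standard Sellar discipline components from the OpenMDAO test suite, would be:

import openmdao.api as om
from openmdao.test_suite.components.sellar import SellarDis1, SellarDis2

model = om.Group()
# the two coupled disciplines live in their own subgroup...
cycle = model.add_subsystem('cycle', om.Group(), promotes=['*'])
cycle.add_subsystem('d1', SellarDis1(), promotes=['*'])
cycle.add_subsystem('d2', SellarDis2(), promotes=['*'])
# ...and a fixed-point solver converges the coupling on every evaluation
cycle.nonlinear_solver = om.NonlinearBlockGS()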

Below is a code example (solve_sellar.py):

import openmdao.api as om
from openmdao.test_suite.components.sellar_feature import SellarMDA
from openmdao.utils.mpi import MPI

# determine the rank so that results are only printed once
if not MPI:
    rank = 0
else:
    rank = MPI.COMM_WORLD.rank

if __name__ == "__main__":
    prob = om.Problem()
    prob.model = SellarMDA()
    prob.model.add_design_var('x', lower=0, upper=10)
    prob.model.add_design_var('z', lower=0, upper=10)
    prob.model.add_objective('obj')
    prob.model.add_constraint('con1', upper=0)
    prob.model.add_constraint('con2', upper=0)
    # evaluate the GA population in parallel whenever MPI is available
    prob.driver = om.SimpleGADriver(run_parallel=(MPI is not None), bits={"x": 32, "z": 32})
    prob.setup()
    prob.set_solver_print(level=0)
    prob.run_driver()
    if rank == 0:
        print('minimum found at')
        print(prob['x'][0])
        print(prob['z'])
        print('minimum objective')
        print(prob['obj'][0])

As you can see, this code is meant to solve the Sellar problem using the SimpleGADriver included with OpenMDAO. When I simply run this code serially (python3 solve_sellar.py), I get a result after a while, along with the following output:

Unable to import mpi4py. Parallel processing unavailable.
NL: NLBGSSolver 'NL: NLBGS' on system 'cycle' failed to converge in 10 iterations.
<string>:1: RuntimeWarning: overflow encountered in exp
NL: NLBGSSolver 'NL: NLBGS' on system 'cycle' failed to converge in 10 iterations.
minimum found at
0.0
[0. 0.]
minimum objective
0.7779677271254263

If I run it with MPI (mpirun -np 16 python3 solve_sellar.py), I get the following output:

NL: NLBJSolver 'NL: NLBJ' on system 'cycle' failed to converge in 10 iterations.

And then nothing. The command hangs and ties up the allocated processors, but produces no further output. Eventually I killed the command with CTRL-C, after which the processes kept hanging following this output:

[mpiexec@eb26233a2dd8] Sending Ctrl-C to processes as requested
[mpiexec@eb26233a2dd8] Press Ctrl-C again to force abort

So I had to force-abort the process:

Ctrl-C caught... cleaning up processes
[proxy:0:0@eb26233a2dd8] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:883): assert (!closed) failed
[proxy:0:0@eb26233a2dd8] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@eb26233a2dd8] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
[mpiexec@eb26233a2dd8] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@eb26233a2dd8] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@eb26233a2dd8] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@eb26233a2dd8] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion

You should be able to reproduce this in any MPI-enabled OpenMDAO environment, but I have also made a Dockerfile to ensure a consistent environment:

FROM danieldv/hode:latest
RUN pip3 install --upgrade openmdao==2.9.0
ADD . /usr/src/app
WORKDIR /usr/src/app
CMD mpirun -np 16 python3 solve_sellar.py
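
For completeness, the image can be built and run like this (the image tag here is just an example):

docker build -t sellar-mpi .
docker run --rm sellar-mpi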

Does anyone have a suggestion on how to fix this?

Thanks for reporting this. Yes, this looks like a bug we introduced when we fixed the MPI norm calculation on some of the solvers.

This bug is now fixed as of commit c4369225f43e56133d5dd4238d1cdea07d76ecc3. You can get the fix by pulling the latest version from the OpenMDAO GitHub repository, or by waiting for the next release (which will be 2.9.2).
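
For anyone hitting the same hang: one way to pick up the fix before 2.9.2 is released is to install directly from the repository, e.g. (assuming pip and git are available):

pip3 install --upgrade git+https://github.com/OpenMDAO/OpenMDAO.git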
