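I am using the papermill library to run multiple notebooks simultaneously with multiprocessing.
This is happening on Python 3.6.6, Red Hat 4.8.2-15, inside a Docker container.
However, when I run the Python script, about 5% of my notebooks fail immediately (no Jupyter Notebook cells run) because I receive this error: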
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "<decorator-gen-124>", line 2, in initialize
  File "/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 469, in initialize
    self.init_sockets()
  File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 238, in init_sockets
    self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
  File "/opt/conda/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 180, in _bind_socket
    s.bind("tcp://%s:%i" % (self.ip, port))
  File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use
And:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "main.py", line 77, in run_papermill
    pm.execute_notebook(notebook, output_path, parameters=config)
  File "/opt/conda/lib/python3.6/site-packages/papermill/execute.py", line 104, in execute_notebook
    **engine_kwargs
  File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 49, in execute_notebook_with_engine
    return self.get_engine(engine_name).execute_notebook(nb, kernel_name, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 304, in execute_notebook
    nb = cls.execute_managed_notebook(nb_man, kernel_name, log_output=log_output, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/papermill/engines.py", line 372, in execute_managed_notebook
    preprocessor.preprocess(nb_man, safe_kwargs)
  File "/opt/conda/lib/python3.6/site-packages/papermill/preprocess.py", line 20, in preprocess
    with self.setup_preprocessor(nb_man.nb, resources, km=km):
  File "/opt/conda/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/opt/conda/lib/python3.6/site-packages/nbconvert/preprocessors/execute.py", line 345, in setup_preprocessor
    self.km, self.kc = self.start_new_kernel(**kwargs)
  File "/opt/conda/lib/python3.6/site-packages/nbconvert/preprocessors/execute.py", line 296, in start_new_kernel
    kc.wait_for_ready(timeout=self.startup_timeout)
  File "/opt/conda/lib/python3.6/site-packages/jupyter_client/blocking/client.py", line 104, in wait_for_ready
    raise RuntimeError('Kernel died before replying to kernel_info')
RuntimeError: Kernel died before replying to kernel_info
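Please help me with this problem. I have searched the web for different solutions, and so far none of them has worked in my case.
The 5% error rate occurs regardless of how many notebooks I run concurrently or how many cores my machine has, which makes it especially curious.
I have tried changing the start method and updating the libraries, but to no avail.
The versions of my libraries are: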
papermill==1.2.1
ipython==7.14.0
jupyter-client==6.1.3
Thanks!
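The obvious root cause is that ZeroMQ is not able to successfully .bind().
The error message zmq.error.ZMQError: Address already in use is the easier one to explain. ZeroMQ AccessPoint-s can freely try to .connect() to many counterparties, yet one and only one can .bind() onto a particular transport-class address target.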
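The one-owner rule can be reproduced without ZeroMQ at all. The following is a minimal sketch using plain TCP sockets from the Python standard library, as an assumption-laden stand-in for the tcp:// transport seen in the traceback; the helper name second_bind_error is hypothetical and is not part of pyzmq, ipykernel, or papermill.

```python
import errno
import socket


def second_bind_error(host="127.0.0.1"):
    """Bind one TCP socket, then try to bind a second one to the same address.

    Returns the errno of the second attempt's failure, or None if the
    second bind unexpectedly succeeded.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))       # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind((host, port))   # same address: only one owner allowed
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    # Compare the failure code with the documented "address in use" errno.
    print(second_bind_error() == errno.EADDRINUSE)
```

The second bind fails with the same "address already in use" condition that surfaces as zmq.error.ZMQError in the first traceback.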
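There are three potential reasons for this to happen:
1) Copies spawned via { multiprocessing.Process | joblib.Parallel | Docker-wrapped | ... } accidentally call some code (its internal details are not known here) in which each copy tries to take ownership of the same ZeroMQ transport-class address. Once the first one succeeds, every later attempt must fail.
2) A rather fatal case, in which some previously run process failed to release the transport-class address for further use. ZeroMQ may be only one of several candidates interested in that address (a configuration-management flaw), or an earlier run failed to terminate its resource use gracefully and left a Context() instance still waiting (in some cases indefinitely, until the O/S is rebooted) for something that will never happen.
3) A genuinely poor engineering practice in the module's software design: not handling the documented EADDRINUSE error/exception of the ZeroMQ API in any way less brutal than letting the whole circus crash (and paying all the associated costs of doing so).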
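On the caller's side, the third point can be softened by catching the failure and trying again instead of letting the worker crash. The sketch below is only a workaround under stated assumptions: run_with_retries is a hypothetical helper, not a papermill or jupyter_client API, and it assumes the failure is transient and surfaces as RuntimeError, as in the second traceback. It does not remove the underlying address collision.

```python
import time


def run_with_retries(task, attempts=3, delay=1.0, retry_on=(RuntimeError,)):
    """Call task() until it succeeds or the attempts are exhausted.

    task     -- zero-argument callable to run
    attempts -- maximum number of calls, must be at least 1
    delay    -- base pause in seconds; the pause grows linearly per attempt
    retry_on -- exception types treated as transient
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except retry_on:
            if attempt == attempts:
                raise                   # out of attempts: surface the error
            time.sleep(delay * attempt)  # back off before the next try
```

With the names from the question's traceback, a call might look like run_with_retries(lambda: pm.execute_notebook(notebook, output_path, parameters=config)).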
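The other error message, RuntimeError: Kernel died before replying to kernel_info, relates to a state in which the notebook's kernel never completes all its internal connections with its own components (pool peers) within the configured or hard-coded wait. The side waiting for it (wait_for_ready in the second traceback) simply stops waiting and raises the unhandled exception you observed and reported.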
Solution
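First check for any lingering address owners; if in doubt, reboot all nodes. Then verify that no colliding attempts are "hidden" in your own code or in { multiprocessing.Process() | joblib.Parallel() | ... the likes } calls that, once distributed, may try to .bind() onto the same target. If neither step resolves the problem within your domain of control, ask the maintainers of the module you use for support in analyzing, refactoring, and validating the still-colliding use case.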
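The first check, looking for a lingering owner of an address, can be sketched with a short probe from the Python standard library. The helper name is_address_free is hypothetical, and which host and port to probe is an assumption you have to supply yourself. Note that the probe is only a snapshot: another process can still take the address right after it returns.

```python
import socket


def is_address_free(host, port):
    """Return True if a TCP socket can currently be bound to (host, port).

    Any OSError (address in use, missing permission, ...) counts as
    "not free". The probe socket is closed again before returning.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True
```

A False result for a port that nothing should be using is a hint that an earlier run did not release it.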