YARN complains java.net.NoRouteToHostException: No route to host (Host unreachable)



Attempting to run h2o on an HDP 3.1 cluster and running into what look like errors related to YARN resource capacity...

[ml1user@HW04 h2o-3.26.0.1-hdp3.1]$ hadoop jar h2odriver.jar -nodes 3 -mapperXmx 10g
Determining driver host interface for mapper->driver callback...
[Possible callback IP address: 192.168.122.1]
[Possible callback IP address: 172.18.4.49]
[Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.49:46015
(You can override these with -driverif and -driverport/-driverportrange and/or specify external IP using -extdriverif.)
Memory Settings:
mapreduce.map.java.opts:     -Xms10g -Xmx10g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
Extra memory percent:        10
mapreduce.map.memory.mb:     11264
Hive driver not present, not generating token.
19/07/25 14:48:05 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/07/25 14:48:06 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
19/07/25 14:48:07 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/ml1user/.staging/job_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: number of splits:3
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.JobSubmitter: Executing with tokens: []
19/07/25 14:48:08 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.0.0-78/0/resource-types.xml
19/07/25 14:48:08 INFO impl.YarnClientImpl: Submitted application application_1564020515809_0006
19/07/25 14:48:08 INFO mapreduce.Job: The url to track the job: http://HW01.ucera.local:8088/proxy/application_1564020515809_0006/
Job name 'H2O_47159' submitted
JobTracker job ID is 'job_1564020515809_0006'
For YARN users, logs command is 'yarn logs -applicationId application_1564020515809_0006'
Waiting for H2O cluster to come up...
ERROR: Timed out waiting for H2O cluster to come up (120 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
Attempting to clean up hadoop job...
19/07/25 14:50:19 INFO impl.YarnClientImpl: Killed application application_1564020515809_0006
Killed.
19/07/25 14:50:23 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/07/25 14:50:23 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
----- YARN cluster metrics -----
Number of YARN worker nodes: 3
----- Nodes -----
Node: http://HW03.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW04.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW02.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
----- Queues -----
Queue name:            default
Queue state:       RUNNING
Current capacity:  0.00
Capacity:          1.00
Maximum capacity:  1.00
Application count: 0
Queue 'default' approximate utilization: 0.0 / 45.0 GB used, 0 / 9 vcores used
----------------------------------------------------------------------
ERROR: Unable to start any H2O nodes; please contact your YARN administrator.
A common cause for this is the requested container size (11.0 GB)
exceeds the following YARN settings:
yarn.nodemanager.resource.memory-mb
yarn.scheduler.maximum-allocation-mb
----------------------------------------------------------------------
For YARN users, logs command is 'yarn logs -applicationId application_1564020515809_0006'

Looking through the YARN configs in the Ambari UI, I can't find these properties. But checking the YARN logs in the YARN ResourceManager UI and examining some of the logs for the killed application, I see what appear to be host-unreachable errors...

Container: container_e05_1564020515809_0006_02_000002 on HW03.ucera.local_45454_1564102219781
LogAggregationType: AGGREGATED
=============================================================================================
LogType:stderr
LogLastModifiedTime:Thu Jul 25 14:50:19 -1000 2019
LogLength:2203
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/filecache/11/mapreduce.tar.gz/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop/yarn/local/usercache/ml1user/appcache/application_1564020515809_0006/filecache/10/job.jar/job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapred.YarnChild).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
java.net.NoRouteToHostException: No route to host (Host unreachable)
at java.net.PlainSocketImpl.socketConnect(Native Method)
....
at java.net.Socket.<init>(Socket.java:211)
at water.hadoop.EmbeddedH2OConfig$BackgroundWriterThread.run(EmbeddedH2OConfig.java:38)
End of LogType:stderr
***********************************************************************

Note the "java.net.NoRouteToHostException: No route to host (Host unreachable)" error. However, I can access all the other nodes from each node and they can all ping each other, so I'm not sure what is happening here. Any suggestions for debugging or fixing this?
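Worth noting: ping only proves ICMP reachability, while the failing callback is a TCP connection, so a specific port can still be blocked even when ping succeeds. A minimal check, with the address taken from the driver output above and nc assumed to be installed:

# run from a worker node; 172.18.4.49:46015 is the mapper->driver callback address printed above
nc -zv 172.18.4.49 46015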

I think I found the problem. TL;DR: firewalld (the nodes run CentOS 7) was still running, when it should be disabled on HDP clusters.

From another community post:

In order for Ambari to communicate during setup with the hosts it deploys to and manages, certain ports must be open and available. The easiest way to do this is to temporarily disable iptables, as follows:

systemctl disable firewalld

service firewalld stop

So apparently iptables and firewalld need to be disabled across the whole cluster (supporting docs can be found here; I had only disabled them on the Ambari installation node). After stopping these services across the cluster (I recommend using clush), the yarn job was able to run without incident.
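A minimal sketch of doing that with clush (assuming a clush group named "all" that covers every cluster node, and root access on each node):

# stop and disable firewalld on every node in the (assumed) "all" clush group
clush -g all 'systemctl stop firewalld && systemctl disable firewalld'
# verify: each node should report "inactive"
clush -g all 'systemctl is-active firewalld'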

In general, this problem is caused by bad DNS configuration, firewalls, or network unreachability. Quoting this official documentation:

  • The hostname of the remote machine is wrong in the configuration files.
  • The client's host table /etc/hosts has an invalid IP address for the target host.
  • The DNS server's host table has an invalid IP address for the target host.
  • The client's routing tables (in Linux, iptables) are wrong.
  • The DHCP server is publishing bad routing information.
  • Client and server are on different subnets, and are not set up to talk to each other. This may be an accident, or it may be deliberate, e.g. to lock down the Hadoop cluster.
  • The machines are trying to communicate using IPv6. Hadoop does not currently support IPv6.
  • The host's IP address has changed, but a long-lived JVM is caching the old value. This is a known problem with JVMs (search for "java negative DNS caching" for details and solutions). The quick solution: restart the JVM.
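The first three causes above can be checked quickly by comparing what the resolver stack and DNS itself return for the cluster hostnames. A minimal sketch, run on each node (hostname and IP taken from the logs above; dig assumed to be available):

# what /etc/hosts plus DNS actually resolve the hostname to on this node
getent hosts hw01.ucera.local
# what DNS alone returns; compare with the line above
dig +short hw01.ucera.local
# reverse lookup of the ResourceManager IP seen in the logs
getent hosts 172.18.4.46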

For me, the issue was that the driver was inside a Docker container, which made it impossible for the workers to send data back to it. In other words, the workers and the driver were not on the same subnet. The solution given in this answer was to set the following configuration:

spark.driver.host=<container's host IP accessible by the workers>
spark.driver.bindAddress=0.0.0.0
spark.driver.port=<forwarded port 1>
spark.driver.blockManager.port=<forwarded port 2>
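For example, these can be passed straight to spark-submit. A sketch only: the IP, port numbers, and script name below are placeholders for the container's host IP, the two ports forwarded into the container, and the actual job:

# hypothetical values: 172.18.4.49 stands in for the container's host IP,
# 40000/40001 for the two forwarded ports; my_job.py is a placeholder script
spark-submit \
  --conf spark.driver.host=172.18.4.49 \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.driver.port=40000 \
  --conf spark.driver.blockManager.port=40001 \
  my_job.py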
