Spark standalone mode: Connection to 127.0.1.1:<PORT> refused



I am using Spark 0.7.2 in standalone mode to process ~90 GB (compressed: 19 GB) of log data with 7 workers and one separate master, using the following driver program:

System.setProperty("spark.default.parallelism", "32")
val sc = new SparkContext("spark://10.111.1.30:7077", "MRTest", System.getenv("SPARK_HOME"), Seq(System.getenv("NM_JAR_PATH")))
val logData = sc.textFile("hdfs://10.111.1.30:54310/logs/")
// split() takes a regex, so the pipe delimiter must be escaped
val dcxMap = logData.map(line => (line.split("\\|")(0),
                                  line.split("\\|")(9)))
                    .reduceByKey(_ + " || " + _)
dcxMap.saveAsTextFile("hdfs://10.111.1.30:54310/out")
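As an aside on the driver code above: Scala's `String.split` is Java's, and it interprets its argument as a regular expression, which is why the pipe must be escaped as `"\\|"`. A minimal Java sketch (the sample log line is made up for illustration):

```java
public class SplitDemo {
    public static void main(String[] args) {
        // Hypothetical pipe-delimited log line, mimicking the format above
        String line = "2013-06-21|GET|/index.html";
        // Unescaped "|" is a regex alternation of two empty patterns,
        // so it matches between every character instead of on the pipe.
        String[] wrong = line.split("|");
        // Escaped "\\|" splits on the literal '|' delimiter.
        String[] right = line.split("\\|");
        System.out.println(wrong.length + " fields vs " + right.length + " fields");
        System.out.println(right[1]); // GET
    }
}
```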

After all ShuffleMapTasks of stage 1 have finished:

Stage 1 (reduceByKey at DcxMap.scala:31) finished in 111.312 s

Stage 0 is then submitted:

Submitting Stage 0 (MappedRDD[6] at saveAsTextFile at DcxMap.scala:38), which is now runnable

After some serialization, it prints:

spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host23
spark.MapOutputTracker - Size of output statuses for shuffle 0 is 2008 bytes
spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host21
spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host22
spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host26
spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host24
spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host27
spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host28

After this, nothing happens, and top confirms that the workers are now all idle. Looking at the logs of the worker machines, the same thing occurs on every one of them:

13/06/21 07:32:25 INFO network.SendingConnection: Initiating connection to [host27/127.0.1.1:34288]
13/06/21 07:32:25 INFO network.SendingConnection: Initiating connection to [host27/127.0.1.1:36040] 
13/06/21 07:32:25 INFO network.SendingConnection: Initiating connection to [host27/127.0.1.1:50467]
13/06/21 07:32:25 INFO network.SendingConnection: Initiating connection to [host27/127.0.1.1:60833]
13/06/21 07:32:25 INFO network.SendingConnection: Initiating connection to [host27/127.0.1.1:49893]
13/06/21 07:32:25 INFO network.SendingConnection: Initiating connection to [host27/127.0.1.1:39907]

Then, for each of these "Initiating connection" attempts, it throws the same error on every worker (taking host27's log as an example; only the first occurrence of the error is shown):

13/06/21 07:32:25 WARN network.SendingConnection: Error finishing connection to host27/127.0.1.1:49893
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
    at spark.network.SendingConnection.finishConnect(Connection.scala:221)
    at spark.network.ConnectionManager.spark$network$ConnectionManager$$run(ConnectionManager.scala:127)
    at spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:70)

Why does this happen? The workers seem to communicate with each other just fine; the only problem appears to be when they want to send messages to themselves. In the example above, host27 tries to send itself 6 messages and fails 6 times, while sending messages to the other workers works fine. Does anyone have an idea?
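For what the stack trace itself means: `java.net.ConnectException: Connection refused` on `finishConnect` simply indicates that nothing is listening at the target address and port. A self-contained illustration (plain Java, not Spark code) that reproduces the same exception:

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class RefusedDemo {
    public static void main(String[] args) throws IOException {
        // Ask the OS for a free port, then release it immediately,
        // so we know nothing is listening there.
        int port;
        try (ServerSocket ss = new ServerSocket(0)) {
            port = ss.getLocalPort();
        }
        // Connecting to a loopback address with no listener on that port
        // fails the same way the worker logs above do.
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress("127.0.0.1", port), 1000);
            System.out.println("connected (unexpected)");
        } catch (ConnectException e) {
            System.out.println("java.net.ConnectException: " + e.getMessage());
        }
    }
}
```

In the logs above, the workers try to reach host27's shuffle server ports at 127.0.1.1, where no Spark process is bound, hence the refusal.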

edit: could this be related to Spark using 127.0.1.1? /etc/hosts looks like this:

127.0.0.1       localhost
127.0.1.1       host27.<ourdomain>  host27

I found that this problem is related to this question. However, for me, setting SPARK_LOCAL_IP on the workers did not solve it. I had to change /etc/hosts to:

127.0.0.1       localhost

Now it runs smoothly.
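A sketch of why that /etc/hosts change helps: host-file resolution takes the first line whose name column matches, so with the Debian-style `127.0.1.1 host27` entry present, the worker's own hostname resolves to a loopback address that the other Spark services never bound to; with the entry removed, resolution falls through to DNS and yields the machine's real address. A toy first-match lookup (an illustration of the matching rule, not the real resolver; the domain name is hypothetical):

```java
public class HostsLookupDemo {
    // First-match lookup over /etc/hosts-style lines:
    // fields[0] is the address, the remaining fields are names.
    static String resolve(String[] hostsLines, String name) {
        for (String line : hostsLines) {
            String[] fields = line.trim().split("\\s+");
            for (int i = 1; i < fields.length; i++) {
                if (fields[i].equals(name)) return fields[0];
            }
        }
        return null; // no hosts entry: resolution would fall through to DNS
    }

    public static void main(String[] args) {
        String[] debianStyle = {
            "127.0.0.1       localhost",
            "127.0.1.1       host27.example.com  host27", // hypothetical domain
        };
        String[] fixed = { "127.0.0.1       localhost" };
        System.out.println(resolve(debianStyle, "host27")); // 127.0.1.1
        System.out.println(resolve(fixed, "host27"));       // null
    }
}
```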
