Apache Spark - ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks



I am running a Spark job on a cluster with the following configuration:

--master yarn --deploy-mode client
--executor-memory 4g 
--executor-cores 2 
--driver-memory 6g 
--num-executors 12 
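
For reference, a full spark-submit invocation using these flags might look like the sketch below; the application class and jar names are placeholders, not taken from the original job:

# placeholder main class and application jar; the flags match the ones listed above
spark-submit \
  --master yarn \
  --deploy-mode client \
  --executor-memory 4g \
  --executor-cores 2 \
  --driver-memory 6g \
  --num-executors 12 \
  --class com.example.HistogramJob \
  histogram-job.jar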

The problem occurs in the job when I take a sample of the data and collect it on the driver. The code I run is the following:

rddTuplesA.sample(false, 0.03, 261).collect().forEach((tuple) -> {
        // build histogram...
});

The rddTuplesA object is of type JavaRDD<Tuple3<String, Double, Double>>.
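
For context, a self-contained sketch of this pattern could look as follows. The SparkSession setup, the example tuples, and the bucketing logic are illustrative assumptions (the post only says "build histogram"); only the sample/collect call mirrors the original snippet.

import java.util.List;
import java.util.TreeMap;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import scala.Tuple3;

public class SampleHistogramSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("sample-histogram").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Illustrative data; in the original post rddTuplesA is built elsewhere.
        JavaRDD<Tuple3<String, Double, Double>> rddTuplesA = jsc.parallelize(List.of(
                new Tuple3<>("a", 1.0, 2.0),
                new Tuple3<>("b", 3.0, 4.0)));

        // Sample ~3% without replacement (seed 261) and pull the sample to the driver,
        // as in the original snippet.
        TreeMap<Long, Integer> histogram = new TreeMap<>();
        rddTuplesA.sample(false, 0.03, 261).collect().forEach(tuple -> {
            // Hypothetical bucketing on the second field of each tuple.
            long bucket = (long) Math.floor(tuple._2());
            histogram.merge(bucket, 1, Integer::sum);
        });

        System.out.println(histogram);
        spark.stop();
    }
}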

The job throws the following error:

22/04/14 23:19:22 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to snf-8802/192.168.0.6:35615
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:287)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:123)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:153)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:133)
    at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:143)
    at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102)
    at org.apache.spark.storage.BlockManager.fetchRemoteManagedBuffer(BlockManager.scala:1061)
    at org.apache.spark.storage.BlockManager.$anonfun$getRemoteBlock$8(BlockManager.scala:1005)
    at scala.Option.orElse(Option.scala:447)
    at org.apache.spark.storage.BlockManager.getRemoteBlock(BlockManager.scala:1005)
    at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:1143)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$3.$anonfun$run$1(TaskResultGetter.scala:88)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:63)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: snf-8802/192.168.0.6:35615
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)

However, when I take a smaller sample, the job works perfectly:

rddTuplesA.sample(false, 0.01, 261).collect().forEach((tuple) -> {
        // build histogram...
});

Do I need to change any configuration parameter to make the job run? The problem seems to be network-related. Besides, if this were happening because of a memory issue, wouldn't there be a memory-related error on the driver? Something like:

java.lang.OutOfMemoryError: GC overhead limit exceeded

Finally solved the mystery. The problem was related to the cluster's network. Specifically, I added to the /etc/hosts file of every machine (node) the local IPs of the nodes mapped to their hostnames (as aliases), like:

192.168.0.1 snf-1234
192.168.0.2 snf-1235
...

It seems that when the sample is large, the driver tries to establish connections that cannot be made because of the missing mapping between the IPv4 addresses and the hostnames.
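
As a quick sanity check that the /etc/hosts mapping is in place on a node, a small lookup like the following (an illustrative sketch, not part of the original post) should resolve the alias to the expected private IP:

import java.net.InetAddress;
import java.net.UnknownHostException;

public class ResolveCheck {
    public static void main(String[] args) throws UnknownHostException {
        // Hostname of one of the workers from the error above; replace with your own node alias.
        String host = args.length > 0 ? args[0] : "snf-8802";

        // With the /etc/hosts entry in place this prints the node's private IP
        // (e.g. 192.168.0.6); without it the lookup fails or resolves elsewhere.
        InetAddress address = InetAddress.getByName(host);
        System.out.println(host + " -> " + address.getHostAddress());
    }
}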
