Spark job on YARN fails



I am trying to run a Spark job on a YARN cluster with the following configuration.

/usr/bin/spark-submit \
--class com.example.DriverClass \
--master yarn-cluster \
app.jar \
hdfs:///user/spark/file1.parquet \
hdfs:///user/spark/file2.parquet \
hdfs:///user/spark/output \
20151217052915 \
--num-executors 20 \
--executor-memory 12288M \
--executor-cores 5 \
--driver-memory 6G \
--conf spark.yarn.executor.memoryOverhead=1332

We run this job with 20 executors, each with 12 GB of memory.

Do we need to increase the spark.yarn.executor.memoryOverhead property?
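
For reference, with --executor-memory 12288M the default overhead would be max(384 MB, 10% of 12288 MB) ≈ 1229 MB, so the current 1332 MB is only slightly above that; raising it to roughly 20% (≈ 2458 MB) is a common next step. Below is a minimal sketch of setting these values programmatically, assuming a hypothetical application name; the values are illustrative, not a confirmed fix.

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative values only: 20% of a 12288 MB executor is roughly 2458 MB.
    // "ExampleApp" is a hypothetical name, not taken from the job above.
    val conf = new SparkConf()
      .setAppName("ExampleApp")
      .set("spark.executor.memory", "12288m")
      .set("spark.executor.cores", "5")
      .set("spark.yarn.executor.memoryOverhead", "2458") // value is in MB on Spark 1.x

    val sc = new SparkContext(conf)

The same setting could equally be passed on the command line as --conf spark.yarn.executor.memoryOverhead=2458.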

Error log:

15/12/18 15:47:39 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 5.0 (TID 117, lpdn0185.com): java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$next$1.apply(ExternalAppendOnlyMap.scala:336)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$next$1.apply(ExternalAppendOnlyMap.scala:331)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.next(ExternalAppendOnlyMap.scala:331)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.next(ExternalAppendOnlyMap.scala:227)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
    at org.apache.spark.rdd.SubtractedRDD.integrate$1(SubtractedRDD.scala:110)
    at org.apache.spark.rdd.SubtractedRDD.compute(SubtractedRDD.scala:119)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
15/12/18 15:47:39 INFO scheduler.TaskSetManager: Starting task 2.1 in stage 5.0 (TID 119, lpdn0185.com, PROCESS_LOCAL, 4237 bytes)
15/12/18 15:47:39 WARN scheduler.TaskSetManager: Lost task 3.0 in stage 5.0 (TID 118, lpdn0185.com): FetchFailed(BlockManagerId(2, lpdn0185..com, 37626), shuffleId=4, mapId=42, reduceId=3, message=
org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer{file=/hdfs1/yarn/nm/usercache/phdpentcustcdibtch/appcache/application_1449986083135_60217/blockmgr-34a2e882-6b36-42c6-bcff-03d9bc5ef80b/0c/shuffle_4_42_0.data, offset=5899394, length=46751}
    at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
    at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
    at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.SubtractedRDD.integrate$1(SubtractedRDD.scala:110)
    at org.apache.spark.rdd.SubtractedRDD.compute(SubtractedRDD.scala:119)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Error in opening FileSegmentManagedBuffer{file=/hdfs1/yarn/nm/usercache/user1/appcache/application_1449986083135_60217/blockmgr-34a2e882-6b36-42c6-bcff-03d9bc5ef80b/0c/shuffle_4_42_0.data, offset=5899394, length=46751}
    at org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:113)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$3.apply(ShuffleBlockFetcherIterator.scala:300)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$3.apply(ShuffleBlockFetcherIterator.scala:300)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53)
    ... 30 more
Caused by: java.io.FileNotFoundException: /hdfs1/yarn/nm/usercache/user1/appcache/application_1449986083135_60217/blockmgr-34a2e882-6b36-42c6-bcff-03d9bc5ef80b/0c/shuffle_4_42_0.data (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:98)
    ... 35 more
)

Thanks for your help with this.

I have been running into the same kind of problem for several weeks now. To be precise, I get a slightly different error each time, including the one you are seeing. Basically, in my case, I believe the data is simply too large for the cluster's capacity.

In short, what I have tried is:

  1. Increasing the executor memory
  2. Increasing spark.yarn.executor.memoryOverhead up to 20% of the executor memory (the default is 10%, with a minimum of 384 MB)
  3. Checking my build and Spark versions
  4. Increasing or decreasing the number of executors according to the number of cluster nodes (how many executors are assigned per node?)
  5. Optimizing the code: minimizing shuffles, e.g. avoiding groupByKey in favor of reduceByKey, aggregateByKey, or combineByKey, and minimizing the temporary files cached internally, e.g. by optimizing the transformations and how many of them are chained (see the sketch after this list)
  6. Considering the number of partitions (how many partitions do the Parquet files have?). In my case, repartitioning via coalesce or a custom partitioner did not help; in fact, performance got worse after repartitioning
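
To illustrate point 5 (and the partition check from point 6), here is a minimal sketch of a per-key sum done with reduceByKey instead of groupByKey. The input and output paths, the field layout, and the coalesce target are hypothetical and only stand in for whatever the real job reads.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("ShuffleSketch"))

    // Hypothetical input: one (key, value) pair per line, tab-separated.
    val pairs = sc.textFile("hdfs:///user/spark/example_input")
      .map { line =>
        val Array(k, v) = line.split("\t")
        (k, v.toLong)
      }

    // groupByKey ships every value for a key across the network before summing,
    // which inflates the shuffle and GC pressure:
    // val sums = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines values map-side first, so far less data is shuffled.
    val sums = pairs.reduceByKey(_ + _)

    // Point 6: inspect the partition count before deciding whether to repartition.
    println(s"partitions = ${sums.partitions.length}")
    // sums.coalesce(200) // only if there are far too many small partitions

    sums.saveAsTextFile("hdfs:///user/spark/example_output")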

Hope this helps!
