I have a Spark job that fails with a GC / heap space error. When I check the terminal, I can see the stack trace:
Caused by: org.spark_project.guava.util.concurrent.ExecutionError: java.lang.OutOfMemoryError: Java heap space
at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2261)
at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:890)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:357)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:85)
at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:121)
at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:112)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
... 77 more
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.resize(HashMap.java:703)
at java.util.HashMap.putVal(HashMap.java:628)
at java.util.HashMap.putMapEntries(HashMap.java:514)
at java.util.HashMap.putAll(HashMap.java:784)
at org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3073)
at org.codehaus.janino.UnitCompiler.access$4900(UnitCompiler.java:206)
at org.codehaus.janino.UnitCompiler$8.visitLocalVariableDeclarationStatement(UnitCompiler.java:2958)
at org.codehaus.janino.UnitCompiler$8.visitLocalVariableDeclarationStatement(UnitCompiler.java:2926)
at org.codehaus.janino.Java$LocalVariableDeclarationStatement.accept(Java.java:2974)
at org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2925)
at org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:3033)
at org.codehaus.janino.UnitCompiler.access$4400(UnitCompiler.java:206)
at org.codehaus.janino.UnitCompiler$8.visitSwitchStatement(UnitCompiler.java:2950)
at org.codehaus.janino.UnitCompiler$8.visitSwitchStatement(UnitCompiler.java:2926)
at org.codehaus.janino.Java$SwitchStatement.accept(Java.java:2866)
at org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2925)
at org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2982)
at org.codehaus.janino.UnitCompiler.access$3800(UnitCompiler.java:206)
at org.codehaus.janino.UnitCompiler$8.visitBlock(UnitCompiler.java:2944)
at org.codehaus.janino.UnitCompiler$8.visitBlock(UnitCompiler.java:2926)
at org.codehaus.janino.Java$Block.accept(Java.java:2471)
at org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2925)
at org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2999)
at org.codehaus.janino.UnitCompiler.access$4000(UnitCompiler.java:206)
at org.codehaus.janino.UnitCompiler$8.visitForStatement(UnitCompiler.java:2946)
at org.codehaus.janino.UnitCompiler$8.visitForStatement(UnitCompiler.java:2926)
at org.codehaus.janino.Java$ForStatement.accept(Java.java:2660)
at org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2925)
at org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2982)
at org.codehaus.janino.UnitCompiler.access$3800(UnitCompiler.java:206)
at org.codehaus.janino.UnitCompiler$8.visitBlock(UnitCompiler.java:2944)
at org.codehaus.janino.UnitCompiler$8.visitBlock(UnitCompiler.java:2926)
The problem is that the stack trace does not appear in any of the worker logs (stdout or stderr), whether I go through the web UI or inspect the files on disk directly.
I do have one failed executor on the application, and it only shows (stdout):
17:12:17,008 ERROR [TransportResponseHandler] Still have 1 requests outstanding when connection from /<IP1>:35482 is closed
17:12:17,010 ERROR [CoarseGrainedExecutorBackend] Executor self-exiting due to : Driver <IP1>:35482 disassociated! Shutting down.
The stderr file is empty.
This is a big problem for me, because I don't always get the full log/stack trace in the console, and I'm looking for something reliable and persistent.
The org.codehaus.janino package is used for whole-stage Java code generation (see the line with org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute in the stack trace), which happens on the driver as part of query optimization, before the RDDs are ready for execution.
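Since the compilation runs on the driver, the two knobs that usually matter here are the driver heap size and whole-stage codegen itself. A minimal sketch, assuming a fresh local session (the app name, master, and memory value below are placeholders, not the asker's setup):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: give the driver more heap and, while debugging, turn off
// whole-stage codegen so WholeStageCodegenExec does not ask Janino to compile
// generated code on the driver. Values and names here are illustrative.
val spark = SparkSession.builder()
  .appName("codegen-oom-debug")                      // hypothetical app name
  .master("local[*]")                                // assumption: local mode
  .config("spark.driver.memory", "4g")               // only effective before the driver JVM starts
  .config("spark.sql.codegen.wholeStage", "false")   // skip the codegen compile step
  .getOrCreate()
```

For a deployed cluster application, spark.driver.memory has to be supplied at submit time (for example via spark-submit --driver-memory), because by the time application code runs the driver JVM has already been sized.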
There should be no stack trace in any worker log, because nothing had yet been submitted for execution to the executors (and therefore to the workers). The job failed before an executor ever got to run it.
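Because the failure never leaves the driver, the stack trace only ever exists on the driver side (typically the spark-submit console in client mode, or the driver container's log in cluster mode). As a stopgap for getting a persistent copy, here is a hedged sketch that wraps the action and writes any driver-side stack trace to a file; the query, output path, and log file name are invented for illustration, it reuses the `spark` session from the previous sketch, and catching an OutOfMemoryError is best-effort at most:

```scala
import java.io.{File, PrintWriter, StringWriter}

// Hypothetical query; the real job would be the one whose codegen blows up.
val df = spark.range(0, 1000000).selectExpr("id", "id % 7 AS bucket")

try {
  df.groupBy("bucket").count().write.mode("overwrite").parquet("/tmp/bucket-counts")
} catch {
  case t: Throwable =>
    // Persist the driver-side stack trace to a file of our own choosing,
    // independent of the web UI or executor logs.
    val sw = new StringWriter()
    t.printStackTrace(new PrintWriter(sw))
    val out = new PrintWriter(new File("driver-error.log"))
    try out.write(sw.toString) finally out.close()
    throw t // rethrow so the job still fails visibly
}
```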