Memory allocation in a Spark Scala application

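I am running a Spark Scala job with the spark-submit command. My code is written in Spark SQL: it joins two tables and loads the result into a third Hive table. The code works, but sometimes I run into problems such as out-of-memory errors (Java heap size) and timeout errors. So I want to control the job manually by passing the number of executors, the number of cores, and the amount of memory. When I use 16 executors, 1 core, and 20 GB of executor memory, my Spark application gets stuck. Can someone suggest how to control my Spark application manually by providing the right parameters? Also, are there any other Hive- or Spark-specific parameters I can use to make it run faster?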

Below is the configuration of my cluster.

Number of Nodes: 5
Number of Cores per Node: 6
RAM per Node: 125 GB

Spark submit command:

spark-submit --class org.apache.spark.examples.sparksc \
  --master yarn-client \
  --num-executors 16 \
  --executor-memory 20g \
  --executor-cores 1 \
  examples/jars/spark-examples.jar

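It depends on the volume of your data. You can make the parameters dynamic. This link has a very good explanation: How to tune spark executor number, cores and executor memory?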
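To make the sizing concrete, here is a rough sketch of one commonly cited rule of thumb applied to the cluster in the question. The reserved core and memory per node, the roughly 10% overhead budget, and the choice of `cores_per_executor` are assumptions for illustration, not values from this thread, and the result still has to fit within whatever container limits YARN is configured with.

```shell
#!/bin/sh
# Cluster figures from the question.
nodes=5
cores_per_node=6
ram_per_node_gb=125

# Assumptions (not from the thread): keep 1 core and 1 GB per node for the OS
# and Hadoop daemons, and budget about 10% of each container for off-heap overhead.
reserved_cores=1
reserved_ram_gb=1
cores_per_executor=5   # hypothetical starting point; smaller values are worth trying

usable_cores=$((cores_per_node - reserved_cores))
executors_per_node=$((usable_cores / cores_per_executor))

# Leave one executor's worth of room for the YARN ApplicationMaster.
num_executors=$((nodes * executors_per_node - 1))

# Memory available to each executor container on a node.
container_gb=$(( (ram_per_node_gb - reserved_ram_gb) / executors_per_node ))

# Heap plus ~10% overhead must fit in the container: heap = container / 1.10.
executor_memory_gb=$((container_gb * 100 / 110))

flags="--num-executors $num_executors --executor-cores $cores_per_executor --executor-memory ${executor_memory_gb}g"
echo "$flags"
```

With these inputs the script prints `--num-executors 4 --executor-cores 5 --executor-memory 112g`. A heap that large may lead to long garbage-collection pauses; setting `cores_per_executor=2` instead gives 9 executors with 56g each, which may be a better starting point to experiment from.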

You can enable spark.shuffle.service.enabled and use spark.sql.shuffle.partitions=400, hive.exec.compress.intermediate=true, hive.exec.reducers.bytes.per.reducer=536870912, hive.exec.compress.output=true, hive.output.codec=snappy, mapred.output.compression.type=BLOCK.

If your data is larger than 700 MB, you can enable the spark.speculation property.
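As a sketch of how the Spark-side settings from this answer could be passed on the command line, the script below assembles a spark-submit invocation with `--conf` flags and prints it rather than running it, so it can be checked first. It assumes the property meant in the last sentence is `spark.speculation`, and it leaves out the hive.* and mapred.* settings because how those are applied depends on the setup. The class name and jar path are copied from the question as given.

```shell
#!/bin/sh
# Dry run: build the spark-submit command line and print it instead of executing it.

# Values taken from the thread.
shuffle_partitions=400                          # from the answer; tune to the data volume
main_class=org.apache.spark.examples.sparksc    # as written in the question
app_jar=examples/jars/spark-examples.jar        # as written in the question

cmd="spark-submit --class $main_class"
# "--master yarn --deploy-mode client" is the newer spelling of "yarn-client".
cmd="$cmd --master yarn --deploy-mode client"
# Needs the external shuffle service to be running on the YARN NodeManagers.
cmd="$cmd --conf spark.shuffle.service.enabled=true"
cmd="$cmd --conf spark.sql.shuffle.partitions=$shuffle_partitions"
# Assumption: the answer refers to speculative execution (spark.speculation).
cmd="$cmd --conf spark.speculation=true"
cmd="$cmd $app_jar"

echo "$cmd"
```

Executor count, cores, and memory flags can be appended to `cmd` in the same way once suitable values have been chosen.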
