java.lang.StackOverflowError when saving a file to parquet in PySpark



I ran a Glue job and it fails with a java.lang.StackOverflowError when saving the file to parquet. My dataframe has more than 400k rows and 250 columns. Below are the logs:

File "/tmp/glue-job.py", line 147, in transform_to_column_based_format
.save(s3_output_folder_url)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1109, in save
self._jwrite.save(path)
File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o1659.save.
: java.lang.StackOverflowError
at org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:188)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
at org.apache.spark.sql.catalyst.trees.TreeNode.applyFunctionIfChanged$1(TreeNode.scala:387)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:423)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:255)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:421)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:369)
at org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:192)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
at org.apache.spark.sql.catalyst.trees.TreeNode.applyFunctionIfChanged$1(TreeNode.scala:387)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:423)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:255)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:421)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:369)
at org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:192)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
at org.apache.spark.sql.catalyst.trees.TreeNode.applyFunctionIfChanged$1(TreeNode.scala:387)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:423)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:255)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:421)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:369)
at org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:192)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:423)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:255)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:421)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:369)
at org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:192)
at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
at org.apache.spark.sql.c
2022-12-16 13:37:28,789 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): Error from Python:
Traceback (most recent call last):
  File "/tmp/glue-job.py", line 228, in <module>
    LeadDMSMirror().main()
  File "/tmp/glue-job.py", line 224, in main
    self.load_and_update_to_delta_table(table_name=self.historical_table_name, primary_key=self.current_table_name_pk, is_history_table=True)
  File "/tmp/glue-job.py", line 214, in load_and_update_to_delta_table
    self.transform_to_column_based_format(current_df, full_load_df, primary_key, s3_output_folder_url, is_full_load=True)
  File "/tmp/glue-job.py", line 147, in transform_to_column_based_format
    .save(s3_output_folder_url)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1109, in save
    self._jwrite.save(path)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o1659.save.
: java.lang.StackOverflowError
    at org.apache.spark.sql.catalyst.plans.QueryPlan.rewrite$1(QueryPlan.scala:188)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformUpWithNewOutput$1(QueryPlan.scala:193)
    at org.apache.spark.sql.catalyst.trees.TreeNode.applyFunctionIfChanged$1(TreeNode.scala:387)
    at

I don't know what the bug is, and I'm hoping to find a way to fix it.

Welcome to Stack Overflow!

From the stack trace, you can see that Spark is having problems while making the query plan, in a circular kind of way. This smells like recursion gone wrong: it keeps calling QueryPlan → TreeNode → QueryPlan → TreeNode → …

This is the classic kind of problem that makes your stack overflow. As causes (and solutions) for these problems, I can think of the following:

  • Cause: you added some recursion to your code that has gone wrong somehow.
    • Solution: make sure you are not calling a function from within itself or inside a loop, and try to keep your code as simple as possible (see the first sketch after this list).
  • Cause: your data is very complex and deeply nested, so the query plan that has to be made for it is very complex.
    • Solution: increase the JVM stack size on the driver. The default stack size (depending on your JVM) is 256 kB - 1 MB; take 4 MB, for example. A sketch of how to do this in PySpark is given as the second example after this list.
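To illustrate the first point: a pattern that often produces an ever-deeper query plan is building up a DataFrame iteratively, for example with `withColumn` in a loop. The sketch below is hypothetical (the sizes and column names are made up, not taken from the asker's job), but it shows the kind of nesting that Catalyst then has to rewrite recursively:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Anti-pattern: every withColumn call wraps the plan in another
# projection node, so a loop over 250 columns yields a deeply
# nested plan; with enough nesting, the recursive plan rewrites
# can exhaust the JVM stack.
df = spark.range(1000)
for i in range(250):
    df = df.withColumn(f"col_{i}", F.col("id") * i)

# Simpler equivalent: build all the columns in one select, which
# produces a single projection node instead of 250 nested ones.
df_simple = spark.range(1000).select(
    "id", *[(F.col("id") * i).alias(f"col_{i}") for i in range(250)]
)
```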

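And for the second point, a minimal sketch of raising the stack size. The 4 MB value and the Glue job parameter shown in the comments are assumptions to adapt, not settings verified against the asker's job:

```python
from pyspark.sql import SparkSession

# The driver JVM is usually already running by the time Python
# code executes, so its stack size must be set at launch time,
# e.g. with spark-submit:
#   spark-submit --driver-java-options "-Xss4m" job.py
# or, for an AWS Glue job, via a job parameter along the lines of
#   key:   --conf
#   value: spark.driver.extraJavaOptions=-Xss4m
spark = (
    SparkSession.builder
    # Executors are launched later, so their stack size can still
    # be requested from here if they hit the same error.
    .config("spark.executor.extraJavaOptions", "-Xss4m")
    .getOrCreate()
)
```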
Hope this helps!
