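I am trying to convert a PySpark DataFrame of size [2734984 rows x 11 columns] to a pandas DataFrame by calling toPandas(). In an Azure Databricks notebook this works fine (11 seconds), but when I run exactly the same code through databricks-connect I get a java.lang.OutOfMemoryError: Java heap space exception (the databricks-connect version and the Databricks Runtime version match; both are 7.1).
I have already increased the Spark driver memory (100g) and maxResultSize (15g). I suspect the error lies somewhere in databricks-connect, because I cannot reproduce it in notebooks.
Any hints as to what is going on?
The error is as follows: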
Exception in thread "serve-Arrow" java.lang.OutOfMemoryError: Java heap space
at com.ning.compress.lzf.ChunkDecoder.decode(ChunkDecoder.java:51)
at com.ning.compress.lzf.LZFDecoder.decode(LZFDecoder.java:102)
at com.databricks.service.SparkServiceRPCClient.executeRPC0(SparkServiceRPCClient.scala:84)
at com.databricks.service.SparkServiceRemoteFuncRunner.withRpcRetries(SparkServiceRemoteFuncRunner.scala:234)
at com.databricks.service.SparkServiceRemoteFuncRunner.executeRPC(SparkServiceRemoteFuncRunner.scala:156)
at com.databricks.service.SparkServiceRemoteFuncRunner.executeRPCHandleCancels(SparkServiceRemoteFuncRunner.scala:287)
at com.databricks.service.SparkServiceRemoteFuncRunner.$anonfun$execute0$1(SparkServiceRemoteFuncRunner.scala:118)
at com.databricks.service.SparkServiceRemoteFuncRunner$$Lambda$934/2145652039.apply(Unknown Source)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.service.SparkServiceRemoteFuncRunner.withRetry(SparkServiceRemoteFuncRunner.scala:135)
at com.databricks.service.SparkServiceRemoteFuncRunner.execute0(SparkServiceRemoteFuncRunner.scala:113)
at com.databricks.service.SparkServiceRemoteFuncRunner.$anonfun$execute$1(SparkServiceRemoteFuncRunner.scala:86)
at com.databricks.service.SparkServiceRemoteFuncRunner$$Lambda$1031/465320026.apply(Unknown Source)
at com.databricks.spark.util.Log4jUsageLogger.recordOperation(UsageLogger.scala:210)
at com.databricks.spark.util.UsageLogging.recordOperation(UsageLogger.scala:346)
at com.databricks.spark.util.UsageLogging.recordOperation$(UsageLogger.scala:325)
at com.databricks.service.SparkServiceRPCClientStub.recordOperation(SparkServiceRPCClientStub.scala:61)
at com.databricks.service.SparkServiceRemoteFuncRunner.execute(SparkServiceRemoteFuncRunner.scala:78)
at com.databricks.service.SparkServiceRemoteFuncRunner.execute$(SparkServiceRemoteFuncRunner.scala:67)
at com.databricks.service.SparkServiceRPCClientStub.execute(SparkServiceRPCClientStub.scala:61)
at com.databricks.service.SparkServiceRPCClientStub.executeRDD(SparkServiceRPCClientStub.scala:225)
at com.databricks.service.SparkClient$.executeRDD(SparkClient.scala:279)
at com.databricks.spark.util.SparkClientContext$.executeRDD(SparkClientContext.scala:161)
at org.apache.spark.scheduler.DAGScheduler.submitJob(DAGScheduler.scala:864)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:928)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2331)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2426)
at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$6(Dataset.scala:3638)
at org.apache.spark.sql.Dataset$$Lambda$3567/1086808304.apply$mcV$sp(Unknown Source)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$3(Dataset.scala:3642)
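This is likely because databricks-connect executes toPandas on the client machine, which can then run out of memory. You can increase the local driver memory by setting spark.driver.memory in the (local) configuration file ${spark_home}/conf/spark-defaults.conf, where ${spark_home} can be obtained by running databricks-connect get-spark-home.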
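As a rough sketch of that change, the Python snippet below locates the Spark home via the databricks-connect CLI and writes a spark.driver.memory entry into conf/spark-defaults.conf. The helper names get_spark_home and set_driver_memory, and the 8g value in the usage comment, are made up for illustration; pick a value that fits the RAM of your client machine. It assumes the CLI prints only the path.

```python
import re
import subprocess
from pathlib import Path


def get_spark_home() -> Path:
    """Ask the databricks-connect CLI where its Spark home is (hypothetical helper)."""
    result = subprocess.run(
        ["databricks-connect", "get-spark-home"],
        capture_output=True, text=True, check=True,
    )
    return Path(result.stdout.strip())


def set_driver_memory(spark_home: Path, memory: str) -> Path:
    """Set spark.driver.memory in <spark_home>/conf/spark-defaults.conf."""
    conf_file = Path(spark_home) / "conf" / "spark-defaults.conf"
    conf_file.parent.mkdir(parents=True, exist_ok=True)
    lines = conf_file.read_text().splitlines() if conf_file.exists() else []

    def key_of(line: str) -> str:
        # The property name is the text before the first whitespace or '='.
        return re.split(r"[\s=]+", line.strip(), maxsplit=1)[0]

    # Drop any previous spark.driver.memory entry, keep everything else as-is.
    kept = [ln for ln in lines if key_of(ln) != "spark.driver.memory"]
    kept.append(f"spark.driver.memory {memory}")
    conf_file.write_text("\n".join(kept) + "\n")
    return conf_file


# Example usage (the value is an assumption, not a recommendation):
# set_driver_memory(get_spark_home(), "8g")
```

The setting is read when the local JVM starts, so you will most likely need to start a fresh Python process or Spark session before retrying toPandas().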