Apache Arrow OutOfMemoryException when PySpark reads a Hive table to Pandas



I searched for this error but couldn't find any information on how to fix it. This is what I get when I run the two scripts below:

org.apache.arrow.memory.OutOfMemoryException: Failure while allocating memory.

write.py

import pandas as pd
from pyspark.sql import SparkSession
from os.path import abspath
warehouse_location = abspath('spark-warehouse')
booksPD = pd.read_csv('books.csv')
spark = SparkSession.builder \
    .appName("MyApp") \
    .master("local[*]") \
    .config("spark.sql.execution.arrow.enabled", "true") \
    .config("spark.driver.maxResultSize", "16g") \
    .config("spark.python.worker.memory", "16g") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")
spark.createDataFrame(booksPD).write.saveAsTable("books")
spark.catalog.clearCache()

read.py

from pyspark.sql import SparkSession
from os.path import abspath
warehouse_location = abspath('spark-warehouse')
spark = SparkSession.builder \
    .appName("MyApp") \
    .master("local[*]") \
    .config("spark.sql.execution.arrow.enabled", "true") \
    .config("spark.driver.maxResultSize", "16g") \
    .config("spark.python.worker.memory", "16g") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")
books = spark.sql("SELECT * FROM books").toPandas()

Most likely, the memory limits have to be increased. Appending the following configuration to raise the driver and executor memory solved my problem.

.config("spark.driver.memory", "16g") 
.config("spark.executor.memory", "16g") 

Since the program is configured to run in local mode (.master("local[*]")), the driver also takes on part of the load and needs enough memory.
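
For reference, here is a minimal sketch of read.py with the two extra .config() calls merged into the existing builder chain; the "16g" values simply mirror the other settings above and should be sized to the RAM actually available on the machine:

from pyspark.sql import SparkSession
from os.path import abspath

warehouse_location = abspath('spark-warehouse')

# Same session builder as in read.py, plus explicit driver and executor memory.
spark = SparkSession.builder \
    .appName("MyApp") \
    .master("local[*]") \
    .config("spark.sql.execution.arrow.enabled", "true") \
    .config("spark.driver.memory", "16g") \
    .config("spark.executor.memory", "16g") \
    .config("spark.driver.maxResultSize", "16g") \
    .config("spark.python.worker.memory", "16g") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")

# With more driver memory available, the Arrow-based conversion has room to allocate its buffers.
books = spark.sql("SELECT * FROM books").toPandas()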
