PySpark: saving a file as Parquet and reading it back

My PySpark script saves the DataFrame it creates to a directory:

df.write.save(full_path, format=file_format, mode=options['mode'])

If I read this file back within the same run, everything works fine:

return sqlContext.read.format(file_format).load(full_path)

However, when I try to read the files from that directory in a separate script run, I get an error:

java.io.FileNotFoundException: File does not exist: /hadoop/log_files/some_data.json/part-00000-26c649cb-0c0f-421f-b04a-9d6a81bb6767.json

I know I can work around it by following the hint Spark gives:

It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
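For a path-based read like the one above there is no table name to pass to REFRESH TABLE. A minimal sketch of the equivalent steps in PySpark follows. It assumes a `SparkSession` is available as `spark` (the question uses `sqlContext`), and `load_fresh` is a hypothetical helper name, not something from the original script.

```python
def load_fresh(spark, full_path, file_format):
    """Invalidate Spark's cached state for full_path, then build a new DataFrame.

    `spark` is assumed to be a pyspark.sql.SparkSession.
    """
    # Catalog.refreshByPath invalidates cached data and metadata for any
    # DataFrame that contains this data source path.
    spark.catalog.refreshByPath(full_path)
    # Recreating the DataFrame makes Spark list the directory again,
    # so it picks up the part files that exist now.
    return spark.read.format(file_format).load(full_path)


# Hypothetical usage, with the directory taken from the error message:
# df = load_fresh(spark, "/hadoop/log_files/some_data.json", "json")
```

Whether this is enough depends on what else rewrites the directory while the job is running.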

However, I would like to understand why this fails. What is the standard way to handle this kind of problem?

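You are trying to manage two objects that refer to the same files, so Spark's caching for those objects gets in your way, since both target the same files. There is a simple solution here: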

https://stackoverflow.com/a/60328199/5647992
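The linked answer is not reproduced here. As a rough illustration of the point about two objects over the same files, the sketch below writes a DataFrame and then returns a newly created one, rather than reusing an older DataFrame that may still reference part files that no longer exist. The helper name `save_and_reload` and the `spark` session argument are assumptions made for illustration. The write and read calls are the ones from the question.

```python
def save_and_reload(spark, df, full_path, file_format, mode):
    """Write df to full_path, then return a new DataFrame read from that path.

    `spark` is assumed to be a pyspark.sql.SparkSession and `df` a DataFrame.
    """
    # Same write call as in the question.
    df.write.save(full_path, format=file_format, mode=mode)
    # Build a new DataFrame over what was just written instead of keeping
    # an earlier object whose file listing may now be stale.
    return spark.read.format(file_format).load(full_path)
```

Code that needs the data after a write would then use the returned DataFrame and drop any earlier reference to the same path.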
