How to fix an error when Spark reads Hive ORC files



JDK 1.8, Scala 2.12.11, Spark 3.0.1

When I read a Hive table in Scala Spark and write it out as an ORC file, it runs successfully:

df.write.option("compression", "none").mode(SaveMode.Overwrite).orc(dump_path)
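
For context, here is a minimal sketch of the Scala write path, assuming a hypothetical Hive table name and dump path (neither is from the original post):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hive-to-orc")   // hypothetical app name
  .enableHiveSupport()      // required to read Hive tables
  .getOrCreate()

// "my_db.my_table" and dump_path are placeholders
val df = spark.table("my_db.my_table")
val dump_path = "/tmp/knowledge_source_100.orc"
df.write.option("compression", "none").mode(SaveMode.Overwrite).orc(dump_path)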

When I read that exported ORC file in PySpark (Python), it also runs successfully:

dfs = spark.read.orc("/Users/muller/Documents/gitcode/personEtl/knowledge_source_100.orc")

But when I read the same exported ORC file in Scala Spark, I get the following error:

java.lang.ClassCastException: org.apache.orc.impl.ReaderImpl cannot be cast to java.io.Closeable
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2538)
at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.readSchema(OrcUtils.scala:65)
at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.$anonfun$readSchema$4(OrcUtils.scala:88)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at scala.collection.TraversableOnce.collectFirst(TraversableOnce.scala:172)
at scala.collection.TraversableOnce.collectFirst$(TraversableOnce.scala:159)
at scala.collection.AbstractIterator.collectFirst(Iterator.scala:1431)
at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.readSchema(OrcUtils.scala:88)
at org.apache.spark.sql.execution.datasources.orc.OrcUtils$.inferSchema(OrcUtils.scala:128)
at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.inferSchema(OrcFileFormat.scala:96)
at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$11(DataSource.scala:198)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:195)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
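
The Scala read that produces this error is just the counterpart of the PySpark call above; a minimal sketch using the same path:

// Same ORC path as in the PySpark example; the ClassCastException is thrown
// during schema inference, before any data is read.
val dfs = spark.read.orc("/Users/muller/Documents/gitcode/personEtl/knowledge_source_100.orc")
dfs.printSchema()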

I faced the same problem. As the stack trace shows, the failure happens while executing the Utils.tryWithResource() method. If you look at the source code at https://github.com/apache/spark/blob/v3.0.1/core/src/main/scala/org/apache/spark/util/Utils.scala, you can see that it requires a java.io.Closeable object.
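
Paraphrasing the linked source, Utils.tryWithResource looks roughly like the following (a sketch, not a verbatim copy), which makes the Closeable requirement explicit:

import java.io.Closeable

// Sketch of org.apache.spark.util.Utils.tryWithResource: the resource type R
// is bounded by java.io.Closeable, so the ORC Reader that OrcUtils passes in
// must be castable to Closeable at runtime.
def tryWithResource[R <: Closeable, T](createResource: => R)(f: R => T): T = {
  val resource = createResource
  try f(resource) finally resource.close()
}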

"org.apache.orc.impl.ReaderImpl"被打包在hive-exec-**.jar中,它实现了"org.apache.orc.Reader"接口。hive-exec是一个庞大的jar包,它将所有的依赖包打包在jar包和它的包中。也不扩展java.io.Closable"类,这就是失败的原因。

Please add https://repo1.maven.org/maven2/org/apache/orc/orc-core/1.5.10/orc-core-1.5.10.jar to your Spark driver/executor classpath; in this jar, the "org.apache.orc.Reader" interface extends Closeable. The jar should be added at the beginning of the classpath so that "org.apache.orc.Reader" is loaded from the orc-core jar first, rather than from the hive-exec jar.
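
For example (a sketch; the local jar path, main class, and application jar are placeholders you would adjust), the jar can be prepended via the extraClassPath options:

spark-submit \
  --conf spark.driver.extraClassPath=/path/to/orc-core-1.5.10.jar \
  --conf spark.executor.extraClassPath=/path/to/orc-core-1.5.10.jar \
  --class your.main.Class your-app.jar

Both spark.driver.extraClassPath and spark.executor.extraClassPath prepend entries to the respective JVM classpath, which is what lets the orc-core classes win over the copies shaded into hive-exec.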

That is how I solved the problem. I am not sure why it works in PySpark. You can check the jar files on your driver/executor classpath to see which jar "org.apache.orc.Reader" is picked up from.
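
A quick way to check, from spark-shell or inside your job (standard Java reflection, nothing Spark-specific):

// Prints the jar (code source) from which org.apache.orc.Reader was loaded
// on the driver; run the same line inside a task to check the executors.
println(classOf[org.apache.orc.Reader].getProtectionDomain.getCodeSource.getLocation)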
