Databricks PySpark error when reading Excel data from my Azure Blob Storage



I want to read an Excel file containing multiple sheets from my Azure Blob Storage Gen2 using Databricks PySpark. I have already installed the Maven package. Below is my code:

# Note: "header" (spark-excel >= 0.14) and "useHeader" (pre-0.14) are two names
# for the same option; only the one matching the installed version takes effect.
df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("useHeader", "true") \
    .option("treatEmptyValuesAsNulls", "true") \
    .option("inferSchema", "true") \
    .option("sheetName", "sheet1") \
    .option("maxRowsInMemory", 10) \
    .load(file_path)

Running this code, I get the following error:

Py4JJavaError: An error occurred while calling o323.load.
: java.lang.NoClassDefFoundError: Could not initialize class com.crealytics.spark.excel.WorkbookReader$
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:22)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:390)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:444)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:400)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:400)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:287)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)

Any help would be appreciated. Thanks.

Can you verify that the Azure Blob Storage container has been mounted correctly?
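For example, from a notebook you can list the existing mounts and, if the container is not among them, mount it with dbutils.fs.mount. This is only a sketch; the container, storage-account, mount, and secret-scope names below are placeholders, not values from your environment:

# List current mounts to check whether the container is already mounted
for m in dbutils.fs.mounts():
    print(m.mountPoint, "->", m.source)

# Mount the blob container if it is missing (all names are placeholders)
dbutils.fs.mount(
    source="wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
    mount_point="/mnt/<mount-name>",
    extra_configs={
        "fs.azure.account.key.<storage-account-name>.blob.core.windows.net":
            dbutils.secrets.get(scope="<scope-name>", key="<key-name>")
    }
)

Once mounted, the file_path in your reader can point at /mnt/<mount-name>/<path-to-file>.xlsx.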

Check out the official MS documentation: Access Azure Blob storage using the RDD API

Hadoop configuration options are not accessible via SparkContext. If you are using the RDD API to read from Azure Blob storage, you must set the Hadoop credential configuration properties as Spark configuration options when you create the cluster, adding the spark.hadoop. prefix to the corresponding Hadoop configuration keys to propagate them to the Hadoop configurations used for your RDD jobs.

Configure an account access key:

spark.hadoop.fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-access-key>
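Since the question reads through the DataFrame API rather than the RDD API, the account key can also be set at session level from the notebook instead of at cluster creation. A minimal sketch, again with placeholder names:

# Session-level account key (enough for DataFrame reads; RDD jobs still need
# the cluster-level spark.hadoop.* setting quoted above)
spark.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    "<storage-account-access-key>"
)

df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("sheetName", "sheet1") \
    .load("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<path-to-file>.xlsx")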
