py4j.protocol.Py4JJavaError: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found



I am trying to read a CSV file from PySpark, and reading it throws:

py4j.protocol.Py4JJavaError: An error occurred while calling o30.csv.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2595)
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3269)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
	at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:376)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
	at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:796)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2499)
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2593)
	... 25 more

Process finished with exit code 1

Can anyone tell me where I am going wrong in the code below?

from pyspark.sql import SparkSession
SECRET_ACCESS_KEY = "XXXXXXXXXXX"
STORAGE_NAME = "azuresvkstorageaccount11123"
CONTAINER = "inputstorage1"
FILE_NAME = "movies.csv"

spark = SparkSession.builder.appName("Azure_PySpark_Connectivity") \
    .master("local[*]") \
    .getOrCreate()
fs_acc_key = "fs.azure.account.key." + STORAGE_NAME + ".blob.core.windows.net"
spark.conf.set("spark.hadoop.fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set(fs_acc_key, SECRET_ACCESS_KEY)
file_path = "wasb://inputstorage1@azuresvkstorageaccount.blob.core.windows.net/movies.csv"
print(file_path)
Df = spark.read.csv(path=file_path,header=True,inferSchema=True) #Error Coming from this line it is unable to read the csv file
#Df.show(20,True)
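A separate issue worth noting: the hard-coded `file_path` uses the account name `azuresvkstorageaccount`, while `STORAGE_NAME` is `azuresvkstorageaccount11123`. Building the URL from the same variables avoids that kind of mismatch. A minimal sketch (the `wasb_url` helper is my own name, not part of the original code):

```python
STORAGE_NAME = "azuresvkstorageaccount11123"
CONTAINER = "inputstorage1"
FILE_NAME = "movies.csv"

def wasb_url(container: str, account: str, blob: str) -> str:
    # wasb://<container>@<account>.blob.core.windows.net/<blob path>
    return f"wasb://{container}@{account}.blob.core.windows.net/{blob}"

file_path = wasb_url(CONTAINER, STORAGE_NAME, FILE_NAME)
print(file_path)
```

This way the account key config (built from `STORAGE_NAME`) and the read path can never point at two different storage accounts.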

I have solved the problem. It came from the Maven jars; the solution is:

  1. Manually download the hadoop-azure and azure-storage jars from the Maven portal and copy them into the spark/jars/ folder.
  2. Do the same for the jetty-utils jar, adding it to the spark/jars/ folder as well.
  3. Then refresh and run the script again; it works perfectly.
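As an alternative to copying jars by hand, Spark can resolve the same dependencies from Maven at launch time with `--packages`. A sketch of the invocation (the versions and the script name are assumptions; match the hadoop-azure version to your Hadoop build):

```shell
# Let Spark fetch hadoop-azure and its azure-storage dependency from
# Maven Central instead of copying jars into spark/jars/.
# Versions are illustrative -- align them with your Hadoop version.
spark-submit \
  --packages org.apache.hadoop:hadoop-azure:3.3.4,com.microsoft.azure:azure-storage:8.6.6 \
  your_script.py
```

This keeps the dependency versions in one visible place and avoids stale jars lingering in spark/jars/ after an upgrade.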
