How to resolve a Spark error when reading from S3



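I get an error (java.io.IOException: No FileSystem for scheme: S3a) when running a Spark application. I have looked at various other questions about this kind of error, but I have not been able to pin down a solution. Spark is version 3.1.2.

The details below have been updated to reflect the current state.

PySpark script: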


import os
#os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.4 pyspark-shell'
from pyspark.sql import SparkSession

# Parentheses let the builder chain span several lines without backslashes.
spark = (
    SparkSession.builder
    .appName("s3reader")
    .getOrCreate()
)
sc = spark.sparkContext
#sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
#sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "xxxxxxx")
#sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "xxxxxxxxxxxx")
#sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint","xxx.x.xxx.x.com", "us-1-east")
#sc._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
df = spark.read.json("S3a://silver/testfolder/4a2426b2-856c-4e9b-b698-b3dcdca74f48")
print(df)

Here are my jar versions:

cloud@spark-dev-master:/usr/local/spark/jars$ ls -ltr *aws*
-rw-rw-r-- 1 cloud cloud 126287 Aug 18  2016 hadoop-aws-2.7.4.jar
-rw-rw-r-- 1 cloud cloud   4479 Sep 17 02:36 aws-java-sdk-1.7.4.jar

Stack trace:

Traceback (most recent call last):
File "/home/cloud/sparks3test.py", line 18, in <module>
df = spark.read.json("S3a://silver/testfolder/4a2426b2-856c-4e9b-b698-b3dcdca74f48")
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 372, in json
File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o33.json.
: java.io.IOException: No FileSystem for scheme: S3a
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:829)

You need to use hadoop-aws version 3.2.0.

You can refer to my previous answer.
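
As an illustration, here is a minimal sketch of how the commented-out PYSPARK_SUBMIT_ARGS line from the question could be regenerated for a given Hadoop version. The helper names `submit_args` and `configure` are hypothetical. The sketch assumes the machine can reach a Maven repository, so that `--packages` can resolve hadoop-aws along with its AWS SDK dependency, and that 3.2.0 is the Hadoop version bundled with the Spark install.

```python
import os


def submit_args(hadoop_version: str) -> str:
    """Build a PYSPARK_SUBMIT_ARGS value that pulls in hadoop-aws.

    hadoop_version should be the Hadoop version the Spark install ships with.
    """
    coordinate = f"org.apache.hadoop:hadoop-aws:{hadoop_version}"
    return f"--packages {coordinate} pyspark-shell"


def configure(hadoop_version: str = "3.2.0") -> None:
    """Set the variable; call this before the SparkSession is created."""
    os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args(hadoop_version)
```

Calling `configure()` at the top of the script, before `SparkSession.builder`, would take the place of the old line that pinned hadoop-aws 2.7.4 and aws-java-sdk 1.7.4.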

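I get an error (java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities)

This is what you see when you mix hadoop-aws and hadoop-common JAR versions. They must match exactly, and the Spark jars must be consistent with them too.

Do not try to work around this in any way other than synchronizing the jars; otherwise you will only move the stack trace somewhere else.

See the Hadoop "Troubleshooting S3A" documentation.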
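
A quick way to spot a mismatch is to compare the versions embedded in the jar file names. The following is only a rough sketch: the helper names are hypothetical, it looks at file names rather than jar manifests, and it assumes the usual `artifact-version.jar` naming seen in the `ls` output above.

```python
import re
from pathlib import Path

# Matches names such as "hadoop-aws-2.7.4.jar" -> ("hadoop-aws", "2.7.4").
JAR_PATTERN = re.compile(r"^(hadoop-[a-z-]+?)-(\d+(?:\.\d+)*)\.jar$")


def hadoop_jar_versions(jar_names):
    """Map each hadoop-* artifact name to the version found in its file name."""
    versions = {}
    for name in jar_names:
        match = JAR_PATTERN.match(name)
        if match:
            versions[match.group(1)] = match.group(2)
    return versions


def aws_matches_common(jar_names):
    """True only if hadoop-aws and hadoop-common are both present and equal."""
    versions = hadoop_jar_versions(jar_names)
    aws = versions.get("hadoop-aws")
    common = versions.get("hadoop-common")
    return aws is not None and aws == common


def jar_names_in(directory):
    """List jar file names in a directory such as /usr/local/spark/jars."""
    return sorted(p.name for p in Path(directory).glob("*.jar"))
```

Running `aws_matches_common(jar_names_in("/usr/local/spark/jars"))` on each master and worker node gives a yes/no answer. A `False` means the jars need to be synchronized before anything else is debugged.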

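Because the jar dependency problems persisted, I did a fresh install of Spark 3.1.2 with Hadoop 3.2.0, and on the master and worker nodes I aligned the hadoop-aws and java-sdk jars with the aws-common jar version. This fixed the file system problem. Upgrading to 3.2.0 also fixed the endpoint problem we had been running into, because path.style.access=true is not supported in Hadoop versions before 2.8.0. That issue is documented here for reference: https://issues.apache.org/jira/browse/HADOOP-12963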
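
Once the jars are aligned, the S3A settings that were commented out in the question can be passed through the session builder instead of `sc._jsc.hadoopConfiguration()`. This is a minimal sketch, not the exact configuration used above: the helper names, the endpoint and the credentials are placeholders, and it relies on Spark copying any `spark.hadoop.*` option into the Hadoop configuration.

```python
def s3a_options(endpoint, access_key, secret_key, path_style=True):
    """Return Spark options that configure the S3A connector.

    All values are caller-supplied; nothing here is specific to one cluster.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Hadoop expects the strings "true"/"false"; needs Hadoop 2.8.0 or later.
        "spark.hadoop.fs.s3a.path.style.access": str(path_style).lower(),
    }


def build_session(options, app_name="s3reader"):
    """Create a SparkSession with the given options applied.

    pyspark is imported lazily so s3a_options can be used without it.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

For example, `build_session(s3a_options("https://s3.example.invalid", "ACCESS_KEY", "SECRET_KEY"))` would return a session whose `spark.read.json("s3a://...")` calls go to the custom endpoint with path-style addressing.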
