com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found in PySpark script on AWS EMR



I am trying to create an EMR cluster with the AWS CLI to run a Python script (using PySpark), as follows:

aws emr create-cluster --name "emr cluster for pyspark (test)" \
  --applications Name=Spark Name=Hadoop \
  --release-label emr-5.25.0 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.xlarge \
  --bootstrap-actions Path="s3://mybucket/my_bootstrap.sh" \
  --steps Type=CUSTOM_JAR,Name="Spark Count group by QRACE",ActionOnFailure=CONTINUE,Jar=s3://us-east-2.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/my_step.py","s3://mybucket/my_input.txt","s3://mybucket/output"] \
  --log-uri "s3://mybucket/logs"

The bootstrap script sets up Python 3.7, installs PySpark (2.4.3), and installs Java 8. However, my script fails with the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o32.csv.
: java.lang.RuntimeException: 
java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
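
For context, the o32.csv call in the traceback is a DataFrameReader/DataFrameWriter .csv() call on an s3:// path, which goes through EMRFS. Below is only a minimal sketch of what a step script like my_step.py does; the QRACE column name is an assumption taken from the step name, not the actual script:

# Minimal sketch of a PySpark step that reads a CSV-like file from S3 and
# writes grouped counts back to S3. The QRACE column name is assumed from the
# step name "Spark Count group by QRACE"; the real my_step.py may differ.
import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    input_path, output_path = sys.argv[1], sys.argv[2]

    spark = SparkSession.builder.appName("Spark Count group by QRACE").getOrCreate()

    # Reading/writing an s3:// URI is what triggers the EMRFS lookup -- this is
    # the .csv() call that fails with ClassNotFoundException when
    # com.amazon.ws.emr.hadoop.fs.EmrFileSystem is not on the classpath.
    df = spark.read.csv(input_path, header=True)
    df.groupBy("QRACE").count().write.csv(output_path)

    spark.stop()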

I have also tried adding a --configurations argument to the create-cluster command with the following JSON file (it didn't help):

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*",
      "spark.driver.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*"
    }
  }
]
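
As a quick diagnostic, the following sketch (the app name is mine, not part of the original setup) can be run as a step to check whether the EMRFS class is actually visible to the JVM that the PySpark driver runs in:

# Diagnostic sketch: ask the driver JVM whether EmrFileSystem can be loaded.
# Uses the py4j JVM view exposed by the SparkContext.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emrfs-classpath-check").getOrCreate()
jvm = spark.sparkContext._jvm
try:
    jvm.java.lang.Class.forName("com.amazon.ws.emr.hadoop.fs.EmrFileSystem")
    print("EmrFileSystem is on the driver classpath")
except Exception as exc:
    print("EmrFileSystem NOT found on the driver classpath:", exc)
spark.stop()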

Any suggestions on where to look or what else I could try would be very helpful!

Edit: I solved that issue by following @Lamanus's suggestion. However, my PySpark application seems to run fine on EMR 5.30.1 but not on EMR 5.25.0.

I now get the following error:

Exception in thread "main" org.apache.spark.SparkException: Application application_1596402225924_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1148)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1525)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I don't know where to find useful error reports/logs to track down the failure. The same application works perfectly with EMR 5.30.1 and Spark 2.4.5.

Update: This happened because the bootstrap script installed PySpark even though the cluster already provides it.

I can't upvote @chittychitty's latest answer, but it is correct! Do not install PySpark over the one provided by EMR.
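
A quick sanity check (a sketch, assuming the standard EMR layout) to confirm a step is picking up the EMR-provided PySpark rather than a pip-installed copy from the bootstrap action:

# Sanity-check sketch: report which PySpark installation and version the step
# is actually using. On EMR the bundled copy normally lives under
# /usr/lib/spark/python/, and the version should match the release
# (Spark 2.4.3 on emr-5.25.0, Spark 2.4.5 on emr-5.30.1).
import pyspark

print("pyspark module location:", pyspark.__file__)
print("pyspark version:", pyspark.__version__)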
