Adding a jar driver to EMR 6.7.0 Spark



I am trying to connect from an EMR cluster to an AWS Redis cluster. I uploaded the jar driver to S3 and used this bootstrap action to copy the jar file onto the cluster nodes:

aws s3 cp s3://sparkbucket/spark-redis-2.3.0.jar /home/hadoop/spark-redis-2.3.0.jar
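For context, an EMR bootstrap action is itself a script stored in S3 that EMR runs on every node at startup, so the copy command above would typically live in a small script like the following (a sketch; the script name `copy-redis-jar.sh` is hypothetical):

```shell
#!/bin/bash
# Hypothetical bootstrap script, e.g. uploaded as s3://sparkbucket/copy-redis-jar.sh
# and registered as a bootstrap action when the cluster is created.
# EMR executes it on each node, so the jar ends up on every node's local disk.
set -euo pipefail

aws s3 cp s3://sparkbucket/spark-redis-2.3.0.jar /home/hadoop/spark-redis-2.3.0.jar
```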

Here is my Spark application for testing the connection:

import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = (
        SparkSession.builder
        .config("spark.redis.host", "testredis-0013.vb4vgr.00341.eu1.cache.amazonaws.com")
        .config("spark.redis.port", "6379")
        .appName("Redis_test")
        .getOrCreate()
    )
    df = (
        spark.read.format("org.apache.spark.sql.redis")
        .option("key.column", "key")
        .option("keys.pattern", "*")
        .load()
    )
    df.write.csv(path="s3://sparkbucket/", sep=",")

    spark.stop()

Submitted with spark-submit:

spark-submit --deploy-mode cluster --driver-class-path /home/hadoop/spark-redis-2.3.0.jar s3://sparkbucket/testredis.py

I get the following error and am not sure what I am doing wrong:

ERROR Client: Application diagnostics message: User application exited with status 1 Exception in thread "main" org.apache.spark.SparkException: Application application_1658168513779_0001 finished with failed status

Using similar test code, I ran it successfully by uploading the spark-redis jar to S3 and passing it with the `--jars` argument, like this:

spark-submit --deploy-mode cluster --jars s3://<bucket/path>/spark-redis_2.12-3.1.0-SNAPSHOT-jar-with-dependencies.jar s3://<bucket/path>/redis_test.py
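If keeping the jar in S3 is not a requirement, the same dependency can alternatively be pulled from Maven Central with `--packages` instead of `--jars` (a sketch; verify that the coordinates and version match your Scala and Spark versions before using):

```shell
# Spark resolves the artifact and its transitive dependencies from Maven
# and ships them to the driver and executors in cluster mode.
spark-submit --deploy-mode cluster \
  --packages com.redislabs:spark-redis_2.12:3.1.0 \
  s3://sparkbucket/redis_test.py
```

Unlike `--driver-class-path`, both `--jars` and `--packages` distribute the dependency to all executors rather than only to the driver's classpath.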

The detailed logs for the run can be viewed in the Spark History Server. You can reach them in the EMR web console by following this sequence of links:

Summary → Spark History Server → application_xxx_xxx → Executors → (driver) stdout

You may get a NoSuchKey error because it takes some time for the logs to become available; just reload the page.