Using wildcards in spark.jars when submitting an Apache Spark job

I have a set of JARs, stored on HDFS, that I want to make available to my Spark job.

The Spark 2.3 documentation lists spark.jars as the parameter for this:

spark.jars: Comma-separated list of jars to include on the driver and executor classpaths. Globs are allowed.

However, setting spark.jars to hdfs:///path/to/my/libs/*.jar fails: the driver starts fine and launches a stage, but the tasks then die with:

WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, xxxx, executor 1): java.io.FileNotFoundException: File hdfs:/path/to/my/libs/*.jar does not exist.
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:901)
    at org.apache.spark.util.Utils$.fetchHcfsFile(Utils.scala:724)
    at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:692)
    at org.apache.spark.util.Utils$.fetchFile(Utils.scala:472)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:755)
    ...

In other words, the glob does not seem to be expanded when the path is resolved on the executors.

Explicitly setting spark.jars to hdfs:///path/to/my/libs/libA.jar,hdfs:///path/to/my/libs/libB.jar does work.
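Since the explicit comma-separated list works, it can be generated instead of typed by hand. The following is only a minimal sketch: join_jars is a hypothetical helper (not part of Spark or Hadoop), and the usage shown in the comments assumes an HDFS client whose `hdfs dfs -ls` supports the `-C` (paths only) flag, which older Hadoop releases lack.

```shell
# join_jars: hypothetical helper, not part of Spark or Hadoop.
# Reads one path per line on stdin, keeps only those ending in .jar,
# and prints them joined into a single comma-separated line.
join_jars() {
  grep '\.jar$' | paste -sd, -
}

# Assumed usage (needs an HDFS client on the PATH; not run here):
#   JARS="$(hdfs dfs -ls -C hdfs:///path/to/my/libs | join_jars)"
#   spark-submit --conf "spark.jars=${JARS}" ...
```

It is worth checking that the listed paths carry the hdfs:// scheme before handing them to spark.jars, since the exact form printed by the listing may depend on the Hadoop version and on how the directory was specified.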

How can I use a glob in spark.jars, as the documentation says is possible?

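I run all of my Spark batch and streaming applications from the local file system, so I'm not sure why the jars need to be stored on HDFS.

That said, if you are willing to keep the jars on the local file system, you can use a wildcard like this: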

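# Note: the shell expands *.jar before spark-submit runs, so this works as
# intended only when BASE_DIR holds a single jar; with several, spark-submit
# takes the first as the application jar and passes the rest to it as arguments.
export BASE_DIR="/local/file/path/where/jar/is/available"
spark-submit \
  --class "${class}" \
  --deploy-mode cluster \
  --master yarn \
  ... \
  --name "${APPLICATION_NAME}" \
  "${BASE_DIR}"/*.jar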
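
If the local directory holds several dependency jars, a shell glob yields them as separate space-delimited arguments, whereas spark-submit's --jars option expects one comma-separated value. A minimal sketch of bridging the two follows; jars_csv is a hypothetical helper name, and the file name in the usage comment is a placeholder, not something from the setup above.

```shell
# jars_csv: hypothetical helper. Prints every *.jar in the given directory
# as one comma-separated string; returns 1 if the directory has no jars.
jars_csv() {
  dir="$1"
  # Let the shell expand the glob into the function's positional parameters.
  set -- "$dir"/*.jar
  # If nothing matched, the pattern is left unexpanded and names no real file.
  [ -e "$1" ] || return 1
  # "$*" joins the parameters using the first character of IFS.
  (IFS=,; printf '%s\n' "$*")
}

# Assumed usage (my-app.jar is a placeholder for the application jar):
#   spark-submit ... --jars "$(jars_csv "${BASE_DIR}/libs")" "${BASE_DIR}/my-app.jar"
```

Keeping the application jar separate from the dependency directory avoids it being listed twice.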

Hope this helps.
