Sharing a config file with spark-submit in cluster mode

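During development I have been running my Spark jobs in "client" mode. I use "--files" to share a config file with the executors, and the driver reads the config file locally. Now I want to deploy the job in "cluster" mode, and I am having difficulty sharing the config file with the driver.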

For example, I pass the config file name to the driver and the executors via extraJavaOptions, and I read the file using SparkFiles.get():

  val configFile = org.apache.spark.SparkFiles.get(System.getProperty("config.file.name"))

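This works fine on the executors but fails on the driver. I think the files are shared only with the executors and not with the container where the driver runs. One option is to keep the config file in S3. I wanted to check whether this can be achieved using spark-submit.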

> spark-submit --deploy-mode cluster --master yarn --driver-cores 2 \
>   --driver-memory 4g --num-executors 4 --executor-cores 4 --executor-memory 10g \
>   --files /home/hadoop/Streaming.conf,/home/hadoop/log4j.properties \
>   --conf spark.driver.extraJavaOptions="-Dlog4j.configuration=log4j.properties -Dconfig.file.name=Streaming.conf" \
>   --conf spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j.properties -Dconfig.file.name=Streaming.conf" \
>   --class ....

You can try the --properties-file option of the spark-submit command.

For example, the properties file contents:

spark.key1=value1
spark.key2=value2

All keys must be prefixed with "spark.".

Then pass the properties file to the spark-submit command:

bin/spark-submit --properties-file  propertiesfile.properties

Then, in your code, you can retrieve the values with the SparkContext getConf method:

sc.getConf.get("spark.key1")  // returns value1

Once you have the value, you can use it anywhere.
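
Because the keys have to carry the "spark." prefix, it can help to check the file before submitting it. The following is a minimal Python sketch and not part of Spark itself; the helper name `non_spark_keys` and its simple key/value parsing are assumptions made for illustration, and it does not cover every properties-file syntax.

```python
def non_spark_keys(text):
    """Return the keys in properties-style text that lack the 'spark.' prefix."""
    bad = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: key and value are separated by '=' or by whitespace.
        key = line.replace("=", " ", 1).split(None, 1)[0]
        if not key.startswith("spark."):
            bad.append(key)
    return bad


sample = "spark.key1=value1\nspark.key2=value2\nconfig.file.name=Streaming.conf\n"
print(non_spark_keys(sample))
```

In this sample, config.file.name is reported, so it would need a "spark."-prefixed name of your choosing before it could be read back through getConf.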

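I found a solution to this problem in this thread.

You can give the files you submit via --files an alias by appending "#alias" to the end. With this trick, you should be able to access the files through their alias.

For example, the following code runs without errors: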

spark-submit --master yarn-cluster --files test.conf#testFile.conf test.py

with test.py as:

path_f = 'testFile.conf'
try:
    f = open(path_f, 'r')
except:
    raise Exception('File not opened', 'EEEEEEE!')

and an empty test.conf.
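
Building on the alias approach, here is a hedged sketch of a helper that looks for a shipped file in the current working directory first and only then falls back to SparkFiles.get(). It assumes that on YARN the files passed with --files are placed in the container's working directory under their alias; the function name `resolve_shipped_file` is hypothetical, and the fallback is only attempted if pyspark can be imported.

```python
import os


def resolve_shipped_file(name):
    """Return a path to a file shipped with --files, or raise IOError."""
    # Assumption: on YARN the file (or its '#alias') is localized into the
    # container's current working directory.
    local = os.path.join(os.getcwd(), name)
    if os.path.isfile(local):
        return local
    try:
        # Fallback for setups where SparkFiles knows about the file.
        from pyspark import SparkFiles
        candidate = SparkFiles.get(name)
    except Exception:
        # pyspark is missing or no SparkContext is active.
        candidate = None
    if candidate and os.path.isfile(candidate):
        return candidate
    raise IOError("shipped file not found: %s" % name)
```

In test.py, the bare open(path_f, 'r') call could then become open(resolve_shipped_file('testFile.conf'), 'r'), so the same script works whether the file is found by alias or through SparkFiles.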
