I am trying to enable history logging with a Spark Oozie action, following these instructions: https://archive.cloudera.com/cdh5/cdh/5/oozie/DG_SparkActionExtension.html
To ensure that your Spark job shows up in the Spark History Server, specify these three Spark configuration properties either in the Spark opts with --conf, or from oozie.service.SparkConfigurationService.spark.configurations:
- spark.yarn.historyServer.address=http://SPH-HOST:18088
- spark.eventLog.dir=hdfs://NN:8020/user/spark/applicationHistory
- spark.eventLog.enabled=true
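For the second route mentioned above, the service-level property is set in oozie-site.xml on the Oozie server, where it maps a resource manager authority to a Spark configuration directory (whose spark-defaults.conf would then carry the three properties). A minimal sketch, not verified against this cluster; the directory path and the "*" wildcard are assumptions based on the default layout:

```xml
<!-- oozie-site.xml (sketch): "*" matches any resource manager;
     /etc/spark/conf is assumed to contain a spark-defaults.conf
     with the three history-server properties listed above. -->
<property>
    <name>oozie.service.SparkConfigurationService.spark.configurations</name>
    <value>*=/etc/spark/conf</value>
</property>
```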
The workflow definition is as follows:
<action name="spark-9e7c">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn-cluster</master>
        <mode>cluster</mode>
        <name>Correlation Engine</name>
        <class>Main Class</class>
        <jar>hdfs://<MACHINE IP>:8020/USER JAR</jar>
        <spark-opts> --conf spark.eventLog.dir=<MACHINE IP>:8020/user/spark/applicationHistory --conf spark.eventLog.enabled=true --conf spark.yarn.historyServer.address=<MACHINE IP>:18088/</spark-opts>
    </spark>
    <ok to="email-f5d5"/>
    <error to="email-a687"/>
</action>
When I test from a shell script, the history logs are recorded correctly, but with the Oozie action they are not, even though all three properties are set.
Based on my experience, I believe you are passing the arguments in the wrong place.
Please refer to the XML snippet below:
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns='uri:oozie:workflow:0.4' name='sparkjob'>
    <start to='spark-process' />
    <action name='spark-process'>
        <spark xmlns='uri:oozie:spark-action:0.1'>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>oozie.service.SparkConfigurationService.spark.configurations</name>
                    <value>spark.eventLog.dir=hdfs://node1.analytics.sub:8020/user/spark/applicationHistory,spark.yarn.historyServer.address=http://node1.analytics.sub:18088,spark.eventLog.enabled=true</value>
                </property>
                <!--property>
                    <name>oozie.hive.defaults</name>
                    <value>/user/ambari-qa/sparkActionPython/hive-config.xml</value>
                </property-->
                <!--property>
                    <name>oozie.use.system.libpath</name>
                    <value>true</value>
                </property-->
                <property>
                    <name>oozie.service.WorkflowAppService.system.libpath</name>
                    <value>/user/oozie/share/lib/lib_20150831190253/spark</value>
                </property>
            </configuration>
            <master>yarn-client</master>
            <!--master>local[4]</master-->
            <mode>client</mode>
            <name>wordcount</name>
            <jar>/usr/hdp/current/spark-client/AnalyticsJar/wordcount.py</jar>
            <spark-opts>--executor-memory 1G --driver-memory 1G --executor-cores 4 --num-executors 2 --jars /usr/hdp/current/spark-client/lib/spark-assembly-1.3.1.2.3.0.0-2557-hadoop2.7.1.2.3.0.0-2557.jar</spark-opts>
        </spark>
        <ok to='end'/>
        <error to='spark-fail'/>
    </action>
    <kill name='spark-fail'>
        <message>Spark job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name='end' />
</workflow-app>
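For completeness, the ${jobTracker} and ${nameNode} parameters referenced in the workflow are supplied through a job.properties file at submission time. A minimal sketch; the host names, ports, and application path here are placeholders, not values from the cluster above:

```properties
# job.properties (sketch) -- all host names and paths are placeholders
nameNode=hdfs://node1.analytics.sub:8020
jobTracker=node1.analytics.sub:8032
oozie.wf.application.path=${nameNode}/user/oozie/apps/sparkjob
oozie.use.system.libpath=true
```

The workflow would then be submitted with the standard Oozie CLI, along the lines of `oozie job -oozie http://<oozie-host>:11000/oozie -config job.properties -run`.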