Running Spark on YARN Client



I recently set up a multi-node Hadoop HA (NameNode & ResourceManager) cluster (3 nodes). The installation is complete and all daemons run as expected.

Daemons on NN1:

2945 JournalNode
3137 DFSZKFailoverController
6385 Jps
3338 NodeManager
22730 QuorumPeerMain
2747 DataNode
3228 ResourceManager
2636 NameNode

Daemons on NN2:

19620 Jps
3894 QuorumPeerMain
16966 ResourceManager
16808 NodeManager
16475 DataNode
16572 JournalNode
17101 NameNode
16702 DFSZKFailoverController

Daemons on DN1:

12228 QuorumPeerMain
29060 NodeManager
28858 DataNode
29644 Jps
28956 JournalNode

I am interested in running Spark jobs on my YARN setup. I have installed Scala and Spark on NN1, and I can successfully start Spark by issuing the following command

$ spark-shell

Now, I know nothing about Spark and would like to know how to run it on YARN. I have read that we can run it as yarn-client or yarn-cluster.

Should I install Spark & Scala on all nodes in the cluster (NN2 & DN1) in order to run Spark on YARN in client or cluster mode? If not, how can I submit a Spark job from the NN1 (primary NameNode) host?

As suggested in a blog post I read, I have copied the Spark assembly JAR into HDFS

-rw-r--r--   3 hduser supergroup  187548272 2016-04-04 15:56 /user/spark/share/lib/spark-assembly.jar

I have also created a SPARK_JAR variable in my .bashrc file. I tried to submit the Spark job in yarn-client mode, but it ended with the error below, and I don't know whether I am doing this correctly or need to do some other settings first.

[hduser@ptfhadoop01v spark-1.6.0]$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 2 --queue thequeue lib/spark-examples*.jar 10
16/04/04 17:27:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/04 17:27:51 WARN SparkConf:
SPARK_WORKER_INSTANCES was detected (set to '2').
This is deprecated in Spark 1.0+.
Please instead use:
 - ./spark-submit with --num-executors to specify the number of executors
 - Or set SPARK_EXECUTOR_INSTANCES
 - spark.executor.instances to configure the number of instances in the spark config.
16/04/04 17:27:54 WARN Client: SPARK_JAR detected in the system environment. This variable has been deprecated in favor of the spark.yarn.jar configuration variable.
16/04/04 17:27:54 WARN Client: SPARK_JAR detected in the system environment. This variable has been deprecated in favor of the spark.yarn.jar configuration variable.
16/04/04 17:27:57 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
    at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:29)
    at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/04/04 17:27:58 WARN MetricsSystem: Stopping a MetricsSystem that is not running
Exception in thread "main" org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
    at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:29)
    at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[hduser@ptfhadoop01v spark-1.6.0]$

Please help me resolve this, and explain how to run Spark on YARN in client or cluster mode.

Now, I know nothing about Spark and would like to know how to run it on YARN. I have read that we can run it as yarn-client or yarn-cluster.

I strongly recommend reading the official documentation for Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html

You can use spark-shell --master yarn to connect to YARN. You need the proper configuration files, e.g. yarn-site.xml, on the machine where you execute spark-shell.
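
For example, a minimal sketch that assumes the Hadoop configuration files live under /etc/hadoop/conf (adjust the path to your installation):

# directory that contains yarn-site.xml, core-site.xml and hdfs-site.xml
export HADOOP_CONF_DIR=/etc/hadoop/conf
spark-shell --master yarn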

Should I install Spark & Scala on all nodes in the cluster (NN2 & DN1) in order to run Spark on YARN in client or cluster mode?

No. You don't have to install anything on YARN, since Spark will distribute the necessary files for you.

If not, how can I submit a Spark job from the NN1 (primary NameNode) host?

Start with spark-shell --master yarn and see if you can execute the following code:

(0 to 5).toDF.show

If you see a table-like output, you're done. Otherwise, please post the error.
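
For reference, the table should look roughly like this (the exact column name and behavior can vary between Spark versions):

+-----+
|value|
+-----+
|    0|
|    1|
|    2|
|    3|
|    4|
|    5|
+-----+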

I have also created a SPARK_JAR variable in my .bashrc file. I tried to submit the Spark job in yarn-client mode, but it ended with the error below, and I don't know whether I am doing this correctly or need to do some other settings first.

Remove the SPARK_JAR variable. Don't use it, as it's not needed and may cause trouble. Read the official documentation at http://spark.apache.org/docs/latest/running-on-yarn.html to understand the basics of Spark on YARN and beyond.
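
If you still want Spark to reuse the assembly you uploaded to HDFS, the supported mechanism in Spark 1.6 is the spark.yarn.jar property (the very one the warnings in your log point to) rather than the SPARK_JAR variable. A sketch, reusing the HDFS path from the question:

# conf/spark-defaults.conf (or pass via --conf spark.yarn.jar=... on spark-submit);
# the hdfs:/// scheme assumes the path resolves against your default filesystem
spark.yarn.jar hdfs:///user/spark/share/lib/spark-assembly.jar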

Adding this property to hdfs-site.xml solved the problem (the .mycluster suffix must match the HA nameservice ID defined in dfs.nameservices)

<property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
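
This property tells HDFS clients how to resolve the logical HA nameservice into an active NameNode. As a quick sanity check before re-submitting, you can try listing the assembly path from the question through the nameservice URI (assuming your nameservice ID really is mycluster):

# should succeed once the failover proxy provider is configured
hdfs dfs -ls hdfs://mycluster/user/spark/share/lib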

In client mode, you can run a simple word-count example like the one below

spark-submit --class org.sparkexample.WordCount --master yarn-client wordcount-sample-plain-1.0-SNAPSHOT.jar input.txt output.txt
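
Note that yarn-client as a master URL is the older syntax (deprecated in later Spark releases). The equivalent form, which also makes the switch to cluster mode explicit, looks like this:

# client mode: the driver runs on the submitting host
spark-submit --class org.sparkexample.WordCount --master yarn --deploy-mode client wordcount-sample-plain-1.0-SNAPSHOT.jar input.txt output.txt

# cluster mode: the driver runs inside the YARN application master
spark-submit --class org.sparkexample.WordCount --master yarn --deploy-mode cluster wordcount-sample-plain-1.0-SNAPSHOT.jar input.txt output.txt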

I think your spark-submit command is wrong; the master is not set to yarn. I strongly recommend using an automated provisioning tool to set up the cluster quickly rather than the manual approach.

Have a look at the Cloudera or Hortonworks tools. With them you can set everything up in no time and submit jobs easily, without doing all of this configuration manually.

Reference: https://hortonworks.com/products/hdp/
