I'm trying to launch a program that creates a SparkContext on YARN. Here is my simple program:
import org.apache.spark.{SparkConf, SparkContext}

object Entry extends App {
  System.setProperty("SPARK_YARN_MODE", "true")

  val sparkConfig = new SparkConf()
    .setAppName("test-connection")
    .setMaster("yarn-client")

  val sparkContext = new SparkContext(sparkConfig)

  val numbersRDD = sparkContext.parallelize(List(1, 2, 3, 4, 5))

  println {
    s"result is ${numbersRDD.reduce(_ + _)}"
  }
}
build.sbt
scalaVersion := "2.10.5"
libraryDependencies ++= {
val sparkV = "1.6.0"
Seq(
"org.apache.spark" %% "spark-core" % sparkV,
"org.apache.spark" %% "spark-yarn" % sparkV,
)
}
I'm running this program with sbt run inside the master node of a Google Cloud Dataproc cluster. These are the logs:
16/03/09 08:38:31 INFO YarnClientImpl: Submitted application application_1457497836188_0013 to ResourceManager at /0.0.0.0:8032
16/03/09 08:38:32 INFO Client: Application report for application_1457497836188_0013 (state: ACCEPTED)
16/03/09 08:38:32 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1457512711191
final status: UNDEFINED
tracking URL: http://recommendation-cluster-m:8088/proxy/application_1457497836188_0013/
user: ibosz
16/03/09 08:38:33 INFO Client: Application report for application_1457497836188_0013 (state: ACCEPTED)
16/03/09 08:38:34 INFO Client: Application report for application_1457497836188_0013 (state: ACCEPTED)
16/03/09 08:38:35 INFO Client: Application report for application_1457497836188_0013 (state: FAILED)
16/03/09 08:38:35 INFO Client:
client token: N/A
diagnostics: Application application_1457497836188_0013 failed 2 times due to AM Container for appattempt_1457497836188_0013_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://recommendation-cluster-m:8088/cluster/app/application_1457497836188_0013Then, click on links to logs of each attempt.
Diagnostics: java.io.FileNotFoundException: File file:/home/ibosz/.ivy2/cache/org.apache.spark/spark-yarn_2.10/jars/spark-yarn_2.10-1.6.0.jar does not exist
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1457512711191
final status: FAILED
tracking URL: http://recommendation-cluster-m:8088/cluster/app/application_1457497836188_0013
user: ibosz
16/03/09 08:38:35 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
at Entry$delayedInit$body.apply(Entry.scala:13)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at Entry$.main(Entry.scala:6)
at Entry.main(Entry.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sbt.Run.invokeMain(Run.scala:67)
at sbt.Run.run0(Run.scala:61)
at sbt.Run.sbt$Run$$execute$1(Run.scala:51)
at sbt.Run$$anonfun$run$1.apply$mcV$sp(Run.scala:55)
at sbt.Run$$anonfun$run$1.apply(Run.scala:55)
at sbt.Run$$anonfun$run$1.apply(Run.scala:55)
at sbt.Logger$$anon$4.apply(Logger.scala:85)
at sbt.TrapExit$App.run(TrapExit.scala:248)
at java.lang.Thread.run(Thread.java:745)
It says

java.io.FileNotFoundException: File file:/home/ibosz/.ivy2/cache/org.apache.spark/spark-yarn_2.10/jars/spark-yarn_2.10-1.6.0.jar does not exist

but the file does exist. And running spark-shell --master yarn-client works without any problem. What is wrong with my code?
While there may be a way to coerce sbt run into performing a true yarn-client mode Spark submission correctly, you probably just want to do this instead:
sbt package
spark-submit target/scala-2.10/*SNAPSHOT.jar
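For reference, a minimal build.sbt for this workflow might look like the sketch below (an assumption, not part of the original question): you only need to compile against spark-core, because the cluster's own Spark installation supplies everything at runtime, and the "provided" scope mainly matters if you later build a fat jar with sbt-assembly, since plain sbt package never bundles dependencies anyway.

scalaVersion := "2.10.5"

// Compile against Spark, but let the cluster's installation supply it at runtime;
// the spark-yarn module is not needed at all when launching through spark-submit.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"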
Essentially, the error you're hitting is that when the SparkContext is created, it asks YARN for a remote container to hold the AppMaster process, which will live on one of your worker nodes. It passes along aspects of the master node's local environment, including the sbt-specific copies of the Spark jars used in your build (under the ~/.ivy2/cache/ directory). The workers' environments don't match the one you ran sbt run in, which is why it fails.
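To illustrate that mechanism (only a sketch of the "coercion" mentioned above, not a recommendation over spark-submit): in Spark 1.x the jar shipped to the YARN containers is controlled by spark.yarn.jar, and judging from the FileNotFoundException above, a driver launched from sbt run defaults it to the spark-yarn jar in your local ivy cache, which the workers cannot read. Overriding it with a location every node can access, for example an assembly you have uploaded to HDFS yourself (the path below is hypothetical), would look roughly like this:

val sparkConfig = new SparkConf()
  .setAppName("test-connection")
  .setMaster("yarn-client")
  // Hypothetical location: a Spark assembly uploaded to a path readable from every node,
  // so the YARN containers no longer depend on the driver's ~/.ivy2 cache.
  .set("spark.yarn.jar", "hdfs:///user/ibosz/spark-assembly-1.6.0-hadoop2.6.0.jar")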
Note that the spark-submit command itself is just a bash script whose entire purpose is to run your jarfile with all the right environment variables and classpath configuration, so anything that made sbt run work would essentially be duplicating the logic of the spark-submit script, and probably doing so in a non-portable way.
The upside of all this is that using spark-submit foo.jar keeps your invocation nice and portable; for example, once you want to productionize your job, you can use Dataproc's job-submission API on the same jarfile, just as you would with spark-submit: gcloud dataproc jobs submit spark --jar foo.jar <your_job_args>. You can even submit those Spark jarfiles through Dataproc's web GUI by first uploading the jarfile to GCS and then specifying its gs:// path.
Similarly, if you set up Spark locally just by unpacking a standard Spark binary distribution, you can still use spark-submit even if sbt isn't installed in that local Spark setup.