Spark streaming job fails after being stopped by driver

I have a Spark Streaming job that reads data from Kafka and performs some operations on it. I am running the job on a YARN cluster with Spark 1.4.1; the cluster has two nodes, each with 16 GB of RAM and 16 cores.

I pass these configurations to the spark-submit job:

--master yarn-cluster --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 3
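For context, the full submit command would look roughly like this (the main class and jar file below are placeholders, not taken from the question):

spark-submit --master yarn-cluster --num-executors 3 \
  --driver-memory 4g --executor-memory 2g --executor-cores 3 \
  --class com.example.StreamingJob streaming-job.jar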

The job returns this error and finishes after running for a short time:

INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 11,
(reason: Max number of executor failures reached)
.....
ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0:
Stopped by driver

Update:

These logs were also found:

INFO yarn.YarnAllocator: Received 3 containers from YARN, launching executors on 3 of them.....
INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down.
....
INFO yarn.YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.
INFO yarn.ExecutorRunnable: Starting Executor Container.....
INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down...
INFO yarn.YarnAllocator: Completed container container_e10_1453801197604_0104_01_000006 (state: COMPLETE, exit status: 1)
INFO yarn.YarnAllocator: Container marked as failed: container_e10_1453801197604_0104_01_000006. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_e10_1453801197604_0104_01_000006
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
    at org.apache.hadoop.util.Shell.run(Shell.java:487)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1

What could be the reason for this? Any help would be appreciated.

Thanks

Could you show the Scala/Java code that you use to read from Kafka? I suspect you may not be creating your SparkConf correctly.

Try something like this:

SparkConf sparkConf = new SparkConf().setAppName("ApplicationName");

Also try running the application in yarn-client mode and share the output.
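For reference, here is a minimal sketch of a Java Spark Streaming job that reads from Kafka through a receiver with a correctly constructed SparkConf. The application name, ZooKeeper quorum, consumer group, topic name, and batch interval are placeholders, not taken from the question:

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class KafkaStreamingJob {
    public static void main(String[] args) {
        // Do not call setMaster() here; the master is supplied by spark-submit.
        SparkConf sparkConf = new SparkConf().setAppName("ApplicationName");
        JavaStreamingContext jssc =
            new JavaStreamingContext(sparkConf, Durations.seconds(10));

        // Placeholder ZooKeeper quorum, consumer group, and topic;
        // replace them with your own values.
        Map<String, Integer> topics = new HashMap<String, Integer>();
        topics.put("my-topic", 1);  // topic name -> number of receiver threads
        JavaPairReceiverInputDStream<String, String> messages =
            KafkaUtils.createStream(jssc, "zkhost:2181", "consumer-group", topics);

        messages.print();

        jssc.start();
        jssc.awaitTermination();
    }
}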

I ran into the same problem. I found one solution: remove the sparkContext.stop() at the end of the main function and leave the stop action to the GC.

The Spark team has resolved the issue in Spark core; however, so far the fix has only landed on the master branch. We need to wait until the fix is released in a new version.

https://issues.apache.org/jira/browse/SPARK-12009
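To illustrate the workaround, a sketch of how the end of the streaming main() would look with the explicit stop removed (jssc is assumed to be the streaming context created earlier):

jssc.start();
jssc.awaitTermination();
// Workaround for SPARK-12009: do not call jssc.stop() or
// sparkContext.stop() here; let JVM shutdown handle the cleanup.
// Before the workaround, the last line of main() was something like:
//     jssc.sparkContext().stop();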
