尝试为检查点作业启动 flink 作业主机超时



我正在尝试设置 flink 以从检查点恢复。在大多数情况下,这似乎有效,但是在将其部署到我们的暂存环境大约一周后,作业管理器由于尝试启动作业的"作业主控"时超时而开始崩溃循环。

我正在使用以高可用性模式部署的 flink 1.7.2 和 zookeeper 3.4.9-1757313 只是为了方便检查点恢复。我在 kubernetes 上只有一个作业管理器部署为有状态集。一定是某些原因导致服务器崩溃,并且在重新启动时,启动(大概)恢复作业的主作业的代码似乎失败了。

我以前见过一次并清除所有 flink zookeeper 条目(rmr /flink在 zk cli 中),然后重新启动 flink 集群"修复"了这个问题。

这是 flink 配置

blob.server.port: 6124
blob.storage.directory: s3://...
high-availability: zookeeper
high-availability.zookeeper.quorum: zookeeper:2181
high-availability.zookeeper.path.root: /flink
high-availability.storageDir: s3://...
high-availability.jobmanager.port: 6070
jobmanager.archive.fs.dir: s3://...
state.backend: rocksdb
state.backend.fs.checkpointdir: s3://...
state.checkpoints.dir: s3://...
state.checkpoints.num-retained: 2
web.log.path: /var/log/flink.log
web.upload.dir: /var/flink-recovery/flink-web-upload
zookeeper.sasl.disable: true
s3.access-key: __S3_ACCESS_KEY_ID__
s3.secret-key: __S3_SECRET_KEY__

以下是 flink-jobmaster 有状态集上的容器端口:

ports:
- containerPort: 8081
name: ui
- containerPort: 6123
name: rpc
- containerPort: 6124
name: blob
- containerPort: 6125
name: query
- containerPort: 9249
name: prometheus
- containerPort: 6070
name: ha

我希望 flink 能够从 s3 中的检查点成功恢复,但作业管理器在启动时崩溃,并带有以下堆栈跟踪:

2019-06-18 14:02:05,123 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error occurred in the cluster entrypoint.
org.apache.flink.util.FlinkException: JobMaster for job f13131ca883d6cf92f69a52cff4f1017 failed.
at org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:759)
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$startJobManagerRunner$6(Dispatcher.java:339)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.util.FlinkException: Could not start the job manager.
at org.apache.flink.runtime.jobmaster.JobManagerRunner.lambda$verifyJobSchedulingStatusAndStartJobManager$2(JobManagerRunner.java:340)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/jobmanager_2#-806528277]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.UnfencedMessage".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
... 1 more

我在这里真的很茫然。我对 flink 的内部工作原理了解不多,所以这个例外对我来说意义不大。任何线索将不胜感激。

编辑:我一直在浏览Flink源代码。在领导者当选后,当它尝试从存储在 zookeeper 中的检查点信息恢复其作业图时,会引发此异常。深入了解此异常的确切来源是相当麻烦的,因为它都包含在期货和 akka 中。我的猜测是,这是在作业管理器启动 JobMaster 子流程以安排作业图之后发生的。有点猜测,但我认为作业管理器正试图从其 JobMaster 获取新作业的状态,但 JobMaster 线程已进入死锁(也许它也可能已经死亡,尽管我当时期望堆栈跟踪),因此询问超时。似乎真的是打瞌睡。

注意:请求所针对的UnfencedMessage是在作业管理器中本地使用的(这与异常中的接收参与者是作业管理器重合),因此我们可以消除 JobMaster 和任务管理器之间的网络错误配置。

我在执行之前使用/jars/upload端点在 flink 上暂存 jar。似乎 flink 的性能坦克上传了太多罐子。所有终结点都变得无响应,包括/jobs/<job_id>终结点。在 flink UI 中加载作业图概述需要 1 - 2 分钟。我想这个休息端点使用与作业管理器相同的 akka actor。我想我一定达到了一个临界点,这开始导致超时。我已经将 30 个奇数的罐子数量减少到只有 4 个最新版本,flink 再次响应。

最新更新