我们正在kubernetes和azure上运行一个5节点的flink集群(每个8 gb ram,总共40个插槽)。我们正在运行四个作业,所有作业都使用kafka的数据(每个作业都在不同的消费者组中)。几天前,随着数据负载的增加,我们将生产商转移到5个kafka分区上生产数据,并将作业并行性转移到5。从那时起,我们的一位任务经理不时(平均每小时)出现以下异常:
NFO|N||-|||Flink-4jc| 2019-01-22 16:00:32,032 Task:917 - org=[] - Map (2/5) (949a8349e7bdcf3fe3b8f992f52d249c) switched from RUNNING to FAILED.
org.apache.flink.runtime.io.network.partition.PartitionNotFoundException: Partition 86656e59799eb529f24bac704ea06790@b1955e1a072e3b2f9e1f969fea509841 not found.
at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:273)
at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:182)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:400)
at org.apache.flink.runtime.taskmanager.Task.onPartitionStateUpdate(Task.java:1293)
at org.apache.flink.runtime.taskmanager.Task.lambda$triggerPartitionProducerStateCheck$1(Task.java:1150)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
异常发生在不同的任务和作业上。我已经阅读了以下内容:http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/PartitionNotFoundException-when-running-in-yarn-session-td16081.html这给了我一些可能导致异常的提示,但我仍然不知道是什么导致了我的情况(增加超时和网络缓冲区大小没有帮助,我不明白为什么jar文件大小很重要)
有人能指导我如何调查正在发生的事情、我应该打开什么日志、更改什么配置等吗?如果需要任何其他细节,我很乐意提供。
谢谢!
可能您节点的网卡已满