[Flink/FAQ] Flink job fails after days of normal operation with: `NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout`

1 Problem Description

  • A Flink job (Flink 1.12) had run normally for nearly 7 months, then crashed suddenly today. The job retried 3 times, logging the errors below each time, and then stopped. The JobManager log reported:

Symptom: Failed to trigger checkpoint for job ffffffffd8e074f7000001936c41519d because Not all required tasks are currently running.
Root cause: NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout

...
2025-06-06 08:01:12,323 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  777 [flink-akka.actor.default-dispatcher-3]  - Received new TaskManager pod: flink-284536-taskmanager-1-67-80c6783c-ff44-4a7c-96b8-390701b77698
2025-06-06 08:01:12,323 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager 426 [flink-akka.actor.default-dispatcher-3]  - Requested worker flink-284536-taskmanager-1-67-80c6783c-ff44-4a7c-96b8-390701b77698 with resource spec WorkerResourceSpec {cpuCores=1.4, taskHeapSize=3.147gb (3378952231 bytes), taskOffHeapSize=0 bytes, networkMemSize=128.000mb (134217728 bytes), managedMemSize=2.348gb (2521070338 bytes)}.
2025-06-06 08:01:12,326 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  299 [flink-akka.actor.default-dispatcher-3]  - Pod flink-284536-taskmanager-1-67-80c6783c-ff44-4a7c-96b8-390701b77698 is created.
2025-06-06 08:01:13,657 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    1975 [Checkpoint Timer]  - Checkpoint triggering task Source: vehicle status kafka source (1/1) of job ffffffffd8e074f7000001936c41519d is not in state RUNNING but SCHEDULED instead. Aborting checkpoint.
2025-06-06 08:01:13,657 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    914 [Checkpoint Timer]  - Failed to trigger checkpoint for job ffffffffd8e074f7000001936c41519d because Not all required tasks are currently running.
2025-06-06 08:01:13,687 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       1778 [flink-akka.actor.default-dispatcher-2]  - Filter:过滤XXXX数据 (2/2) (ba16f2343dcdd1ef42f614baedfbecff) switched from SCHEDULED to FAILED on [unassigned resource].
java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_412]
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_412]
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) ~[?:1.8.0_412]
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) ~[?:1.8.0_412]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_412]
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_412]
	at org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:223) ~[flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.cancelLogicalSlotRequest(SlotSharingExecutionSlotAllocator.java:168) ~[flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at org.apache.flink.runtime.scheduler.SharingPhysicalSlotRequestBulk.cancel(SharingPhysicalSlotRequestBulk.java:86) ~[flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkWithTimestamp.cancel(PhysicalSlotRequestBulkWithTimestamp.java:66) ~[flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:91) ~[flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_412]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_412]
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440) ~[flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208) ~[flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77) ~[flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) ~[flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout
	at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:86) ~[flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	... 24 more
Caused by: java.util.concurrent.TimeoutException: Timeout has occurred: 300000 ms
	at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:86) ~[flink-dist_2.11-1.12.2-h0.cbu.dli.233.r34.jar:1.12.2-h0.cbu.dli.233.r34]
	... 24 more
2025-06-06 08:01:13,688 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       1786 [flink-akka.actor.default-dispatcher-2]  - Call stack:
    at java.lang.Thread.getStackTrace(Thread.java:1564)
    at org.apache.flink.runtime.executiongraph.Execution.transitionState(Execution.java:1787)
    at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1443)
    at org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1216)
    at org.apache.flink.runtime.executiongraph.ExecutionGraph.updateStateInternal(ExecutionGraph.java:1626)
    at org.apache.flink.runtime.executiongraph.ExecutionGraph.updateState(ExecutionGraph.java:1588)
    at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:664)
    at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:56)
    at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1873)
    at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1438)
    at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1378)
    at org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1206)
    at org.apache.flink.runtime.executiongraph.ExecutionVertex.markFailed(ExecutionVertex.java:758)
    at org.apache.flink.runtime.scheduler.DefaultExecutionVertexOperations.markFailed(DefaultExecutionVertexOperations.java:41)
    at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskDeploymentFailure(DefaultScheduler.java:522)
    at org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:507)
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
    at org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:223)
    at org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.cancelLogicalSlotRequest(SlotSharingExecutionSlotAllocator.java:168)
    at org.apache.flink.runtime.scheduler.SharingPhysicalSlotRequestBulk.cancel(SharingPhysicalSlotRequestBulk.java:86)
    at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkWithTimestamp.cancel(PhysicalSlotRequestBulkWithTimestamp.java:66)
    at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:91)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2025-06-06 08:01:13,688 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       1595 [flink-akka.actor.default-dispatcher-2]  - Discarding the results produced by task execution ba16f2343dcdd1ef42f614baedfbecff.
...
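
As a rough aid for reading a log like the one above, the sketch below pulls out two signals: which tasks switched to FAILED (and from which state), and the timeout value. The function names and regular expressions are assumptions written against these sample lines; they are not part of Flink. A task that fails while still SCHEDULED never received a slot, and the 300000 ms seen here equals the documented default of Flink's `slot.request.timeout` (5 minutes).

```python
import re

# Two lines copied verbatim from the JobManager log above.
LOG = """\
2025-06-06 08:01:13,687 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       1778 [flink-akka.actor.default-dispatcher-2]  - Filter:过滤XXXX数据 (2/2) (ba16f2343dcdd1ef42f614baedfbecff) switched from SCHEDULED to FAILED on [unassigned resource].
Caused by: java.util.concurrent.TimeoutException: Timeout has occurred: 300000 ms
"""

# Hypothetical patterns, tuned only to the log format shown in this post.
FAILED_RE = re.compile(
    r" - (?P<task>.+?) \((?P<attempt>[0-9a-f]{32})\) "
    r"switched from (?P<old>\w+) to FAILED"
)
TIMEOUT_RE = re.compile(r"Timeout has occurred: (\d+) ms")


def find_failed_tasks(log: str) -> list[tuple[str, str]]:
    """Return (task name, previous state) for every task that switched to FAILED."""
    return [(m.group("task"), m.group("old")) for m in FAILED_RE.finditer(log)]


def find_timeouts_ms(log: str) -> list[int]:
    """Return every 'Timeout has occurred' value found in the log, in milliseconds."""
    return [int(ms) for ms in TIMEOUT_RE.findall(log)]


if __name__ == "__main__":
    for task, old_state in find_failed_tasks(LOG):
        print(f"{task}: failed while {old_state}")
    print("timeouts (ms):", find_timeouts_ms(LOG))
```

If every failed task reports SCHEDULED as its previous state, the problem is most likely on the resource side (no slot arrived in time) rather than inside the user code.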

2 Cause Analysis

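  • The resource queue (e.g. YARN or K8s) lacks resources, so the TaskManager cannot be created or started.
  • Resources are insufficient, so slots cannot be allocated.
  • The user's jar conflicts with jars in the environment. To check, run the WordCount example and see whether it succeeds.
  • If the cluster is a secure cluster, the Flink SSL certificate may be misconfigured or expired.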
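
To make the resource question concrete, here is a small sketch that adds up the memory components from the `WorkerResourceSpec` line in the log and compares them with the free capacity of a node. The byte values and the CPU figure are copied from the log; the node capacities and the `fits` helper are hypothetical. The real pod request is larger than this sum, because Flink also adds framework memory, JVM metaspace and JVM overhead, so treat the result as a lower bound.

```python
# Values copied from the "Requested worker ... WorkerResourceSpec" log line.
SPEC_BYTES = {
    "taskHeapSize": 3378952231,
    "taskOffHeapSize": 0,
    "networkMemSize": 134217728,
    "managedMemSize": 2521070338,
}
CPU_CORES = 1.4

GIB = 1024 ** 3


def spec_total_bytes(spec: dict[str, int]) -> int:
    """Sum of the memory components listed in the WorkerResourceSpec."""
    return sum(spec.values())


def fits(free_cpu: float, free_bytes: int, need_cpu: float, need_bytes: int) -> bool:
    """True if a node with the given free capacity could host the request."""
    return free_cpu >= need_cpu and free_bytes >= need_bytes


if __name__ == "__main__":
    need = spec_total_bytes(SPEC_BYTES)
    print(f"lower bound per TaskManager: {need / GIB:.2f} GiB, {CPU_CORES} cores")

    # Hypothetical node: 2 free cores but only 4 GiB of free memory.
    print("fits on example node:", fits(2.0, 4 * GIB, CPU_CORES, need))
```

With the values from this log, the lower bound is about 5.62 GiB per TaskManager, so a node with less free memory than that cannot host the new pod, however many cores it has free.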

3 Solutions

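  • Check whether resources are sufficient. (Verified to work)
  • If they are not, add resources to the queue.
  • Restart the Flink job.
  • Exclude the Flink and Hadoop dependencies from the user jar and rely on the jars provided by the environment. (Not verified)
  • Reconfigure the Flink SSL certificate; see "从零开始使用Flink" (Getting Started with Flink from Scratch). (Not verified)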

X References

Insufficient resources

Insufficient-resources error: NoResourceAvailableException: Could not acquire the minimum required resources
jobmanager.memory.process.size: 2600m
taskmanager.memory.process.size: 2728m
taskmanager.memory.flink.size: 2280m

Created 2025.5.23
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout

posted @ 2025-06-06 10:07  千千寰宇