Flink on YARN High Availability
1. Flink cluster configuration (flink-conf.yaml)
high-availability: zookeeper
high-availability.storageDir: hdfs:///flink/ha
jobmanager.execution.failover-strategy: full
high-availability.zookeeper.quorum: hadoop01:2181,hadoop02:2181,hadoop03:2181
high-availability.zookeeper.path.root: /flink-ha
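Before restarting the cluster, it can help to sanity-check that the HA keys are actually present in the file. The following is a minimal sketch, assuming a flat `key: value` layout like the one above. `REQUIRED_HA_KEYS`, `parse_flat_conf`, and `missing_ha_keys` are illustrative names, not part of Flink, and the choice of which keys to treat as mandatory is an assumption for a ZooKeeper-based setup.

```python
# Sketch: check that a flat flink-conf.yaml contains the HA keys listed above.
# All names here are hypothetical helpers, not Flink APIs.

REQUIRED_HA_KEYS = (
    "high-availability",
    "high-availability.storageDir",
    "high-availability.zookeeper.quorum",
)


def parse_flat_conf(text):
    """Parse 'key: value' lines, skipping blank lines and '#' comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first ': ' so values such as 'host:2181' stay intact.
        key, sep, value = line.partition(": ")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def missing_ha_keys(conf):
    """Return the required HA keys that are absent or empty."""
    return [k for k in REQUIRED_HA_KEYS if not conf.get(k)]


SAMPLE = """\
high-availability: zookeeper
high-availability.storageDir: hdfs:///flink/ha
jobmanager.execution.failover-strategy: full
high-availability.zookeeper.quorum: hadoop01:2181,hadoop02:2181,hadoop03:2181
high-availability.zookeeper.path.root: /flink-ha
"""

conf = parse_flat_conf(SAMPLE)
print(missing_ha_keys(conf))  # [] when every required key is set
```

An empty list means nothing is missing; any key that is returned should be added before relying on JobManager failover.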
2. Flink job configuration (restart strategy)
# restart-strategy.type accepts the following values (option 3 is recommended):
# 1. disable, off, none: No restart strategy.
# 2. fixed-delay, fixeddelay: Fixed delay restart strategy.
# 3. failure-rate, failurerate: Failure rate restart strategy.
# 4. exponential-delay, exponentialdelay: Exponential delay restart strategy.
# If checkpointing is disabled, the default value is disable. If checkpointing is enabled, the default value is exponential-delay, and the default values of exponential-delay related config options will be used.
# restart-strategy.type = "failurerate"
restart-strategy.type = "failurerate"
restart-strategy.failure-rate.delay = 1s
restart-strategy.failure-rate.failure-rate-interval = 1min
restart-strategy.failure-rate.max-failures-per-interval = 1
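# Example: the configuration below means that if tasks fail more than 3 times within a 10-minute window, the job enters the FAILED state and is not restarted again; each restart is delayed by 10 seconds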
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 10 min
restart-strategy.failure-rate.delay: 10 s
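To make the sliding-window rule concrete, here is a toy model of the failure-rate behaviour as it is described in the example above (the job stops restarting once failures inside the window exceed the maximum). It is only a sketch of that description, not Flink's implementation; the class `FailureRateModel` and its method names are hypothetical, and the restart delay is not modelled.

```python
from collections import deque


class FailureRateModel:
    """Toy model of the failure-rate rule as described in this article."""

    def __init__(self, max_failures_per_interval, interval_s):
        self.max_failures = max_failures_per_interval
        self.interval_s = interval_s
        self.failures = deque()  # timestamps (in seconds) of recent failures

    def on_failure(self, now_s):
        """Record a failure at now_s; return True if the job may restart."""
        self.failures.append(now_s)
        # Drop failures that have fallen out of the sliding window.
        while self.failures and now_s - self.failures[0] > self.interval_s:
            self.failures.popleft()
        return len(self.failures) <= self.max_failures


# max-failures-per-interval: 3, failure-rate-interval: 10 min (600 s)
model = FailureRateModel(max_failures_per_interval=3, interval_s=600)
# Four failures, two minutes apart, all inside one 10-minute window.
decisions = [model.on_failure(t) for t in (0, 120, 240, 360)]
print(decisions)  # [True, True, True, False]
```

Under this model the first three failures are followed by a restart and the fourth one inside the same window is not. Failures spaced further apart than the window never accumulate.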
# restart-strategy.type = "fixeddelay" (Note: with this strategy, once the number of restarts exceeds restart-strategy.fixed-delay.attempts, the job may no longer recover automatically)
restart-strategy.type = "fixeddelay"
restart-strategy.fixed-delay.attempts = 1
restart-strategy.fixed-delay.delay = 1s
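The fixed-delay strategy is simpler: it is a plain counter with no time window, so failures never "expire". A minimal sketch of that counting rule, assuming each failure consumes one restart attempt, is shown below; `FixedDelayModel` is a hypothetical name used only for illustration, and the delay between restarts is not modelled.

```python
class FixedDelayModel:
    """Toy model: allow at most `attempts` restarts over the job's lifetime."""

    def __init__(self, attempts):
        self.attempts = attempts
        self.restarts_used = 0

    def on_failure(self):
        """Return True if the job may restart after this failure."""
        if self.restarts_used >= self.attempts:
            return False  # restart budget exhausted, the job stays failed
        self.restarts_used += 1
        return True


# restart-strategy.fixed-delay.attempts = 1, as configured above
one_attempt = FixedDelayModel(attempts=1)
print([one_attempt.on_failure() for _ in range(2)])  # [True, False]

# With attempts = 2 the third failure is the one that is not recovered.
two_attempts = FixedDelayModel(attempts=2)
print([two_attempts.on_failure() for _ in range(3)])  # [True, True, False]
```

Because the counter never resets in this model, a long-running job with a small `attempts` value will eventually stop recovering, which is the risk the note above warns about.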
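3. Verification (using "failurerate" as the example)
# Example: the configuration below means that if tasks fail more than 3 times within a 10-minute window, the job enters the FAILED state and is not restarted again; each restart is delayed by 10 seconds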
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 10 min
restart-strategy.failure-rate.delay: 10 s
- Locate the TaskManager (TM) in the YARN web UI

- Kill the TM (1st time)

- Kill the TM (2nd time)


- Kill the TM (3rd time)


- Kill the TM (4th time)

Haha, the job is dead for good.

Check the YARN log (jobmanager.log)
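If the web UI is inconvenient, the same log can usually be pulled from the command line once YARN log aggregation has collected it. The sketch below only builds the command. It assumes a recent Hadoop release where `yarn logs` accepts `-applicationId` and `-log_files` (older releases may name the file filter differently), and the application ID is a made-up placeholder.

```python
import shlex


def yarn_logs_command(application_id, log_file="jobmanager.log"):
    """Build the argv for fetching one aggregated log file of a YARN app."""
    return [
        "yarn", "logs",
        "-applicationId", application_id,
        "-log_files", log_file,  # assumption: flag name on recent Hadoop
    ]


# Placeholder ID; take the real one from the YARN web UI.
cmd = yarn_logs_command("application_0000000000000_0001")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` on a host that has the `yarn` client configured.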

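4. Verification (using "fixeddelay" as the example; the main point is the TaskManager failure count)
# restart-strategy.fixed-delay.attempts (once the TaskManager has failed more than 2 times, the Flink job cannot recover)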
restart-strategy.fixed-delay.attempts=2
- Locate the Flink job's TaskManager in the YARN web UI

- Kill the TM (1st time)


- Kill the TM again (2nd time)


- Kill the TM again (3rd time)

Haha, the job is dead for good.

Check the YARN log (jobmanager.log)

