Flink on YARN: High Availability

1. Flink cluster configuration (flink-conf.yaml)

high-availability: zookeeper
high-availability.storageDir: hdfs:///flink/ha
jobmanager.execution.failover-strategy: full
high-availability.zookeeper.quorum: hadoop01:2181,hadoop02:2181,hadoop03:2181
high-availability.zookeeper.path.root: /flink-ha
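
Before restarting the cluster it can help to confirm that the HA keys above are actually present. The following is a minimal sketch, not a Flink tool: it assumes the file is flat `key: value` lines as shown above (it is not a full YAML parser), and the names `parse_flat_conf`, `missing_ha_keys`, and `REQUIRED_HA_KEYS` are hypothetical.

```python
# Keys taken from the configuration shown above; treating these three as
# "required" is an assumption made for this sketch.
REQUIRED_HA_KEYS = (
    "high-availability",
    "high-availability.storageDir",
    "high-availability.zookeeper.quorum",
)


def parse_flat_conf(text):
    """Parse flat 'key: value' lines into a dict, skipping blanks and '#' comment lines."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only: values such as hdfs:///flink/ha
        # and host:port lists contain colons themselves.
        key, _, value = line.partition(":")
        conf[key.strip()] = value.strip()
    return conf


def missing_ha_keys(conf):
    """Return the required HA keys that are absent or empty."""
    return [key for key in REQUIRED_HA_KEYS if not conf.get(key)]


SAMPLE = """\
high-availability: zookeeper
high-availability.storageDir: hdfs:///flink/ha
jobmanager.execution.failover-strategy: full
high-availability.zookeeper.quorum: hadoop01:2181,hadoop02:2181,hadoop03:2181
high-availability.zookeeper.path.root: /flink-ha
"""

print(missing_ha_keys(parse_flat_conf(SAMPLE)))  # prints []
```

To check a real file, pass its contents instead of `SAMPLE`, for example `parse_flat_conf(open(path).read())` with `path` pointing at your own flink-conf.yaml.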

 

2. Flink job configuration (restart strategy)

# restart-strategy.type accepts the following values (option 3 is recommended):
    # 1. disable, off, none: No restart strategy.
    # 2. fixed-delay, fixeddelay: Fixed delay restart strategy.
    # 3. failure-rate, failurerate: Failure rate restart strategy.
    # 4. exponential-delay, exponentialdelay: Exponential delay restart strategy. 
# If checkpointing is disabled, the default value is disable. If checkpointing is enabled, the default value is exponential-delay, and the default values of exponential-delay related config options will be used.

# Failure-rate strategy (restart-strategy.type = "failurerate")
restart-strategy.type = "failurerate"
restart-strategy.failure-rate.delay = 1s
restart-strategy.failure-rate.failure-rate-interval = 1min
restart-strategy.failure-rate.max-failures-per-interval = 1
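# Example: the configuration below means that if tasks fail more than 3 times within a 10-minute window, the job goes to FAILED and is no longer restarted; each restart is delayed by 10 seconds.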
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 10 min
restart-strategy.failure-rate.delay: 10 s
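
To make the failure-rate rule concrete, here is a small simulation of it as worded in the comment above. It is only a sketch of that rule, not Flink's implementation: the class name `FailureRateRule` and its method are hypothetical, and the exact boundary (whether the job dies on the 3rd or the 4th failure inside the window) may differ between Flink versions. The sketch follows the wording "more than 3 failures", which matches the test in section 3.

```python
from collections import deque


class FailureRateRule:
    """Sliding-window failure counter (illustrative only, not Flink code)."""

    def __init__(self, max_failures_per_interval, interval_s, delay_s):
        self.max_failures = max_failures_per_interval
        self.interval_s = interval_s
        self.delay_s = delay_s  # wait before each restart; not simulated here
        self._failures = deque()  # timestamps (seconds) of recent failures

    def on_failure(self, now_s):
        """Record a failure at time now_s and return 'RESTART' or 'FAILED'."""
        self._failures.append(now_s)
        # Forget failures that have fallen out of the time window.
        while now_s - self._failures[0] > self.interval_s:
            self._failures.popleft()
        if len(self._failures) > self.max_failures:
            return "FAILED"
        return "RESTART"


# max-failures-per-interval: 3, failure-rate-interval: 10 min, delay: 10 s
rule = FailureRateRule(max_failures_per_interval=3, interval_s=600, delay_s=10)

# Four failures one minute apart: the fourth exceeds the limit.
print([rule.on_failure(t) for t in (0, 60, 120, 180)])
# prints ['RESTART', 'RESTART', 'RESTART', 'FAILED']
```

If the same four failures are spread out so that no more than three ever fall inside one 10-minute window, every call returns `RESTART`, which is why this strategy tolerates occasional failures over a long-running job.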

 
# Fixed-delay strategy (restart-strategy.type = "fixeddelay"). Note: with this strategy, once the number of server restarts exceeds restart-strategy.fixed-delay.attempts, the job may fail to recover automatically.
restart-strategy.type = "fixeddelay"
restart-strategy.fixed-delay.attempts = 1
restart-strategy.fixed-delay.delay = 1s
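
The fixed-delay rule can be sketched the same way. Again this is an illustration of the behavior described in the note above and observed in section 4, not Flink's code; `FixedDelayRule` is a hypothetical name, and the sketch assumes the attempt counter is never reset while the job is alive.

```python
class FixedDelayRule:
    """Fixed number of restart attempts (illustrative only, not Flink code)."""

    def __init__(self, attempts, delay_s):
        self.attempts = attempts
        self.delay_s = delay_s  # wait before each restart; not simulated here
        self.restarts_used = 0  # assumption: never reset during the job's lifetime

    def on_failure(self):
        """Return 'RESTART' while attempts remain, otherwise 'FAILED'."""
        if self.restarts_used >= self.attempts:
            return "FAILED"
        self.restarts_used += 1
        return "RESTART"


# restart-strategy.fixed-delay.attempts = 2
rule = FixedDelayRule(attempts=2, delay_s=1)
print([rule.on_failure() for _ in range(3)])
# prints ['RESTART', 'RESTART', 'FAILED']
```

Under that assumption the budget is used up by the total number of failures, however far apart they are, which is the risk the note above warns about.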

    

 

3. Verification (using "failurerate" as the example)

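# Configuration used for this test: if tasks fail more than 3 times within a 10-minute window, the job goes to FAILED and is no longer restarted; each restart is delayed by 10 seconds.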
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 10 min
restart-strategy.failure-rate.delay: 10 s
  • Locate the TaskManager (YARN web UI)
  • Kill the TaskManager process (1st time)
  • Kill the TaskManager process (2nd time)
  • Kill the TaskManager process (3rd time)
  • Kill the TaskManager process (4th time)

Ha, the job is dead now: after the 4th kill it is not restarted.

Check the YARN log (jobmanager.log).
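
The steps in this section use the YARN web UI, but the same information can be pulled from the YARN command line. The sketch below only builds the command lines for `yarn applicationattempt -list`, `yarn container -list`, and `yarn logs -applicationId`; the helper names are hypothetical, the IDs in the comments are placeholders, and fetching logs this way typically requires YARN log aggregation to be enabled on your cluster.

```python
import subprocess


def list_attempts_cmd(app_id):
    """Command that lists the attempts of a YARN application."""
    return ["yarn", "applicationattempt", "-list", app_id]


def list_containers_cmd(app_attempt_id):
    """Command that lists the containers (JobManager and TaskManagers) of an attempt."""
    return ["yarn", "container", "-list", app_attempt_id]


def fetch_logs_cmd(app_id):
    """Command that dumps the application's logs, jobmanager.log included."""
    return ["yarn", "logs", "-applicationId", app_id]


def run(cmd):
    """Run a command and return its stdout; raises if the command fails."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


# Usage on a machine with the yarn client installed (IDs are placeholders):
#   print(run(list_attempts_cmd("application_<clusterTimestamp>_<id>")))
#   print(run(fetch_logs_cmd("application_<clusterTimestamp>_<id>")))
```

Building the commands as lists rather than shell strings avoids quoting problems and keeps the helpers easy to check without a cluster.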

 

4. Verification (using "fixeddelay" as the example; this mainly checks the TaskManager failure count)

# restart-strategy.fixed-delay.attempts (once TaskManager failures exceed 2, the Flink job cannot recover)
restart-strategy.fixed-delay.attempts=2
  • Locate the job's TaskManager (YARN web UI)
  • Kill the TaskManager process (1st time)
  • Kill the TaskManager process again (2nd time)
  • Kill the TaskManager process again (3rd time)

Ha, the job is dead now: after the 3rd kill it is not restarted.

Check the YARN log (jobmanager.log).

posted @ 2018-08-13 18:16  lvlin241