Configuring Separate Probe Alerts for Alibaba Cloud ACK Clusters

1. Background

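  In Alibaba Cloud ACK, alerts for the liveness, readiness, and startup probes are part of the general warn alert, and that alert fires on a single occurrence. As a result, the three probes generate alerts far too often in our company's projects. We therefore need to split the alerts for these three probes out of the general warn alert.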

 

 

2. Configuration

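  1. Locate the alert settings of the ACK cluster: Alerts → Operations Management → Alert Configuration

  2. Click the warn event set → Advanced Settings → search for "通用" (general)

  All warn events are listed here.

  Note: K8s event alerts in Alibaba Cloud ACK work by recording events as SLS logs and sending notifications through SLS alerts. So all you need here is to be familiar with SLS SQL and know how to modify it.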

  2.1 Excluding the three probes from the warn alert

  1. Click Edit on the rule "K8s通用Warn警示事件" (the K8s general Warn event rule)

 

  Change the SQL to the following:

SQL
level : Warning  and not "Error updating Endpoint Slices for Service" and not (eventId.reason: AccessACRApiFailed and eventId.message:USER_NOT_EXIST) and not eventId.reason: "CIS.ScheduleTask.Warning" and not eventId.reason: "CIS.ScheduleTask.Fail" | SELECT
  ARRAY_AGG("eventId.message") as message,
  "eventId.metadata.namespace" as namespace,
  "eventId.involvedObject.kind" as kind,
  "eventId.involvedObject.name" as object_name,
  COUNT(*) as cnt
FROM  log
where
  "eventId.message" not like 'Liveness probe failed%'
  and "eventId.message" not like 'Readiness probe failed:%'
  and "eventId.message" not like 'Startup probe failed:%'
GROUP by
  namespace,
  kind,
  object_name

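  In the code block above, the three "not like" conditions in the where clause are what exclude the three probes' events from the general warn alert.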
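
  To check which events the three conditions drop, the filter can be emulated outside SLS. The following is a minimal Python sketch, not SLS code: it treats each SQL pattern of the form 'prefix%' as a plain prefix match, assumes every message is a non-null string, and uses made-up sample messages purely for illustration. The function name stays_in_general_warn is hypothetical.

```python
# Prefixes copied from the three "not like" conditions in the where clause.
PROBE_PREFIXES = (
    "Liveness probe failed",
    "Readiness probe failed:",
    "Startup probe failed:",
)


def stays_in_general_warn(message: str) -> bool:
    """Emulate `message NOT LIKE '<prefix>%'` for every probe prefix.

    Assumes `message` is a non-null string; SQL NULL handling is not modeled.
    """
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return not message.startswith(PROBE_PREFIXES)


# Made-up sample event messages, for illustration only.
samples = [
    "Liveness probe failed: HTTP probe failed with statuscode: 500",
    "Readiness probe failed: dial tcp 10.0.0.1:8080: connection refused",
    "Back-off restarting failed container",
]

# Only events that do not start with a probe prefix remain in the general alert.
kept = [m for m in samples if stays_in_general_warn(m)]
print(kept)
```

  In this sketch only the last sample survives the filter, which is the behaviour expected from the modified general warn rule: probe failures no longer reach it, while every other Warning event still does.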

 

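  2.2 Custom alerts for the startup, liveness, and readiness probes

  A ready-made template already exists, so it can simply be copied and modified.

  1. Copy the rule "K8s通用Warn警示事件"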

 

  2. Give the copy a custom name and select the project and logstore

 

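  3. Modify the SQL of the alert rule

  The startup probe query comes first. The liveness and readiness queries are copies of it in which only the like pattern in the where clause changes.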

SQL
# Startup probe
* and not "Error updating Endpoint Slices for Service" and not (eventId.reason: AccessACRApiFailed and eventId.message:USER_NOT_EXIST) and not eventId.reason: "CIS.ScheduleTask.Warning" and not eventId.reason: "CIS.ScheduleTask.Fail" | SELECT
  ARRAY_AGG("eventId.message") as message,
  "eventId.metadata.namespace" as namespace,
  "eventId.involvedObject.kind" as kind,
  "eventId.involvedObject.name" as object_name,
  COUNT(*) as cnt
FROM  log
where "eventId.message"  like 'Startup probe failed%'
GROUP by
  namespace,
  kind,
  object_name
  

# Liveness probe
* and not "Error updating Endpoint Slices for Service" and not (eventId.reason: AccessACRApiFailed and eventId.message:USER_NOT_EXIST) and not eventId.reason: "CIS.ScheduleTask.Warning" and not eventId.reason: "CIS.ScheduleTask.Fail" | SELECT
  ARRAY_AGG("eventId.message") as message,
  "eventId.metadata.namespace" as namespace,
  "eventId.involvedObject.kind" as kind,
  "eventId.involvedObject.name" as object_name,
  COUNT(*) as cnt
FROM  log
where "eventId.message"  like 'Liveness probe failed%'
GROUP by
  namespace,
  kind,
  object_name
  
# Readiness probe
* and not "Error updating Endpoint Slices for Service" and not (eventId.reason: AccessACRApiFailed and eventId.message:USER_NOT_EXIST) and not eventId.reason: "CIS.ScheduleTask.Warning" and not eventId.reason: "CIS.ScheduleTask.Fail" | SELECT
  ARRAY_AGG("eventId.message") as message,
  "eventId.metadata.namespace" as namespace,
  "eventId.involvedObject.kind" as kind,
  "eventId.involvedObject.name" as object_name,
  COUNT(*) as cnt
FROM  log
where "eventId.message"  like 'Readiness probe failed%'
GROUP by
  namespace,
  kind,
  object_name
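
  Because the three queries differ only in the probe name inside the like pattern, they can also be generated from one template instead of being edited by hand. The following is a minimal Python sketch under that assumption: the template text is copied from the queries above, while the names QUERY_TEMPLATE, PROBES, and build_probe_query are hypothetical helpers introduced here for illustration. The output is meant to be pasted into the alert rule manually; the sketch does not call any SLS API.

```python
# Query text copied from the three queries above; {probe} marks the only
# part that differs between them.
QUERY_TEMPLATE = (
    '* and not "Error updating Endpoint Slices for Service" '
    "and not (eventId.reason: AccessACRApiFailed and eventId.message:USER_NOT_EXIST) "
    'and not eventId.reason: "CIS.ScheduleTask.Warning" '
    'and not eventId.reason: "CIS.ScheduleTask.Fail" | SELECT\n'
    '  ARRAY_AGG("eventId.message") as message,\n'
    '  "eventId.metadata.namespace" as namespace,\n'
    '  "eventId.involvedObject.kind" as kind,\n'
    '  "eventId.involvedObject.name" as object_name,\n'
    "  COUNT(*) as cnt\n"
    "FROM  log\n"
    "where \"eventId.message\"  like '{probe} probe failed%'\n"
    "GROUP by\n"
    "  namespace,\n"
    "  kind,\n"
    "  object_name"
)

# Probe names as they appear at the start of the event messages.
PROBES = ("Startup", "Liveness", "Readiness")


def build_probe_query(probe: str) -> str:
    """Return the query for one probe type, rejecting unknown names."""
    if probe not in PROBES:
        raise ValueError(f"unknown probe type: {probe!r}")
    return QUERY_TEMPLATE.format(probe=probe)


# One query per probe, keyed by probe name.
queries = {probe: build_probe_query(probe) for probe in PROBES}

for probe, query in queries.items():
    print(f"# {probe} probe\n{query}\n")
```

  Keeping a single template makes it harder for the three rules to drift apart when the shared exclusions in the search part are changed later.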

  Update the rule for each probe with the corresponding SQL above. The result looks like this:

 

 

  2.3 Verification

 

 

posted @ 2025-03-14 17:49  小家电维修