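Configuring separate probe alerts for an Alibaba Cloud ACK cluster
1. Background
In Alibaba Cloud ACK, alerts for the liveness, readiness, and startup probes are part of the general Warn alert rule. That rule fires on every single matching event, so in our projects the three probes generate alerts far too often. We therefore need to split the three probe alerts out of the general Warn alert rule.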
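2. Configuration
1. Open the ACK cluster's alert settings: Alert Configuration (报警配置) → Operations Management (运维管理) → Alert Configuration (告警配置).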

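2. Click the Warn event set → Advanced Settings → search for 通用 (general).
All Warn events are handled here.
Note: ACK's K8s event alerts all work the same way. The events are written to SLS logs, and notifications are sent by SLS alert rules that query those logs. Being familiar with SLS queries and able to edit their SQL is therefore all that is needed here.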
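Since the rest of this post only edits SLS queries, it helps to see how one is put together: a search statement, then a pipe, then an analytic SQL statement. The following is a minimal Python sketch of that split. The helper name `split_sls_query` is hypothetical, and splitting on the first `|` is a simplification that assumes the search statement contains no pipe character of its own.

```python
from typing import Tuple


def split_sls_query(query: str) -> Tuple[str, str]:
    """Split an SLS query into (search statement, analytic SQL).

    Simplification: splits on the first '|' only, so it assumes the
    search statement itself contains no pipe character.
    """
    search, _, analytic = query.partition("|")
    return search.strip(), analytic.strip()


# A shortened query in the same shape as the rules edited in this post.
query = (
    'level : Warning and not eventId.reason: "CIS.ScheduleTask.Fail" '
    "| SELECT COUNT(*) as cnt FROM log"
)

search, analytic = split_sls_query(query)
print("search  :", search)
print("analytic:", analytic)
```

The search statement decides which log lines are considered at all, and the SQL after the pipe filters and aggregates them. The edits in the following sections only touch the `where` clause of the SQL part.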

2.1 Removing the three probes from the Warn alert
1. Click Edit on the K8s通用Warn警示事件 (K8s general Warn event) rule.
 
 
Change the query to the following:

level : Warning and not "Error updating Endpoint Slices for Service" and not (eventId.reason: AccessACRApiFailed and eventId.message:USER_NOT_EXIST) and not eventId.reason: "CIS.ScheduleTask.Warning" and not eventId.reason: "CIS.ScheduleTask.Fail"
| SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt
FROM log
where "eventId.message" not like 'Liveness probe failed%'
  and "eventId.message" not like 'Readiness probe failed%'
  and "eventId.message" not like 'Startup probe failed%'
GROUP by namespace, kind, object_name

The three not like conditions in the where clause are the change: they exclude the events of the three probes from the general Warn alert. Their prefixes match the like conditions used for the dedicated probe rules in section 2.2.
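To make the effect of the modified query concrete, here is a small Python sketch that mimics it on in-memory data: drop every event whose message starts with one of the three probe prefixes, then group the rest by namespace, kind, and object name. The sample events and the function names are made up for illustration, and the dictionary keys are shortened versions of the `eventId.*` fields. This runs locally and does not call SLS.

```python
from collections import defaultdict

# Prefixes excluded from the general Warn rule (LIKE 'prefix%' is a prefix match).
PROBE_PREFIXES = (
    "Liveness probe failed",
    "Readiness probe failed",
    "Startup probe failed",
)


def is_probe_failure(message: str) -> bool:
    """True if the message would be removed by the three `not like` conditions."""
    return message.startswith(PROBE_PREFIXES)


def aggregate_general_warnings(events):
    """Mimic the where + GROUP BY of the general Warn query."""
    groups = defaultdict(list)
    for event in events:
        if is_probe_failure(event["message"]):
            continue  # handled by a dedicated probe rule instead
        key = (event["namespace"], event["kind"], event["object_name"])
        groups[key].append(event["message"])
    return [
        {"namespace": ns, "kind": kind, "object_name": name,
         "message": messages, "cnt": len(messages)}
        for (ns, kind, name), messages in groups.items()
    ]


# Hypothetical sample events, not taken from a real cluster.
sample_events = [
    {"namespace": "demo", "kind": "Pod", "object_name": "web-0",
     "message": "Readiness probe failed: HTTP probe failed with statuscode: 503"},
    {"namespace": "demo", "kind": "Pod", "object_name": "web-0",
     "message": "Back-off restarting failed container"},
    {"namespace": "demo", "kind": "Pod", "object_name": "web-0",
     "message": "Back-off restarting failed container"},
]

rows = aggregate_general_warnings(sample_events)
for row in rows:
    print(row)
```

Only the two back-off events remain in the general rule, aggregated into a single row for the pod. The readiness failure is left for the dedicated probe rule.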
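2.2 Custom rules for the startup, liveness, and readiness probes
The existing rule serves as a ready-made template: copy it and adjust the copy.
1. Duplicate the K8s通用Warn警示事件 rule.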
 
 
2. Give the copy a name of your choice and select the project and Logstore.
 
 
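3. Modify the alert rule's SQL.
The queries for the startup, liveness, and readiness probes are listed below. They are identical apart from the message prefix in the like condition, so each rule is created by copying the previous one and changing that prefix.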

Startup probe:

* and not "Error updating Endpoint Slices for Service" and not (eventId.reason: AccessACRApiFailed and eventId.message:USER_NOT_EXIST) and not eventId.reason: "CIS.ScheduleTask.Warning" and not eventId.reason: "CIS.ScheduleTask.Fail"
| SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt
FROM log
where "eventId.message" like 'Startup probe failed%'
GROUP by namespace, kind, object_name

Liveness probe:

* and not "Error updating Endpoint Slices for Service" and not (eventId.reason: AccessACRApiFailed and eventId.message:USER_NOT_EXIST) and not eventId.reason: "CIS.ScheduleTask.Warning" and not eventId.reason: "CIS.ScheduleTask.Fail"
| SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt
FROM log
where "eventId.message" like 'Liveness probe failed%'
GROUP by namespace, kind, object_name

Readiness probe:

* and not "Error updating Endpoint Slices for Service" and not (eventId.reason: AccessACRApiFailed and eventId.message:USER_NOT_EXIST) and not eventId.reason: "CIS.ScheduleTask.Warning" and not eventId.reason: "CIS.ScheduleTask.Fail"
| SELECT ARRAY_AGG("eventId.message") as message, "eventId.metadata.namespace" as namespace, "eventId.involvedObject.kind" as kind, "eventId.involvedObject.name" as object_name, COUNT(*) as cnt
FROM log
where "eventId.message" like 'Readiness probe failed%'
GROUP by namespace, kind, object_name

Apply each query to the rule for the corresponding probe. The result looks like this:
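Because the three probe queries differ only in the message prefix, they can also be generated from one template instead of being edited by hand. The following Python sketch does that. The search and SQL text is copied from the queries above, while the dictionary keys and the helper name `build_probe_query` are illustrative choices, not something the ACK console provides.

```python
# Search statement shared by all three probe rules (copied from the queries above).
SEARCH = (
    '* and not "Error updating Endpoint Slices for Service" '
    "and not (eventId.reason: AccessACRApiFailed and eventId.message:USER_NOT_EXIST) "
    'and not eventId.reason: "CIS.ScheduleTask.Warning" '
    'and not eventId.reason: "CIS.ScheduleTask.Fail"'
)

# Analytic SQL with a placeholder for the probe-specific message prefix.
ANALYTIC_TEMPLATE = (
    'SELECT ARRAY_AGG("eventId.message") as message, '
    '"eventId.metadata.namespace" as namespace, '
    '"eventId.involvedObject.kind" as kind, '
    '"eventId.involvedObject.name" as object_name, '
    "COUNT(*) as cnt FROM log "
    "where \"eventId.message\" like '{prefix}%' "
    "GROUP by namespace, kind, object_name"
)

# Hypothetical rule keys mapped to the message prefix each rule matches.
PROBE_PREFIXES = {
    "startup": "Startup probe failed",
    "liveness": "Liveness probe failed",
    "readiness": "Readiness probe failed",
}


def build_probe_query(prefix: str) -> str:
    """Return the full SLS query for one probe type."""
    return f"{SEARCH} | {ANALYTIC_TEMPLATE.format(prefix=prefix)}"


queries = {name: build_probe_query(prefix) for name, prefix in PROBE_PREFIXES.items()}

for name, query in queries.items():
    print(f"--- {name} ---")
    print(query)
```

Each printed query can then be pasted into the rule for that probe. Generating them from one template keeps the shared exclusions in sync if they change later.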
 
 
2.3 Verification
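Before waiting for real probe failures to arrive, the routing logic can be sanity-checked offline. The sketch below assumes four rules, the general Warn rule plus one rule per probe, and checks which rule a given event message would land in. The rule names and sample messages are hypothetical. This only dry-runs the prefix logic and does not replace checking the alerts actually delivered by SLS.

```python
# Hypothetical rule names mapped to the message prefix each probe rule matches.
PROBE_RULES = {
    "startup-probe": "Startup probe failed",
    "liveness-probe": "Liveness probe failed",
    "readiness-probe": "Readiness probe failed",
}
GENERAL_RULE = "general-warn"


def matching_rules(message: str) -> list:
    """Return the rules that would pick up a Warning event with this message."""
    matched = [rule for rule, prefix in PROBE_RULES.items()
               if message.startswith(prefix)]
    # The general rule excludes exactly the three prefixes above,
    # so it matches only when no probe rule does.
    return matched or [GENERAL_RULE]


# Made-up sample messages for the dry run.
samples = [
    "Startup probe failed: dial tcp 10.0.0.1:8080: connect: connection refused",
    "Liveness probe failed: HTTP probe failed with statuscode: 500",
    "Readiness probe failed: Get \"http://10.0.0.1:8080/ready\": context deadline exceeded",
    "Back-off restarting failed container",
]

for message in samples:
    rules = matching_rules(message)
    # Each event should be routed to exactly one rule, never zero or two.
    assert len(rules) == 1, (message, rules)
    print(f"{rules[0]:<16} <- {message}")
```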
