SkyWalking Alarm Feature
Alarm rules:
The SkyWalking alarm feature was added in the 6.x releases. It is driven by a set of rules defined in config/alarm-settings.yml. An alarm definition has two parts:
1. Alarm rules: they define how a metrics alarm is triggered and which conditions are evaluated.
2. Webhooks: they define which service endpoints are notified when an alarm is triggered.
Both parts are described in the official documentation:
https://github.com/apache/skywalking/blob/v8.5.0/docs/en/setup/backend/backend-alarm.md
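Rules:
Every SkyWalking distribution ships a default config/alarm-settings.yml file with some commonly used alarm rules predefined:
1. Service average response time exceeded 1 second in the last 3 minutes.
2. Service success rate was lower than 80% in the last 2 minutes.
3. Service response time percentiles exceeded 1 second in the last 3 minutes.
4. Service instance average response time exceeded 1 second in the last 2 minutes, and the instance name matches the regular expression.
5. Endpoint average response time exceeded 1 second in the last 2 minutes.
6. Database access average response time exceeded 1 second in the last 2 minutes.
7. Endpoint relation average response time exceeded 1 second in the last 2 minutes.
To see these predefined rules, open config/alarm-settings.yml.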
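Alarm rule configuration items:
Rule name: the name of the rule, which is also the unique name shown in the alarm message. It must end with _rule; the prefix is up to you.
Metrics name: the metric to evaluate. Its value is a metric name from the OAL script. Only long, double and int types are currently supported. See the official OAL script documentation (https://github.com/apache/skywalking/blob/master/docs/en/guides/backend-oal-scripts.md).
Include names: the entity names this rule applies to, such as service names or endpoint names (optional, defaults to all).
Exclude names: the entity names this rule does not apply to, such as service names or endpoint names (optional, defaults to none).
Threshold: the threshold value.
OP: the comparison operator. >, < and = are currently supported.
Period: how long a span the alarm rule is checked over. This is a time window that follows the clock of the backend deployment environment.
Count: within one Period window, if the number of values that cross the Threshold (as compared by OP) reaches Count, an alarm is sent.
Silence period: after an alarm is triggered at time TN, no alarm is sent from TN to TN + period. By default it equals Period, which means the same alarm (same ID under the same Metrics name) is triggered only once within one Period.
message: the alarm message.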
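The interplay of Period, Count and Silence period is easier to follow in code. The class below is a simplified model written for this article, not SkyWalking's actual implementation: the name AlarmRuleSketch and its check method are hypothetical, and it assumes one metric value is fed per check.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified, hypothetical model of how one alarm rule is evaluated.
// It is NOT the SkyWalking source; it only mirrors the documented fields.
public class AlarmRuleSketch {
    private final String op;
    private final double threshold;
    private final int period;         // size of the sliding window, in checks
    private final int count;          // matches needed inside the window
    private final int silencePeriod;  // checks to stay silent after firing
    private final Deque<Boolean> window = new ArrayDeque<>();
    private int silenceCountdown = 0;

    public AlarmRuleSketch(String op, double threshold, int period,
                           int count, int silencePeriod) {
        this.op = op;
        this.threshold = threshold;
        this.period = period;
        this.count = count;
        this.silencePeriod = silencePeriod;
    }

    // Compare one value against the threshold using the configured operator.
    private boolean matches(double value) {
        switch (op) {
            case ">": return value > threshold;
            case "<": return value < threshold;
            case "=": return value == threshold;
            default: throw new IllegalArgumentException("Unsupported op: " + op);
        }
    }

    // Feed one metric value; returns true when an alarm should be sent.
    public boolean check(double value) {
        window.addLast(matches(value));
        if (window.size() > period) {
            window.removeFirst();     // keep only the last `period` checks
        }
        long matched = window.stream().filter(m -> m).count();
        if (silenceCountdown > 0) {   // still inside the silence period
            silenceCountdown--;
            return false;
        }
        if (matched >= count) {
            silenceCountdown = silencePeriod;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Same numbers as service_resp_time_rule in the default file.
        AlarmRuleSketch rule = new AlarmRuleSketch(">", 1000, 10, 3, 5);
        for (int minute = 1; minute <= 10; minute++) {
            System.out.println("check " + minute + ": fired=" + rule.check(1200));
        }
    }
}
```

With these numbers the rule fires on the third consecutive slow value, stays silent for the next five checks, and can fire again afterwards if the condition still holds.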
Webhooks
Step 1: Register the receiving endpoint in SkyWalking. The webhook configuration sits at the end of config/alarm-settings.yml, and each entry has the form http://{ip}:{port}/{uri}
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_resp_time_percentile_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_percentile
    op: ">"
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  database_access_resp_time_rule:
    metrics-name: database_access_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
  endpoint_relation_resp_time_rule:
    metrics-name: endpoint_relation_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
  # Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
  # Because the number of endpoint is much more than service and instance.
  #
  # endpoint_avg_rule:
  #   metrics-name: endpoint_avg
  #   op: ">"
  #   threshold: 1000
  #   period: 10
  #   count: 2
  #   silence-period: 5
  #   message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

webhooks:
  # - http://127.0.0.1/notify/
  # - http://127.0.0.1/go-wechat/
  - http://127.0.0.1:8068/alarm/receive
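The service_resp_time_percentile_rule above uses a comma-separated threshold, one value each for p50, p75, p90, p95 and p99. The sketch below shows one way to read such a threshold; it assumes the condition matches as soon as any single percentile exceeds its own threshold, which is an interpretation of the rule's message, not something taken from the SkyWalking source. The class and method names are hypothetical.

```java
// Hypothetical helper illustrating a multi-value (percentile) threshold.
// Assumption: the condition matches if ANY percentile exceeds its threshold.
public class PercentileThresholdSketch {

    // Parse "1000,1000,1000,1000,1000" into one threshold per percentile.
    static double[] parseThresholds(String raw) {
        String[] parts = raw.split(",");
        double[] thresholds = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            thresholds[i] = Double.parseDouble(parts[i].trim());
        }
        return thresholds;
    }

    // values[i] is compared with thresholds[i] (p50, p75, p90, p95, p99).
    static boolean anyExceeds(double[] values, double[] thresholds) {
        if (values.length != thresholds.length) {
            throw new IllegalArgumentException("values and thresholds differ in length");
        }
        for (int i = 0; i < values.length; i++) {
            if (values[i] > thresholds[i]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        double[] thresholds = parseThresholds("1000,1000,1000,1000,1000");
        double[] sample = {200, 400, 800, 900, 1500}; // made-up latencies in ms
        System.out.println("condition matched: " + anyExceeds(sample, thresholds));
    }
}
```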
Step 2: The Java POJO
import lombok.Data;

import java.util.List;

@Data
public class SwAlarmDTO {

    private int scopeId;
    private String scope;
    private String name;
    private String id0;
    private String id1;
    private String ruleName;
    private String alarmMessage;
    private List<Tag> tags;
    private long startTime;
    private transient int period;
    private transient boolean onlyAsCondition;

    @Data
    public static class Tag {
        private String key;
        private String value;
    }
}
Step 3: The controller layer:
import com.tulingxueyuan.dto.SwAlarmDTO;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.List;

@Slf4j
@RestController
@RequiredArgsConstructor
@RequestMapping("/alarm")
public class SwAlarmController {

    /**
     * Receives alarm notifications from the SkyWalking server and forwards them by email.
     *
     * The endpoint must accept POST.
     */
    @PostMapping("/receive")
    public void receive(@RequestBody List<SwAlarmDTO> alarmList) {
        /*
        SimpleMailMessage message = new SimpleMailMessage();
        // Sender address
        message.setFrom(from);
        // Recipient address
        message.setTo(from);
        // Subject
        message.setSubject("Alarm email");
        String content = getContent(alarmList);
        // Mail body
        message.setText(content);
        sender.send(message);
        */
        // Mail sending is commented out above, so the alarm is only logged here.
        String content = getContent(alarmList);
        log.info("Alarm notification received...\n{}", content);
    }

    // Builds a readable text block with one section per alarm.
    private String getContent(List<SwAlarmDTO> alarmList) {
        StringBuilder sb = new StringBuilder();
        for (SwAlarmDTO dto : alarmList) {
            sb.append("scopeId: ").append(dto.getScopeId())
                    .append("\nscope: ").append(dto.getScope())
                    .append("\nEntity name of the target scope: ").append(dto.getName())
                    .append("\nID of the scope entity: ").append(dto.getId0())
                    .append("\nid1: ").append(dto.getId1())
                    .append("\nAlarm rule name: ").append(dto.getRuleName())
                    .append("\nAlarm message: ").append(dto.getAlarmMessage())
                    .append("\nAlarm time: ").append(dto.getStartTime())
                    .append("\nTags: ").append(dto.getTags())
                    .append("\n\n---------------\n\n");
        }
        return sb.toString();
    }
}
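Before waiting for a real alarm, the receiver can be exercised by hand. The following is a minimal sketch that posts a sample payload to the webhook URL configured above, using the JDK 11+ java.net.http client. The class name WebhookSmokeTest and every value in the payload are made up for illustration; only the field names come from SwAlarmDTO.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical smoke test for the /alarm/receive endpoint.
public class WebhookSmokeTest {

    // Illustrative payload: field names follow SwAlarmDTO, values are invented.
    static final String SAMPLE_PAYLOAD = "[{"
            + "\"scopeId\":1,"
            + "\"scope\":\"SERVICE\","
            + "\"name\":\"demo-service\","
            + "\"id0\":\"demo-id\","
            + "\"id1\":\"\","
            + "\"ruleName\":\"service_resp_time_rule\","
            + "\"alarmMessage\":\"Response time of service demo-service is more than 1000ms in 3 minutes of last 10 minutes.\","
            + "\"tags\":[{\"key\":\"level\",\"value\":\"WARNING\"}],"
            + "\"startTime\":1620000000000"
            + "}]";

    // Build a JSON POST request, since the controller only maps POST.
    static HttpRequest buildRequest(String url, String json) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request =
                buildRequest("http://127.0.0.1:8068/alarm/receive", SAMPLE_PAYLOAD);
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP status: " + response.statusCode());
    }
}
```

Running main requires the Spring application from Step 3 to be listening on port 8068; if it is, the formatted alarm text should appear in the application log.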