SkyWalking Alarm Feature

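Alarm rules:
The alarm feature was added in SkyWalking 6.x. Its core is driven by a set of rules defined in config/alarm-settings.yml. The alarm configuration has two parts:
1. Alarm rules: define how a metrics alarm is triggered and which conditions are evaluated.
https://github.com/apache/skywalking/blob/v8.5.0/docs/en/setup/backend/backend-alarm.md

2. Webhooks: define which service endpoints are notified when an alarm is triggered.
https://github.com/apache/skywalking/blob/v8.5.0/docs/en/setup/backend/backend-alarm.md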

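Rules:

Every SkyWalking release ships a default config/alarm-settings.yml with some commonly used alarm rules predefined:
1. Service average response time over 1 s in the last 3 minutes.
2. Service success rate below 80% in the last 2 minutes.
3. Service response time percentiles over 1 s in the last 3 minutes.
4. Service instance average response time over 1 s in the last 2 minutes, where the instance name matches a regular expression.
5. Endpoint average response time over 1 s in the last 2 minutes.
6. Database access average response time over 1 s in the last 2 minutes.
7. Endpoint relation average response time over 1 s in the last 2 minutes.
Open config/alarm-settings.yml to see these predefined rules. In the copy of the file reproduced under step 1 below, every rule uses period: 10, so each condition reads as "N minutes within the last 10 minutes", and the endpoint rule is commented out.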

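Alarm rule configuration items:

Rule name: the rule's name, which is also the unique name shown in the alarm message. It must end with _rule; the prefix is up to you.
Metrics name: the metric name, taken from the metric names in the OAL script. Only long, double and int values are supported for now. See Official OAL script (https://github.com/apache/skywalking/blob/master/docs/en/guides/backend-oal-scripts.md)
Include names: the entity names this rule applies to, such as service names or endpoint names (optional, defaults to all).
Exclude names: the entity names this rule does not apply to, such as service names or endpoint names (optional, defaults to none).
Threshold: the threshold value.
OP: the operator; >, < and = are supported for now.
Period: the length of the time window over which the rule is evaluated, in minutes. The window follows the clock of the backend deployment environment.
Count: within one Period window, an alarm is sent once the number of values that cross Threshold (according to OP) reaches Count.
Silence period: after an alarm fires at time N, no alarm is sent during TN -> TN + period. By default it equals Period, which means the same alarm (same ID under the same Metrics name) fires at most once within one Period.
message: the alarm message text.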
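
The way Period, Count and Silence period interact can be sketched with a small model. This is a simplified illustration and not SkyWalking's actual implementation: the class name AlarmRuleSketch, its method check, and the assumption that exactly one metric value arrives per minute are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified, hypothetical model of one alarm rule (not SkyWalking's real code). */
final class AlarmRuleSketch {
    private final String op;          // ">", "<" or "="
    private final double threshold;   // value compared against each sample
    private final int period;         // window length, in per-minute samples
    private final int count;          // matching samples needed to fire
    private final int silencePeriod;  // checks to stay quiet after firing
    private final Deque<Double> window = new ArrayDeque<>();
    private int silenceCountdown = 0;

    AlarmRuleSketch(String op, double threshold, int period, int count, int silencePeriod) {
        this.op = op;
        this.threshold = threshold;
        this.period = period;
        this.count = count;
        this.silencePeriod = silencePeriod;
    }

    private boolean matches(double value) {
        switch (op) {
            case ">": return value > threshold;
            case "<": return value < threshold;
            case "=": return value == threshold;
            default: throw new IllegalArgumentException("Unsupported op: " + op);
        }
    }

    /** Feeds one per-minute value and reports whether an alarm fires on this check. */
    boolean check(double value) {
        window.addLast(value);
        if (window.size() > period) {
            window.removeFirst(); // keep only the last `period` samples
        }
        if (silenceCountdown > 0) {
            silenceCountdown--;   // still silent after a previous alarm
            return false;
        }
        long matched = window.stream().filter(this::matches).count();
        if (matched >= count) {
            silenceCountdown = silencePeriod;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Same numbers as service_resp_time_rule: op ">", threshold 1000, period 10, count 3, silence-period 5.
        AlarmRuleSketch rule = new AlarmRuleSketch(">", 1000, 10, 3, 5);
        double[] perMinuteMs = {1200, 800, 1500, 1100, 1300, 1400};
        for (double v : perMinuteMs) {
            System.out.println(v + " ms -> fired=" + rule.check(v));
        }
    }
}
```

In this model the alarm fires on the fourth sample, the third one above 1000 ms, and the following samples are suppressed while the silence countdown runs.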

Webhook

Step 1: Register the receiving endpoint in SkyWalking. The webhook configuration sits at the end of config/alarm-settings.yml, and each entry has the form http://{ip}:{port}/{uri}

 

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_resp_time_percentile_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_percentile
    op: ">"
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  database_access_resp_time_rule:
    metrics-name: database_access_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
  endpoint_relation_resp_time_rule:
    metrics-name: endpoint_relation_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
#  Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
#  Because the number of endpoint is much more than service and instance.
#
#  endpoint_avg_rule:
#    metrics-name: endpoint_avg
#    op: ">"
#    threshold: 1000
#    period: 10
#    count: 2
#    silence-period: 5
#    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/
- http://127.0.0.1:8068/alarm/receive

 

Step 2: Java POJO for the alarm payload

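import lombok.Data;

import java.util.List;

@Data
public class SwAlarmDTO {

    // Scope ID and scope name, e.g. SERVICE
    private int scopeId;
    private String scope;
    // Name of the entity in the target scope
    private String name;
    // ID of the scope entity; for relation scopes, the source entity ID
    private String id0;
    // For relation scopes, the destination entity ID; otherwise empty
    private String id1;
    // Name of the rule configured in alarm-settings.yml
    private String ruleName;
    // Alarm message text
    private String alarmMessage;
    private List<Tag> tags;
    // Alarm time in milliseconds since 1970-01-01 UTC
    private long startTime;
    // Both fields below mirror transient fields of SkyWalking's AlarmMessage;
    // they are probably absent from the webhook JSON and keep their defaults.
    private transient int period;
    private transient boolean onlyAsCondition;

    @Data
    public static class Tag {
        private String key;
        private String value;
    }
}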
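
Since startTime arrives as epoch milliseconds, it usually needs converting before it goes into a mail or log line. A minimal sketch using only java.time follows; the class name AlarmTimeFormat, the helper format and the sample timestamp are illustrative and not part of the original project.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper that renders SwAlarmDTO.startTime as a readable timestamp. */
final class AlarmTimeFormat {
    private static final DateTimeFormatter PATTERN =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Formats epoch milliseconds in the given time zone. */
    static String format(long startTimeMillis, ZoneId zone) {
        return PATTERN.withZone(zone).format(Instant.ofEpochMilli(startTimeMillis));
    }

    public static void main(String[] args) {
        long startTime = 1560524171000L; // sample value, not taken from a real alarm
        System.out.println(format(startTime, ZoneOffset.UTC));         // prints 2019-06-14 14:56:11
        System.out.println(format(startTime, ZoneId.systemDefault())); // depends on the local zone
    }
}
```

In the controller this could replace the raw dto.getStartTime() value when building the message text.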

Step 3: Controller layer:

import com.tulingxueyuan.dto.SwAlarmDTO;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.List;

@Slf4j
@RestController
@RequiredArgsConstructor
@RequestMapping("/alarm")
public class SwAlarmController {


    /**
     * Receives alarm notifications pushed by SkyWalking. The mail-sending code
     * below is commented out, so the alarms are currently only written to the log.
     *
     * SkyWalking delivers alarms via HTTP POST, so this must be a POST mapping.
     */
    @PostMapping("/receive")
    public void receive(@RequestBody List<SwAlarmDTO> alarmList) {
       /* Note: `from` and `sender` are not defined in this class and would
        // have to be injected before this block can be enabled.
        SimpleMailMessage message = new SimpleMailMessage();
        // Sender address
        message.setFrom(from);
        // Recipient address
        message.setTo(from);
        // Subject
        message.setSubject("SkyWalking alarm");
        String content = getContent(alarmList);
        // Mail body
        message.setText(content);
        sender.send(message);*/
        String content = getContent(alarmList);
        log.info("Received SkyWalking alarms:\n{}", content);
    }

    // Builds a plain-text summary with one block per alarm.
    private String getContent(List<SwAlarmDTO> alarmList) {
        StringBuilder sb = new StringBuilder();
        for (SwAlarmDTO dto : alarmList) {
            sb.append("scopeId: ").append(dto.getScopeId())
                    .append("\nscope: ").append(dto.getScope())
                    .append("\nentity name: ").append(dto.getName())
                    .append("\nentity ID (id0): ").append(dto.getId0())
                    .append("\nid1: ").append(dto.getId1())
                    .append("\nrule name: ").append(dto.getRuleName())
                    .append("\nalarm message: ").append(dto.getAlarmMessage())
                    .append("\nalarm time: ").append(dto.getStartTime())
                    .append("\ntags: ").append(dto.getTags())
                    .append("\n\n---------------\n\n");
        }

        return sb.toString();
    }
}
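
To try the endpoint without waiting for a real alarm, a hand-written payload can be posted to it. The sketch below uses the JDK 11+ HttpClient. The class name AlarmWebhookSmokeTest and the sample field values are made up for illustration; the field names follow SwAlarmDTO, and the URL is the one configured under webhooks in step 1.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

/** Hypothetical smoke test that posts one fake alarm to the webhook receiver. */
final class AlarmWebhookSmokeTest {
    // Hand-written sample; the values are illustrative, not captured from SkyWalking.
    static final String SAMPLE_PAYLOAD = "[{"
            + "\"scopeId\":1,"
            + "\"scope\":\"SERVICE\","
            + "\"name\":\"serviceA\","
            + "\"id0\":\"12\","
            + "\"id1\":\"\","
            + "\"ruleName\":\"service_resp_time_rule\","
            + "\"alarmMessage\":\"sample alarm message\","
            + "\"tags\":[{\"key\":\"level\",\"value\":\"WARNING\"}],"
            + "\"startTime\":1560524171000"
            + "}]";

    /** Builds a JSON POST request carrying the sample payload. */
    static HttpRequest buildRequest(String webhookUrl) {
        return HttpRequest.newBuilder(URI.create(webhookUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(SAMPLE_PAYLOAD, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Requires the Spring Boot application from step 3 to be running on this address.
        HttpRequest request = buildRequest("http://127.0.0.1:8068/alarm/receive");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode());
    }
}
```

If the receiver is wired up correctly, the request should return a success status and the alarm summary should appear in the application log.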

posted @ 2022-03-17 22:58  VNone