Quick Study: SkyWalking Alarm Feature

Date: 2022-07-22

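3.4 Alarm Feature

3.4.1 Introduction to the Alarm Feature

At regular intervals, SkyWalking evaluates the collected tracing data against the configured alarm rules (for example, service response time or service response time percentiles) and sends an alarm message when a threshold is reached. Alarm messages are delivered by calling a webhook endpoint. The endpoint is defined by the user, so developers can implement any notification channel inside it, such as email or SMS. Alarm information can also be viewed in RocketBot.

The default alarm rules are shown below. They are defined in the alarm-settings.yml file in the config folder of the SkyWalking installation directory: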

rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_p90_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_p90
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: 90% response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
#  Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
#  Because the number of endpoint is much more than service and instance.
#
#  endpoint_avg_rule:
#    metrics-name: endpoint_avg
#    op: ">"
#    threshold: 1000
#    period: 10
#    count: 2
#    silence-period: 5
#    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/

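The file above defines four default rules:

  1. The average service response time exceeded 1 second in 3 of the last 10 minutes.
  2. The service success rate was below 80% (threshold 8000) in 2 of the last 10 minutes.
  3. The 90th-percentile service response time exceeded 1 second in 3 of the last 10 minutes.
  4. The average service instance response time exceeded 1 second in 2 of the last 10 minutes.

The rule parameters are described below.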

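Parameter reference

Parameter        Meaning
metrics-name     Metric name defined in the OAL script
threshold        Threshold value, interpreted together with metrics-name and the comparison operator below
op               Comparison operator; >, < and = can be used
period           Length of the time window, in minutes, over which the metric is evaluated
count            Number of times the metric must match the condition within the window before an alarm message is sent
silence-period   How long identical alarm messages are suppressed after an alarm has been triggered
message          Content of the alarm message
include-names    List of services to which this rule applies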

The webhooks section configures the addresses that are called when an alarm is raised.
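To make the interplay of period, count and silence-period concrete, here is a minimal sketch of how one rule with op ">" could be evaluated. It is a simplified model for illustration only, not SkyWalking's actual implementation; the class AlarmRuleSketch and its method check are hypothetical names, and the sketch assumes one metric value is checked per minute.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, simplified model of a single alarm rule with op ">".
// This is NOT SkyWalking's implementation; all names are illustrative.
class AlarmRuleSketch {
    private final long threshold;     // e.g. 1000 (ms)
    private final int period;         // window length, in minutes
    private final int count;          // matches required inside the window
    private final int silencePeriod;  // checks to stay silent after an alarm
    private final Deque<Long> window = new ArrayDeque<>();
    private int silenceCountdown = 0;

    AlarmRuleSketch(long threshold, int period, int count, int silencePeriod) {
        this.threshold = threshold;
        this.period = period;
        this.count = count;
        this.silencePeriod = silencePeriod;
    }

    // Feed one metric value per minute; returns true when an alarm would be sent.
    boolean check(long value) {
        // Keep only the most recent `period` values.
        window.addLast(value);
        if (window.size() > period) {
            window.removeFirst();
        }
        // While silenced, keep recording values but never raise an alarm.
        if (silenceCountdown > 0) {
            silenceCountdown--;
            return false;
        }
        long matches = window.stream().filter(v -> v > threshold).count();
        if (matches >= count) {
            silenceCountdown = silencePeriod;
            return true;
        }
        return false;
    }
}
```

With the values of service_resp_time_rule (threshold 1000, period 10, count 3, silence-period 5) and a constant value of 1500, this model raises an alarm on the 3rd check, stays silent for the next five checks, and raises it again on the 9th.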

3.4.2 Test Code for the Alarm Feature

To test the alarm feature, create a project named skywalking_alarm and write the following endpoints.

AlarmController

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class AlarmController {

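    // Sleep for 1.5 seconds on every call to simulate a timeout that triggers the alarm
    @GetMapping("/timeout")
    public String timeout(){
        try {
            Thread.sleep(1500);
        } catch (InterruptedException e) {
            // Restore the interrupt flag instead of swallowing the interruption
            Thread.currentThread().interrupt();
        }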

        return "timeout";
    }
}

This endpoint simulates a timeout. After it has been called repeatedly, an alarm is generated.

WebHooks

import com.sf.saas.skywalking_alarm.pojo.AlarmMessage;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

import java.util.ArrayList;
import java.util.List;

@RestController
public class WebHooks {

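    // Most recent batch of alarms; volatile because it is written and read by different request threads
    private volatile List<AlarmMessage> lastList = new ArrayList<>();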

    @PostMapping("/webhook")
    public void  webhook(@RequestBody List<AlarmMessage> alarmMessageList){
        lastList = alarmMessageList;
    }

    @GetMapping("/show")
    public List<AlarmMessage> show(){
        return lastList;
    }
}

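When an alarm is raised, SkyWalking calls the webhook endpoint. The endpoint must accept POST requests and receive its parameter through @RequestBody. The payload format is: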

[{
	"scopeId": 1,
	"scope": "SERVICE",
	"name": "serviceA",
	"id0": 12,
	"id1": 0,
	"ruleName": "service_resp_time_rule",
	"alarmMessage": "alarmMessage xxxx",
	"startTime": 1560524171000
}, {
	"scopeId": 1,
	"scope": "SERVICE",
	"name": "serviceB",
	"id0": 23,
	"id1": 0,
	"ruleName": "service_resp_time_rule",
	"alarmMessage": "alarmMessage yyy",
	"startTime": 1560524171000
}]
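The webhook endpoint can be checked without waiting for a real alarm by posting a payload of this shape by hand. The following is a minimal sketch, assuming Java 11 or later (for java.net.http) and the address http://127.0.0.1:8089/webhook used in section 3.4.3; the class WebhookSmokeTest and its methods are hypothetical helpers, not part of the original project.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper that posts one sample alarm to the webhook endpoint.
class WebhookSmokeTest {

    // One-element array in the payload format shown above.
    static String samplePayload() {
        return "[{"
                + "\"scopeId\":1,\"scope\":\"SERVICE\",\"name\":\"serviceA\","
                + "\"id0\":12,\"id1\":0,\"ruleName\":\"service_resp_time_rule\","
                + "\"alarmMessage\":\"alarmMessage xxxx\",\"startTime\":1560524171000"
                + "}]";
    }

    // Builds a POST request with a JSON body, as @RequestBody expects.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(samplePayload()))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumed default address; pass another URL as the first argument if needed.
        String url = args.length > 0 ? args[0] : "http://127.0.0.1:8089/webhook";
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode());
    }
}
```

After running it against the started application, the /show endpoint should return the posted alarm.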

AlarmMessage

public class AlarmMessage {
    private int scopeId;
    private String name;
    private int id0;
    private int id1;
    // Alarm message text
    private String alarmMessage;
    // Time at which the alarm was raised
    private long startTime;

    public int getScopeId() {
        return scopeId;
    }

    public void setScopeId(int scopeId) {
        this.scopeId = scopeId;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getId0() {
        return id0;
    }

    public void setId0(int id0) {
        this.id0 = id0;
    }

    public int getId1() {
        return id1;
    }

    public void setId1(int id1) {
        this.id1 = id1;
    }

    public String getAlarmMessage() {
        return alarmMessage;
    }

    public void setAlarmMessage(String alarmMessage) {
        this.alarmMessage = alarmMessage;
    }

    public long getStartTime() {
        return startTime;
    }

    public void setStartTime(long startTime) {
        this.startTime = startTime;
    }

    @Override
    public String toString() {
        return "AlarmMessage{" +
                "scopeId=" + scopeId +
                ", name='" + name + '\'' +
                ", id0=" + id0 +
                ", id1=" + id1 +
                ", alarmMessage='" + alarmMessage + '\'' +
                ", startTime=" + startTime +
                '}';
    }
}

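This entity class receives the alarm information. It declares only some of the payload fields; scope and ruleName are not declared, so they are not captured.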

3.4.3 Deployment and Testing

First, edit the alarm rule configuration file and set the webhook address to:

webhooks: 
  - http://127.0.0.1:8089/webhook

Then restart SkyWalking.

1. Upload skywalking_alarm.jar to the /usr/local/skywalking directory.

2. Start the skywalking_alarm application and wait until it has started successfully.

java -javaagent:/usr/local/skywalking/apache-skywalking-apm-bin/agent/skywalking-agent.jar -Dskywalking.agent.service_name=skywalking_alarm -jar skywalking_alarm.jar

3. Call the endpoint repeatedly: http://<VM-IP>:8089/timeout

4. Keep calling until an alarm appears.

5. View the alarm information through the endpoint: http://<VM-IP>:8089/show
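The repeated calls in step 3 can be automated with a small client. The sketch below assumes Java 11 or later and the port 8089 used above; the class TimeoutCaller, its methods, and the figure of 200 calls are illustrative choices. Each call takes about 1.5 seconds, so 200 sequential calls keep the response time above the threshold for roughly five minutes.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical load generator for the /timeout endpoint.
class TimeoutCaller {

    // Builds the endpoint address for the given host (the VM's IP address).
    static URI timeoutUri(String host) {
        return URI.create("http://" + host + ":8089/timeout");
    }

    // Calls the endpoint sequentially and returns how many calls answered with HTTP 200.
    static int callRepeatedly(URI uri, int times) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        int ok = 0;
        for (int i = 0; i < times; i++) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() == 200) {
                ok++;
            }
        }
        return ok;
    }

    public static void main(String[] args) throws Exception {
        // Pass the VM's IP address as the first argument; defaults to localhost.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int ok = callRepeatedly(timeoutUri(host), 200);
        System.out.println(ok + " successful calls");
    }
}
```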

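The /show response confirms that the alarm information has been received. In production, the webhook endpoint can be connected to SMS, email, or similar platforms so that the responsible staff are notified as soon as an alarm is raised, which speeds up incident handling.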
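One way to structure such an integration is to let the webhook hand each alarm to a list of pluggable notifiers. The following is a minimal sketch under that assumption; AlarmNotifier and AlarmDispatcher are hypothetical names, the message layout is an arbitrary choice, and the actual email or SMS client calls are left to the notifier implementations.

```java
import java.time.Instant;
import java.util.List;

// Hypothetical notification channel, e.g. backed by an email or SMS client.
interface AlarmNotifier {
    void send(String text);
}

// Hypothetical dispatcher that the webhook method could call for every alarm.
class AlarmDispatcher {
    private final List<AlarmNotifier> notifiers;

    AlarmDispatcher(List<AlarmNotifier> notifiers) {
        this.notifiers = notifiers;
    }

    // Turns the name, alarmMessage and startTime fields into one readable line.
    static String format(String name, String alarmMessage, long startTime) {
        return "[SkyWalking] " + name + ": " + alarmMessage
                + " (since " + Instant.ofEpochMilli(startTime) + ")";
    }

    // Sends the alarm to every channel; one failing channel does not block the others.
    void dispatch(String name, String alarmMessage, long startTime) {
        String text = format(name, alarmMessage, startTime);
        for (AlarmNotifier notifier : notifiers) {
            try {
                notifier.send(text);
            } catch (RuntimeException e) {
                System.err.println("Notifier failed: " + e.getMessage());
            }
        }
    }
}
```

Catching the exception per notifier is a deliberate choice here: an outage of one channel should not prevent the alarm from reaching the remaining ones.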