Spring Boot 2 Metrics: Integrating Actuator with InfluxDB, with Grafana for Monitoring and Alerting

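There are by now countless open-source components for log collection, statistics, and monitoring, yet many people still just run tail -f on a log file. As container technology has matured, tail -f is no longer enough for logs and metrics; you may not even be able to log in to the machine the service is deployed on. Log collection and metrics aggregation have therefore become essential. Logs can be shipped to Elasticsearch (ES) with logstash or filebeat for searching. For metrics, Spring Boot provides the actuator component, which measures and collects CPU, memory, thread, request, and other indicators. This post gives a rough walkthrough of integrating InfluxDB to collect the data and Grafana to display it.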

Final dashboard template: https://github.com/Ryan-Miao/boot-metrics-exporter/blob/master/grafana/grafana-dashboard-template.json

The end result is the following set of dashboards:

Redis cache hit-rate statistics:

Statistics for individual important requests:

Alerts based on the health check:

Installing InfluxDB and Grafana

Install InfluxDB:

https://www.cnblogs.com/woshimrf/p/docker-influxdb.html

Install Grafana:

https://www.cnblogs.com/woshimrf/p/docker-grafana.html

Spring Boot configuration

You can use the prepackaged starter directly:

https://github.com/Ryan-Miao/boot-metrics-exporter

Or:

Add the dependencies

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
        </dependency>
        <dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-registry-influx</artifactId>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
        </dependency>

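Define MeterConfig to set common tags, such as an instance id, in one place:

import io.micrometer.core.instrument.MeterRegistry;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.UUID;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.stereotype.Component;

@Component
public class MeterConfig implements MeterRegistryCustomizer<MeterRegistry> {

    private static final Logger LOGGER = LoggerFactory.getLogger(MeterConfig.class);

    @Override
    public void customize(MeterRegistry registry) {
        try {
            // Prefer the host IP as the instance id so dashboards can tell instances apart.
            String hostAddress = InetAddress.getLocalHost().getHostAddress();
            LOGGER.debug("Setting metrics instance id to ip: {}", hostAddress);
            registry.config().commonTags("instance-id", hostAddress);
        } catch (UnknownHostException e) {
            // Fall back to a random UUID when the IP cannot be resolved.
            String uuid = UUID.randomUUID().toString();
            registry.config().commonTags("instance-id", uuid);
            LOGGER.error("Failed to resolve the instance ip; using a uuid as instance id: {}", uuid, e);
        }
    }
}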

Add the corresponding configuration:

management:
  metrics:
    export:
      influx:
        db: my-db
        uri: http://192.168.5.9:8086
        user-name: admin
        password: admin
        enabled: true
    web:
      server:
        auto-time-requests: true
    tags:
      app: ${spring.application.name}

Here the metrics are exported to InfluxDB; many other storage backends are available.
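To make "exporting to InfluxDB" a little more concrete, here is a minimal sketch of how a single gauge sample could be written in InfluxDB 1.x line protocol, using the app and instance-id tags configured above. The class LineProtocolSketch and its methods are hypothetical and JDK-only; Micrometer's InfluxDB registry does the real work, and the exact tags, field names, and timestamp precision it writes may differ from this illustration.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical helper: formats one gauge sample as an InfluxDB 1.x line-protocol record. */
final class LineProtocolSketch {

    /** Escapes the characters that are special inside tag keys and tag values. */
    static String escapeTag(String s) {
        return s.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ");
    }

    /**
     * Builds "measurement,tag=value,... value=<number> <timestamp>".
     * Assumes the measurement name contains no commas or spaces,
     * and that the timestamp is in nanoseconds (the line-protocol default).
     */
    static String gaugeLine(String measurement, Map<String, String> tags,
                            double value, long epochNanos) {
        // Sort tags by key so the output is deterministic.
        String tagPart = new TreeMap<>(tags).entrySet().stream()
                .map(e -> escapeTag(e.getKey()) + "=" + escapeTag(e.getValue()))
                .collect(Collectors.joining(","));
        String head = tagPart.isEmpty() ? measurement : measurement + "," + tagPart;
        return head + " value=" + value + " " + epochNanos;
    }
}
```

Because every instance pushes records like this on its own, the only network requirement is that the instance can reach the InfluxDB HTTP port.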

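Network configuration

Grafana and InfluxDB may be deployed in one VPC, such as a monitoring cluster, while the business systems to be monitored are spread across the VPCs of different business lines. Each business cluster therefore needs network access to InfluxDB's address and port.

Custom metrics

The health endpoint exposed by Spring Boot actuator only reports statuses such as UP/DOWN. I have not found a way to compare such a status against a threshold in Grafana, so I convert it to a number.

A custom MeterBinder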

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.MeterBinder;
import lombok.Data;

@Data
public class HealthMetrics implements MeterBinder {

    /**
     * 100: UP
     * 0: DOWN or UNKNOWN
     * volatile because the scheduler thread writes it and the metrics publisher reads it.
     */
    private volatile Integer health = 100;


    @Override
    public void bindTo(MeterRegistry registry) {
        Gauge.builder("health", () -> health)
                .register(registry);
    }
}

Update the status every 30 seconds:

public abstract class AbstractHealthCheckStatusSetter {
    private final HealthMetrics healthMetrics;

    protected AbstractHealthCheckStatusSetter(HealthMetrics healthMetrics) {
        this.healthMetrics = healthMetrics;
    }

    /**
     * Defines how the health status is derived; implementations update the value of HealthMetrics.health.
     */
    public abstract void setHealthStatus(HealthMetrics h);

    /**
     * Periodically refreshes the health metric.
     */
    @PostConstruct
    void doSet() {
        ScheduledExecutorService scheduledExecutorService = new ScheduledThreadPoolExecutor(1);
        scheduledExecutorService.scheduleWithFixedDelay(
                () -> setHealthStatus(healthMetrics), 30L, 30L, TimeUnit.SECONDS);
    }


}

The implementation class:

public class HealthCheckStatusSetter extends AbstractHealthCheckStatusSetter {
    private final HealthEndpoint healthEndpoint;

    public HealthCheckStatusSetter(HealthMetrics healthMetrics, HealthEndpoint healthEndpoint) {
        super(healthMetrics);
        this.healthEndpoint = healthEndpoint;
    }


    @Override
    public void setHealthStatus(HealthMetrics healthMetrics) {
        Health health = healthEndpoint.health();
        if (health != null) {
            Status status = health.getStatus();
            // Only UP counts as healthy; DOWN, UNKNOWN, and any other status map to 0.
            if ("UP".equals(status.getCode())) {
                healthMetrics.setHealth(100);
            } else {
                healthMetrics.setHealth(0);
            }
        }

    }
    

}

Register the beans:

    @Bean
    @ConditionalOnMissingBean
    public HealthMetrics healthMetrics() {
        return new HealthMetrics();
    }

    /**
     * This uses HealthEndpoint to determine the system's health. For other needs, extend AbstractHealthCheckStatusSetter and set health yourself.
     */
    @Bean
    @ConditionalOnMissingBean
    @ConditionalOnBean(HealthEndpoint.class)
    public AbstractHealthCheckStatusSetter healthCheckSchedule(HealthEndpoint healthEndpoint, HealthMetrics healthMetrics) {
        return new HealthCheckStatusSetter(healthMetrics, healthEndpoint);
    }

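Redis cache hit-rate statistics

The whole metrics setup is built on Spring Boot actuator, and actuator uses io.micrometer to do the measuring. Any additional metric can therefore be added as a custom Micrometer metric. For example, we commonly use Redis as a cache, so the cache hit rate is something we care about. You could write your own pair of counters to record it: hit +1 on a cache hit, miss +1 on a cache miss.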
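The hand-rolled counter idea can be sketched as follows. This is a JDK-only illustration under stated assumptions: CountingCache is a hypothetical class, an in-memory map stands in for Redis, and the two LongAdder fields stand in for what would be two Micrometer counters (for example tagged result=hit and result=miss) in a real application.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Function;

/** Hypothetical cache wrapper that counts hits and misses. */
final class CountingCache<K, V> {

    // Stand-in for the real cache (Redis in the article).
    private final Map<K, V> store = new ConcurrentHashMap<>();

    // Stand-ins for two metrics counters: one for hits, one for misses.
    final LongAdder hits = new LongAdder();
    final LongAdder misses = new LongAdder();

    /** Returns the cached value, or loads and caches it on a miss. */
    V get(K key, Function<K, V> loader) {
        V cached = store.get(key);
        if (cached != null) {
            hits.increment();      // hit +1
            return cached;
        }
        misses.increment();        // miss +1
        V loaded = loader.apply(key);
        if (loaded != null) {      // ConcurrentHashMap does not accept null values
            store.put(key, loaded);
        }
        return loaded;
    }
}
```

The hit rate is then hits / (hits + misses), computed at query time rather than stored, so it can be evaluated over any time window.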

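Alternatively, you can use Redisson directly.

We use RedissonCache to integrate with Spring Cache, in which case the cache hit metrics are already collected.

Definition of the basic cache metrics:

However, the results are stored row by row, with hits and misses in separate rows:

How can the hit rate be calculated from this?

hit-rate = sum(hit) / (sum(hit) + sum(miss))

I therefore merged the two series by hand: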

DROP CONTINUOUS QUERY "cq_cache_hit" ON "my-db"

DROP CONTINUOUS QUERY "cq_cache_miss" ON "my-db"

DROP MEASUREMENT "cache_hit_rate"

CREATE CONTINUOUS QUERY "cq_cache_hit" ON "my-db" RESAMPLE EVERY 10m BEGIN SELECT sum("value") AS "hit" INTO "cache_hit_rate" FROM "rp_30days"."cache_gets" WHERE ("result" = 'hit') GROUP BY time(10m), "app", "cache" fill(0) END

CREATE CONTINUOUS QUERY "cq_cache_miss" ON "my-db" RESAMPLE EVERY 10m BEGIN SELECT sum("value") AS "miss" INTO "cache_hit_rate" FROM "rp_30days"."cache_gets" WHERE ("result" = 'miss') GROUP BY time(10m), "app", "cache" fill(0) END
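Once hit and miss sit side by side in cache_hit_rate, the hit rate is simple arithmetic over the 10-minute buckets. The sketch below works through that arithmetic in plain Java; HitRateSketch is a hypothetical name, and the sample numbers are made up for illustration. It also shows one detail worth deciding up front: a bucket filled with 0 for both hit and miss has no defined hit rate.

```java
/** Hypothetical helper illustrating the hit-rate arithmetic. */
final class HitRateSketch {

    /** Hit rate of one bucket; NaN when the bucket saw no lookups at all. */
    static double hitRate(double hit, double miss) {
        double total = hit + miss;
        return total == 0 ? Double.NaN : hit / total;
    }

    /** Hit rate over a range of buckets: sum(hit) / (sum(hit) + sum(miss)). */
    static double hitRate(double[] hits, double[] misses) {
        double hitSum = 0;
        double missSum = 0;
        for (double h : hits) {
            hitSum += h;
        }
        for (double m : misses) {
            missSum += m;
        }
        return hitRate(hitSum, missSum);
    }
}
```

Summing first and dividing once weights every lookup equally. Averaging the per-bucket ratios instead would give a quiet bucket the same weight as a busy one, which is usually not what a hit-rate panel should show.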

Monitoring and alerting

Grafana provides an alerting feature that sends a notification when a queried metric crosses its threshold.

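InfluxDB or Prometheus?

For storing collected metrics, most tutorials use Prometheus, whose ecosystem is fairly complete. I chose InfluxDB purely because of container networking. Prometheus pulls data from each instance, so it must be allowed into the business networks, and I would have had to keep opening up network access. The networks of different k8s clusters are also not connected to each other, and I did not find a way to bridge them. With InfluxDB, the instances only need to push their data. Elasticsearch would be another option for the same reason.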

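InfluxDB has single-node limitations, as well as stability problems once the data volume grows. The data needs to be aggregated at sensible time intervals; for example, for queries spanning days or months, roll up the fine-grained statistics in advance.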
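Rolling up in advance means collapsing many fine-grained points into one point per coarse time bucket, so that a long-range query reads far fewer rows. In InfluxDB this would normally be done on the server with a continuous query, like the ones in the cache section; the plain-Java sketch below only illustrates the idea. RollupSketch and rollUp are hypothetical names, and summing is just one possible aggregate.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper illustrating a time-bucket roll-up. */
final class RollupSketch {

    /**
     * Sums samples (epoch millis -> value) into buckets of bucketMillis.
     * Each result key is the start time of its bucket.
     */
    static TreeMap<Long, Double> rollUp(Map<Long, Double> samples, long bucketMillis) {
        TreeMap<Long, Double> buckets = new TreeMap<>();
        for (Map.Entry<Long, Double> e : samples.entrySet()) {
            // Align the timestamp down to the start of its bucket.
            long bucketStart = Math.floorDiv(e.getKey(), bucketMillis) * bucketMillis;
            buckets.merge(bucketStart, e.getValue(), Double::sum);
        }
        return buckets;
    }
}
```

For example, rolling 10-second samples up into 1-hour buckets reduces a month of data from roughly 259,200 points per series to 720.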

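Another option, said to scale without limit, is OpenTSDB. I have not looked into it yet.

Problems you will run into

The current demo runs a single InfluxDB node and is extremely fragile: a query over a somewhat longer time range brings it down. It is only suitable as a demo, or for simple real-time views such as the last 15 minutes. For long-range aggregation over several months or a year, the only option is to pre-aggregate into coarse-grained summaries ahead of time.

References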

https://github.com/OpenTSDB/opentsdb