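Hystrix: Service Degradation and Circuit Breaking

Problems Facing Distributed Systems

Applications in a complex distributed architecture have dozens of dependencies, and each of those dependencies will inevitably fail at some point.

The details follow.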

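Service Avalanche

A distributed system usually has many layers of service calls. Because of network problems or faults in the service itself, a service generally cannot guarantee 100% availability. When one service has a problem, the threads calling it block. If a large number of requests arrive at that moment, many threads end up blocked and waiting, and the calling service goes down too.

Suppose microservice A calls microservices B and C, and B and C in turn call other microservices. This is known as "fan-out". If one microservice on the fan-out path responds too slowly or is unavailable, calls to microservice A tie up more and more system resources until the system crashes. This is the "avalanche effect" of service failures.

For a high-traffic application, a single backend dependency can saturate all resources on all servers within seconds. Worse than outright failure, these applications can also increase latency between services, which backs up queues, threads, and other system resources and triggers more cascading failures across the whole system. All of this means failures and latency must be isolated and managed so that the failure of a single dependency cannot take down the entire application or system.

So, typically, when one instance of a module fails, the module keeps receiving traffic, and the faulty module also calls other modules. The result is a cascading failure, also called an avalanche.

To stop an avalanche from spreading, we need fault tolerance: measures that keep a service from being dragged down by an unreliable teammate.

Common fault-tolerance techniques: isolation, timeouts, rate limiting, circuit breaking, and degradation.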

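Hystrix

Hystrix is an open-source library for handling latency and fault tolerance in distributed systems, where many dependency calls will inevitably fail, for example through timeouts or exceptions.

Hystrix ensures that a problem in one dependency does not bring down the whole service. This avoids cascading failures and improves the resilience of the distributed system.

A "circuit breaker" is itself a kind of switch. When a service unit fails, the breaker's fault monitoring (similar to a blown fuse) returns an expected, manageable fallback response to the caller, instead of making it wait for a long time or throwing an exception the caller cannot handle. This keeps the caller's threads from being held needlessly for long periods, and so keeps the failure from spreading through the distributed system into an avalanche.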
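The idea of returning a fallback instead of waiting can be sketched in plain Java without Hystrix. This is only an illustration, not how Hystrix is implemented: the class name TimeoutFallbackDemo and the helper callWithFallback are hypothetical, and the call simply runs on a worker thread with a deadline.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

class TimeoutFallbackDemo {

    // Hypothetical helper: runs the call on a worker thread and returns the
    // fallback if the call throws or does not finish within timeoutMs.
    static String callWithFallback(Supplier<String> call, long timeoutMs, String fallback) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Callable<String> task = call::get;
        Future<String> future = pool.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true);              // stop waiting on the slow or failed call
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } finally {
            pool.shutdownNow();               // release the worker thread
        }
    }

    // Simulates a dependency that takes 3 seconds to answer.
    static String slowCall() {
        try {
            Thread.sleep(3000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "slow result";
    }

    public static void main(String[] args) {
        System.out.println(callWithFallback(() -> "fast result", 1000, "fallback"));
        System.out.println(callWithFallback(TimeoutFallbackDemo::slowCall, 1000, "fallback"));
    }
}
```

The caller's thread is held for at most the timeout, which is the property the paragraph above describes.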

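Note: Hystrix is no longer being updated, and later posts will use Alibaba's Sentinel instead, but Hystrix's ideas and design are still worth learning.

Preparation

To study Hystrix, we first set up the service provider on port 8001.

Create a new module, cloud-provider-hystrix-payment8001.

Dependencies: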

<dependencies>
    <!--hystrix-->
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-netflix-hystrix</artifactId>
    </dependency>
    <!--eureka client-->
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-netflix-eureka-client</artifactId>
    </dependency>
    <!--web-->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
    <!-- shared custom API package, which provides the Payment entity -->
    <dependency>
        <groupId>com.xn2001.springcloud</groupId>
        <artifactId>cloud-api-commons</artifactId>
        <version>1.0-SNAPSHOT</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-devtools</artifactId>
        <scope>runtime</scope>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
</dependencies>

application.yml

To keep things quick, we use a single-node Eureka here.

server:
  port: 8001

spring:
  application:
    name: cloud-provider-hystrix-payment

eureka:
  client:
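    # Whether to register this service with the Eureka server; defaults to true
    register-with-eureka: true
    # Whether to fetch the existing registry from the Eureka server; defaults to true. It does not matter for a single node, but a cluster must set it to true for Ribbon load balancing to work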
    fetch-registry: true
    service-url:
      defaultZone: http://eureka7001.com:7001/eureka
#      defaultZone: http://eureka7001.com:7001/eureka/,http://eureka7002.com:7002/eureka/

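Main application class

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.discovery.EnableDiscoveryClient;

@SpringBootApplication
@EnableDiscoveryClient
public class PaymentHystrixMain8001 {
    public static void main(String[] args) {
        SpringApplication.run(PaymentHystrixMain8001.class,args);
    }
}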

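Service layer:

We skip the interface and write the implementation class directly.

import org.springframework.stereotype.Service;
import java.util.concurrent.TimeUnit;

@Service
public class PaymentService {

    // Returns immediately; this is the healthy endpoint.
    public String paymentInfoOK(Integer id){
        return "当前线程: "+Thread.currentThread().getName()+"paymentInfo_OK,id："+id+"\t"+"O(∩_∩)O哈哈~";
    }

    // Sleeps for 3 seconds to simulate a slow dependency.
    public String paymentInfoTimeOut(Integer id){
        int timeout=3;
        try {
            TimeUnit.SECONDS.sleep(timeout);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return "线程池："+Thread.currentThread().getName()+"   paymentInfo_Timeout,id："+id+"\t"+"┭┮﹏┭┮呜呜~"+"  耗时(秒)："+timeout;
    }
}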

Controller layer:

package com.xn2001.springcloud.controller;
import com.xn2001.springcloud.service.PaymentService;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;
import javax.annotation.Resource;

/**
 * Created by 乐心湖 on 2020/5/21 0:03
 */
@RestController
@Slf4j
public class PaymentHystrixController {

    @Resource
    private PaymentService paymentService;
    @Value("${server.port}")
    private String serverPort;

    @GetMapping(value = "/payment/hystrix/ok/{id}")
    public String paymentInfoOK(@PathVariable("id") Integer id){
        String result = paymentService.paymentInfoOK(id);
        log.info("*****result："+result);
        return result;
    }

    @GetMapping(value = "/payment/hystrix/timeout/{id}")
    public String paymentInfoTimeOut(@PathVariable("id") Integer id){
        String result = paymentService.paymentInfoTimeOut(id);
        log.info("*****result："+result);
        return result;
    }
}

Start the application.

Visit: http://localhost:8001/payment/hystrix/ok/1

http://localhost:8001/payment/hystrix/timeout/2

The second one takes 3 seconds to respond.

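Stress Testing

This requires JMeter, available from the official download page.

After downloading and extracting it, go to the bin directory and double-click jmeter.bat to start it.

Note: if you would rather use the Chinese interface than the English one, edit the jmeter.properties file in the bin directory

and add language=zh_CN, around line 38.

Now let's run the stress test.

Check the results:

The 20,000 threads are all hitting http://localhost:8001/payment/hystrix/timeout/2

Yet a request to http://localhost:8001/payment/hystrix/ok/2 at the same time is also affected and no longer loads instantly. (If you cannot see the effect, push the thread count up to 200,000.)

In other words, both endpoints live in the same microservice. With timeout under heavy load, the server is busy handling those 20,000 threads, so the ok path gets dragged down with it.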
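One plausible way to picture this effect is a shared worker pool: slow requests occupy every worker, so a fast request has to queue behind them. The sketch below is a simplified model under that assumption, not a measurement of the real server. The class SharedPoolDemo, the pool size of 2, and the 200 ms task length are all made up for illustration. It also shows why isolation, one of the techniques listed earlier, helps: a separate pool keeps the fast call fast.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SharedPoolDemo {

    // Submits slowTasks slow jobs to slowPool, then one trivial job to fastPool,
    // and returns how many milliseconds the trivial job took to complete.
    static long fastCallLatencyMs(ExecutorService slowPool, ExecutorService fastPool, int slowTasks) {
        for (int i = 0; i < slowTasks; i++) {
            slowPool.submit(() -> sleep(200));
        }
        long start = System.nanoTime();
        Future<?> fast = fastPool.submit(() -> { });
        try {
            fast.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Fast and slow work share one pool of 2 threads.
    static long sharedLatency() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            return fastCallLatencyMs(pool, pool, 4);
        } finally {
            pool.shutdownNow();
        }
    }

    // The fast call gets its own pool, so slow work cannot starve it.
    static long isolatedLatency() {
        ExecutorService slowPool = Executors.newFixedThreadPool(2);
        ExecutorService fastPool = Executors.newFixedThreadPool(1);
        try {
            return fastCallLatencyMs(slowPool, fastPool, 4);
        } finally {
            slowPool.shutdownNow();
            fastPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("shared pool:   " + sharedLatency() + " ms");
        System.out.println("isolated pool: " + isolatedLatency() + " ms");
    }
}
```

With the shared pool, the trivial job waits for two rounds of slow jobs before it gets a thread; with its own pool it finishes almost immediately.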

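Worth mentioning: Hystrix can be used on both the provider side and the consumer side, but it is usually applied on the consumer (port 80).

Now add the service consumer module on port 80:

cloud-consumer-feign-hystrix-order80

Dependencies: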

<dependencies>

    <!--    openfeign    -->
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-openfeign</artifactId>
    </dependency>
    <!--   hystrix     -->
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-netflix-hystrix</artifactId>
    </dependency>
    <!--   eureka  client    -->
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-netflix-eureka-client</artifactId>
    </dependency>
    <!--   shared custom API package     -->
    <dependency>
        <groupId>com.xn2001.springcloud</groupId>
        <artifactId>cloud-api-commons</artifactId>
        <version>1.0-SNAPSHOT</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-devtools</artifactId>
        <scope>runtime</scope>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>

</dependencies>

Application class

package com.xn2001.springcloud;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.openfeign.EnableFeignClients;

/**
 * Created by 乐心湖 on 2020/5/21 11:32
 */
@SpringBootApplication
@EnableFeignClients
public class OrderHystrixMain80 {
    public static void main(String[] args) {
        SpringApplication.run(OrderHystrixMain80.class,args);
    }
}

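Service call interface layer

If this layer looks confusing, think of it this way: these methods use Feign to call the 8001 endpoints, fetching data directly by URI path.

I covered the details in my post on Feign service calls.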

package com.xn2001.springcloud.service;

import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;

/**
 * Created by 乐心湖 on 2020/5/21 11:34
 */

@FeignClient(value = "CLOUD-PROVIDER-HYSTRIX-PAYMENT")
public interface PaymentHystrixService {

    @GetMapping("/payment/hystrix/ok/{id}")
    String paymentInfoOK(@PathVariable("id") Integer id);

    @GetMapping("/payment/hystrix/timeout/{id}")
    String paymentInfoTimeOut(@PathVariable("id") Integer id);

}

Controller layer

package com.xn2001.springcloud.controller;

import com.xn2001.springcloud.service.PaymentHystrixService;
import lombok.extern.slf4j.Slf4j;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

import javax.annotation.Resource;

/**
 * Created by 乐心湖 on 2020/5/21 11:35
 */
@RestController
@Slf4j
public class OrderHystrixController {

    @Resource
    private PaymentHystrixService paymentHystrixService;

    @GetMapping(value ="/consumer/payment/hystrix/ok/{id}")
    public String paymentInfoOK(@PathVariable("id") Integer id){
        String result = paymentHystrixService.paymentInfoOK(id);
        return  result;
    }

    @GetMapping(value ="/consumer/payment/hystrix/timeout/{id}")
    public String paymentInfoTimeOut(@PathVariable("id") Integer id){
        String result = paymentHystrixService.paymentInfoTimeOut(id);
        return result;
    }
}

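Service Degradation

Why do we need service degradation? When something abnormal happens, such as an exception or a timeout, we call an alternative method to protect the microservice.

Every failure should have a fallback; that is what makes the overall design sound.

For example:

The remote service (8001) times out: the caller (80) must not hang waiting forever, so it needs degradation.

The remote service (8001) is down: the caller (80) must not hang waiting forever, so it needs degradation.

The remote service (8001) is fine, but the caller (80) itself fails or has stricter requirements (its own wait limit is shorter than the provider's response time): the caller also needs its own degradation.

Next I'll show how to use global service degradation.

In this example the degradation handling lives in the client on port 80. You can also apply it on 8001; it is up to you and your business requirements.

Our Feign service interface (PaymentHystrixService) has two methods. We want both to degrade gracefully: when either one fails, its fallback method is called instead.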

@FeignClient(value = "CLOUD-PROVIDER-HYSTRIX-PAYMENT")
public interface PaymentHystrixService {

    @GetMapping("/payment/hystrix/ok/{id}")
    String paymentInfoOK(@PathVariable("id") Integer id);

    @GetMapping("/payment/hystrix/timeout/{id}")
    String paymentInfoTimeOut(@PathVariable("id") Integer id);

}

Create a new class, PaymentFallbackService, that implements this interface and handles failures in one place.

package com.xn2001.springcloud.service;

import org.springframework.stereotype.Component;

/**
 * Created by 乐心湖 on 2020/5/21 18:47
 */
@Component
public class PaymentFallbackService implements PaymentHystrixService{
    @Override
    public String paymentInfoOK(Integer id) {
        return "--------paymentFallbackService fall back paymentInfoOK ┭┮﹏┭┮";
    }

    @Override
    public String paymentInfoTimeOut(Integer id) {
        return "--------paymentFallbackService fall back paymentInfoTimeOut ┭┮﹏┭┮";
    }
}

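How do we tie the two together? Implementing the interface alone is not enough.

Add one more attribute to the @FeignClient annotation on the PaymentHystrixService interface:

fallback = PaymentFallbackService.class (the class that handles degradation)

The result: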

package com.xn2001.springcloud.service;

import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;

/**
 * Created by 乐心湖 on 2020/5/21 11:34
 */

@FeignClient(value = "CLOUD-PROVIDER-HYSTRIX-PAYMENT",fallback = PaymentFallbackService.class)
public interface PaymentHystrixService {

    @GetMapping("/payment/hystrix/ok/{id}")
    String paymentInfoOK(@PathVariable("id") Integer id);

    @GetMapping("/payment/hystrix/timeout/{id}")
    String paymentInfoTimeOut(@PathVariable("id") Integer id);

}

Finally, enable it in application.yml:

feign:
  hystrix:
    enabled: true

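Add the @EnableHystrix annotation to the application class.

Now visit:

http://localhost/consumer/payment/hystrix/ok/5 and http://localhost/consumer/payment/hystrix/timeout/6

As you would expect, the former succeeds, so the output is:

当前线程: http-nio-8001-exec-1paymentInfo_OK,id：5 O(∩_∩)O哈哈~

The latter makes the thread wait 3 seconds, but with no extra configuration Feign reports a timeout after 1 second by default (see my post on Feign service calls if this is unfamiliar). Returning an error page to the user is never a good experience, so service degradation calls the fallback method we just wrote, which outputs:

--------paymentFallbackService fall back paymentInfoTimeOut ┭┮﹏┭┮

This way, when the server is unavailable, the client still gets a message instead of hanging and exhausting the server.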
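Conceptually, a fallback class that implements the same interface can be wired in with a dynamic proxy that delegates to the real client and switches to the fallback when the call throws. The sketch below only illustrates that idea and makes no claim about how Feign or Hystrix implement it internally. FallbackProxyDemo, PaymentClient, and withFallback are hypothetical names, and this version reacts only to exceptions, not to timeouts.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

class FallbackProxyDemo {

    // Hypothetical stand-in for a Feign client interface.
    interface PaymentClient {
        String paymentInfo(int id);
    }

    // Wraps primary in a proxy that calls the same method on fallback
    // whenever the primary implementation throws an exception.
    @SuppressWarnings("unchecked")
    static <T> T withFallback(Class<T> type, T primary, T fallback) {
        return (T) Proxy.newProxyInstance(
                type.getClassLoader(),
                new Class<?>[] {type},
                (proxy, method, args) -> {
                    try {
                        return method.invoke(primary, args);
                    } catch (InvocationTargetException e) {
                        // The primary call failed: degrade to the fallback.
                        return method.invoke(fallback, args);
                    }
                });
    }

    public static void main(String[] args) {
        PaymentClient failing = id -> { throw new IllegalStateException("provider down"); };
        PaymentClient fallback = id -> "fallback for id " + id;
        PaymentClient client = withFallback(PaymentClient.class, failing, fallback);
        System.out.println(client.paymentInfo(6));
    }
}
```

The controller keeps calling the interface as usual; whether it gets the real result or the fallback is decided behind the interface, which is why the controller code did not have to change.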

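Besides global degradation, you can also configure it per method.

Giving every method its own fallback is technically possible but, in practice, a terrible idea.

Apart from a few critical core operations that deserve dedicated fallbacks, everything else can go through the shared global handler.

Extra

There is another kind of global degradation: specifying one shared fallback method. It takes precedence over the implementation-class approach above.

Let's test it directly in the controller layer.

Add a method: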

public String paymentGlobalFallbackMethod(){
    return "Global异常处理信息,请稍后再试: orz~";
}

Add the annotation; its defaultFallback attribute takes the name of your method:

@DefaultProperties(defaultFallback = "paymentGlobalFallbackMethod")

Remove fallback = PaymentFallbackService.class from the interface (PaymentHystrixService).

Add the @HystrixCommand annotation to the method that waits 3 seconds.

The finished class looks like this:

package com.xn2001.springcloud.controller;

import com.netflix.hystrix.contrib.javanica.annotation.DefaultProperties;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.xn2001.springcloud.service.PaymentHystrixService;
import lombok.extern.slf4j.Slf4j;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;
import javax.annotation.Resource;

/**
 * Created by 乐心湖 on 2020/5/21 11:35
 */
@RestController
@Slf4j
@DefaultProperties(defaultFallback = "paymentGlobalFallbackMethod")
public class OrderHystrixController {

    @Resource
    private PaymentHystrixService paymentHystrixService;

    @GetMapping(value ="/consumer/payment/hystrix/ok/{id}")
    public String paymentInfoOK(@PathVariable("id") Integer id){
        String result = paymentHystrixService.paymentInfoOK(id);
        return  result;
    }

    @GetMapping(value ="/consumer/payment/hystrix/timeout/{id}")
    @HystrixCommand
    public String paymentInfoTimeOut(@PathVariable("id") Integer id){
        String result = paymentHystrixService.paymentInfoTimeOut(id);
        return result;
    }

    public String paymentGlobalFallbackMethod(){
        return "Global异常处理信息,请稍后再试: orz~";
    }

}

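Start it and visit http://localhost/consumer/payment/hystrix/timeout/6

From here you can try other scenarios yourself, for example shutting down 8001 to simulate a dead service and then checking how requests are handled.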
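The relationship between a shared default fallback and a method-specific one can be sketched as a simple lookup: use the specific fallback if one is configured, otherwise fall back to the class-wide default. The helper below is hypothetical (FallbackResolutionDemo and run are not Hystrix APIs), and the precedence it encodes is my understanding of how fallbackMethod and defaultFallback interact, so treat it as an assumption.

```java
import java.util.function.Supplier;

class FallbackResolutionDemo {

    // Hypothetical helper: runs the call; on failure it prefers the
    // method-specific fallback and otherwise uses the shared default.
    static String run(Supplier<String> call,
                      Supplier<String> specificFallback,
                      Supplier<String> defaultFallback) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            Supplier<String> chosen = (specificFallback != null) ? specificFallback : defaultFallback;
            return chosen.get();
        }
    }

    public static void main(String[] args) {
        Supplier<String> failing = () -> { throw new IllegalStateException("timeout"); };
        // No specific fallback configured: the shared default answers.
        System.out.println(run(failing, null, () -> "global fallback"));
        // A specific fallback is configured: it wins over the default.
        System.out.println(run(failing, () -> "specific fallback", () -> "global fallback"));
    }
}
```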

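Circuit Breaking

What is it? Don't worry: it builds on service degradation, but it applies to the command as a whole rather than to a single failing call. Once enough calls fail, every call goes straight to the fallback for a while. Let's try it and see. (Make sure you understand service degradation before reading on.)

To avoid starting yet another microservice (my computer is struggling as it is), we'll keep it simple and put the circuit breaker in the 8001 provider.

Add a new method to PaymentService (I explain how it works further down):

When id < 0 the method throws an exception. Because a fallback is configured, the fallback method is called instead.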

    // Assumed imports: com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand,
    // com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty,
    // and IdUtil (presumably Hutool's cn.hutool.core.util.IdUtil).
    @HystrixCommand(fallbackMethod="paymentCircuitBreakerFallback", commandProperties={
            @HystrixProperty(name = "circuitBreaker.enabled" ,value = "true"), // enable the circuit breaker
            @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold",value = "10"), // request volume threshold
            @HystrixProperty(name = "circuitBreaker.sleepWindowInMilliseconds",value = "10000"), // sleep window
            @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage",value = "60") // error percentage threshold
            })
    public String paymentCircuitBreaker(Integer id){
        // A negative id simulates a failing call.
        if (id<0){
            throw new RuntimeException();
        }
        String serialNumber = IdUtil.simpleUUID();
        return Thread.currentThread().getName()+"\t "+"调用成功,流水号: "+serialNumber;
    }

    // Fallback method; it takes the same parameter as the command method.
    public String paymentCircuitBreakerFallback(Integer id){
        return "id不能为负数,请稍后再试~ id: "+ id;
    }

Controller endpoint:

@GetMapping("/payment/circuit/{id}")
public String paymentCircuitBreaker(@PathVariable("id") Integer id){
    String circuitBreaker = paymentService.paymentCircuitBreaker(id);
    log.info("******result: "+circuitBreaker);
    return circuitBreaker;
}

Note: check that the application class has the @EnableHystrix annotation.

Three annotations in total:

@SpringBootApplication
@EnableDiscoveryClient
@EnableHystrix

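Start the Eureka registry and the 8001 service provider.

Visit http://localhost:8001/payment/circuit/3 and http://localhost:8001/payment/circuit/-3 to check that both respond as expected.

Now test with the stress-testing tool introduced above.

We send 11 threads at the failing endpoint and see what happens.

Now visit http://localhost:8001/payment/circuit/3 and you will find that the result is, surprisingly, the fallback message: id不能为负数,请稍后再试~ id: 3 ("id must not be negative, please try again later"). This is the circuit breaker at work, much like a fuse. Once the failures on a route exceed the configured limits, the breaker trips. In our test all 11 threads hit the failing endpoint, so there were 11 requests against a request threshold of 10, and a 100% error rate against a 60% threshold. Even better, after 10 seconds (the sleep window set in our code) it recovers automatically, because we are no longer hammering the failing endpoint. The full workflow is summarized below.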
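The arithmetic behind this test can be written as a small check. This is a sketch of the decision as I understand it (both thresholds treated as "at or above"), not Hystrix source code; TripCheckDemo and shouldTrip are hypothetical names, and the numbers are the ones configured above (request threshold 10, error threshold 60%).

```java
class TripCheckDemo {

    // Returns true when the window has enough requests AND the share of
    // failed requests reaches the error threshold.
    static boolean shouldTrip(int totalRequests, int failedRequests,
                              int requestVolumeThreshold, int errorThresholdPercentage) {
        if (totalRequests == 0 || totalRequests < requestVolumeThreshold) {
            return false; // too few requests in the window to judge
        }
        int errorPercentage = failedRequests * 100 / totalRequests;
        return errorPercentage >= errorThresholdPercentage;
    }

    public static void main(String[] args) {
        // 11 requests, all failed: 11 >= 10 and 100% >= 60%, so the breaker trips.
        System.out.println(shouldTrip(11, 11, 10, 60));
        // 9 requests, all failed: below the request threshold, so no trip.
        System.out.println(shouldTrip(9, 9, 10, 60));
    }
}
```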

Workflow

The official documentation on GitHub describes it like this:

Circuit Breaker

The following diagram shows how a HystrixCommand or HystrixObservableCommand interacts with a HystrixCircuitBreaker and its flow of logic and decision-making, including how the counters behave in the circuit breaker.

The precise way that the circuit opening and closing occurs is as follows:

1. Assuming the volume across a circuit meets a certain threshold (HystrixCommandProperties.circuitBreakerRequestVolumeThreshold())...
2. And assuming that the error percentage exceeds the threshold error percentage (HystrixCommandProperties.circuitBreakerErrorThresholdPercentage())...
3. Then the circuit-breaker transitions from CLOSED to OPEN.
4. While it is open, it short-circuits all requests made against that circuit-breaker.
5. After some amount of time (HystrixCommandProperties.circuitBreakerSleepWindowInMilliseconds()), the next single request is let through (this is the HALF-OPEN state). If the request fails, the circuit-breaker returns to the OPEN state for the duration of the sleep window. If the request succeeds, the circuit-breaker transitions to CLOSED and the logic in 1. takes over again.

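My Understanding

This is my own summary of the circuit breaker workflow, based on the official description and a video course.

Snapshot window: to decide whether to open, the breaker needs statistics on requests and errors, and the time range it counts over is the snapshot window. The default is the most recent 10 seconds.
Request volume threshold: within the snapshot window, the request count must reach this threshold before the breaker can trip at all. The default is 20, which means that if the Hystrix command is called fewer than 20 times in 10 seconds (the snapshot window), the breaker will not open even if every request times out or fails for some other reason.
Error percentage threshold: once the request count within the snapshot window meets the volume threshold, say 30 calls (above the default of 20), and 15 of those 30 calls time out, the error percentage is 50%. That reaches the default 50% threshold, so the breaker opens.
When the breaker opens and the main logic is cut off, Hystrix starts a sleep window (5 seconds by default). During this window the fallback logic temporarily stands in for the main logic. When the sleep window expires, the breaker enters the half-open state and lets one request through to the original main logic. If that request returns normally, the breaker closes again and the main logic is restored. If it still fails, the breaker stays open and the sleep window starts over.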
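The workflow above can be condensed into a small state machine. The sketch below is a deliberately simplified model and not the Hystrix implementation: SimpleCircuitBreaker is a hypothetical class, its counters are plain totals rather than a rolling snapshot window, and time comes from an injected clock so the behaviour can be checked without waiting.

```java
import java.util.function.LongSupplier;

class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int requestVolumeThreshold;
    private final int errorThresholdPercentage;
    private final long sleepWindowMs;
    private final LongSupplier clock;   // current time in milliseconds (injectable)

    private State state = State.CLOSED;
    private int total;
    private int failures;
    private long openedAt;

    SimpleCircuitBreaker(int requestVolumeThreshold, int errorThresholdPercentage,
                         long sleepWindowMs, LongSupplier clock) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.errorThresholdPercentage = errorThresholdPercentage;
        this.sleepWindowMs = sleepWindowMs;
        this.clock = clock;
    }

    // True: run the main logic. False: go straight to the fallback.
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= sleepWindowMs) {
            state = State.HALF_OPEN;    // sleep window over: let one trial request through
            return true;
        }
        return state == State.CLOSED;   // OPEN, or HALF_OPEN with a trial already in flight
    }

    synchronized void recordSuccess() {
        if (state == State.HALF_OPEN) { // trial succeeded: close and start counting afresh
            state = State.CLOSED;
            total = 0;
            failures = 0;
            return;
        }
        total++;
    }

    synchronized void recordFailure() {
        if (state == State.HALF_OPEN) { // trial failed: reopen and restart the sleep window
            trip();
            return;
        }
        total++;
        failures++;
        if (total >= requestVolumeThreshold
                && failures * 100 / total >= errorThresholdPercentage) {
            trip();
        }
    }

    private void trip() {
        state = State.OPEN;
        openedAt = clock.getAsLong();
    }

    synchronized State state() {
        return state;
    }
}
```

A caller would check allowRequest() before each call, run the main logic or the fallback accordingly, and report the outcome with recordSuccess() or recordFailure(). With the settings from the test above (10, 60, 10000), ten straight failures open the breaker, and the first request after the 10-second sleep window decides whether it closes again.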

Worth mentioning: to see these defaults and parameters, press Shift twice in IDEA, search for HystrixCommandProperties, and download the sources to read the comments.