Spring Cloud service fault tolerance - Hystrix

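1. Why a circuit breaker is needed

In a microservice architecture, one request usually involves calls across several services. A failure in a base service at the bottom of the call chain can cascade upward and make the whole system unavailable. This is known as the service avalanche effect: the unavailability of a "service provider" makes its "service consumers" unavailable, and the affected area keeps growing. Everyone has run into an HTTP Connection Timeout exception during development. That is the same idea as a circuit breaker in miniature: when a connection attempt cannot get through, it is cut off at the timeout and an exception is thrown instead of waiting forever.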

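2. A simple circuit breaker

Add the pom dependencies (the Hystrix starter supplies @HystrixCommand and @EnableCircuitBreaker; Ribbon backs the @LoadBalanced RestTemplate):

        <dependency>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-starter-netflix-hystrix</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-starter-netflix-ribbon</artifactId>
        </dependency>

Create a hystrix-server project and annotate the entry class with @SpringCloudApplication: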

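import org.springframework.boot.SpringApplication;
import org.springframework.cloud.client.SpringCloudApplication;
import org.springframework.cloud.client.loadbalancer.LoadBalanced;
import org.springframework.context.annotation.Bean;
import org.springframework.web.client.RestTemplate;

@SpringCloudApplication
public class HystrixApplication {
    public static void main(String[] args) {
        SpringApplication.run(HystrixApplication.class, args);
    }

    /**
     * Creates the RestTemplate; @LoadBalanced turns on client-side load balancing.
     */
    @Bean
    @LoadBalanced
    public RestTemplate restTemplate() {
        return new RestTemplate();
    }
}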


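@SpringCloudApplication combines three annotations: @SpringBootApplication, @EnableDiscoveryClient and @EnableCircuitBreaker. In other words, a standard Spring Cloud application is expected to include service discovery and a circuit breaker: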

@Target({ElementType.TYPE})
@Retention(RetentionPolicy.RUNTIME)
@Documented
@Inherited
@SpringBootApplication
@EnableDiscoveryClient
@EnableCircuitBreaker
public @interface SpringCloudApplication {
}

Add a circuit breaker to the UserService method with @HystrixCommand(fallbackMethod = "getUserFallBack"):

@Service
public class UserService {
    @Autowired
    private RestTemplate restTemplate;

    @HystrixCommand(fallbackMethod = "getUserFallBack")
    public UserDto getUser() {
        long start = System.currentTimeMillis();
        try {
            Thread.sleep(new Random().nextInt(5000));
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        ResponseEntity<UserDto> responseEntity = restTemplate.getForEntity("http://service-1/getUser/{1}", UserDto.class, 2000);
        long end = System.currentTimeMillis();
        System.out.println("cost time : " + (end - start));
        return responseEntity.getBody();
    }

    public UserDto getUserFallBack() {
        UserDto userDto = new UserDto();
        userDto.setUserId(9999L);
        userDto.setName("fall back");
        return userDto;
    }
}


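UserService#getUser sleeps for a random time of up to 5000 ms, which often exceeds Hystrix's default command timeout of 1000 ms. Calling the method several times therefore sometimes succeeds and sometimes falls back:

Success: {"userId":2000,"name":"zhangsan"}

Failure: {"userId":9999,"name":"fall back"}

Console output on failure: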

DynamicServerListLoadBalancer:{NFLoadBalancer:name=service-1,current list of Servers=[service-provider:8091, service-provider:8092],Load balancer stats=Zone stats: {defaultzone=[Zone:defaultzone; Instance count:2; Active connections count: 0; Circuit breaker tripped count: 0; Active connections per server: 0.0;]
},Server stats: [[Server:service-provider:8092; Zone:defaultZone; Total Requests:0; Successive connection failure:0; Total blackout seconds:0; Last connection made:Thu Jan 01 08:00:00 CST 1970; First connection made: Thu Jan 01 08:00:00 CST 1970; Active Connections:0; total failure count in last (1000) msecs:0; average resp time:0.0; 90 percentile resp time:0.0; 95 percentile resp time:0.0; min resp time:0.0; max resp time:0.0; stddev resp time:0.0]
, [Server:service-provider:8091; Zone:defaultZone; Total Requests:0; Successive connection failure:0; Total blackout seconds:0; Last connection made:Thu Jan 01 08:00:00 CST 1970; First connection made: Thu Jan 01 08:00:00 CST 1970; Active Connections:0; total failure count in last (1000) msecs:0; average resp time:0.0; 90 percentile resp time:0.0; 95 percentile resp time:0.0; min resp time:0.0; max resp time:0.0; stddev resp time:0.0]
]}ServerList:org.springframework.cloud.netflix.ribbon.eureka.DomainExtractingServerList@ee501dc
cost time : 1610
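The mix of successful and fallback responses comes down to a time budget: the call either finishes within the timeout or is abandoned in favour of the fallback. The following is a minimal plain-Java sketch of that idea, not Hystrix's implementation; TimeoutFallbackSketch and callWithTimeout are hypothetical names introduced only for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper (not a Hystrix class): run a call with a time budget.
final class TimeoutFallbackSketch {

    static <T> T callWithTimeout(Callable<T> primary, long timeoutMillis, Supplier<T> fallback) {
        // A dedicated thread keeps the slow call away from the caller's thread.
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(primary);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
            return fallback.get();
        } catch (ExecutionException | TimeoutException e) {
            future.cancel(true); // interrupt the call if it is still running
            return fallback.get();
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Calling callWithTimeout with a primary that returns immediately yields the primary result, while a primary that sleeps past the budget yields the fallback value, which mirrors the two responses shown above.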

Of course, getUserFallBack itself may also fail. It can be given a fallback of its own in the same way:

@Service
public class UserService {
    @Autowired
    private RestTemplate restTemplate;

    @HystrixCommand(fallbackMethod = "getUserFallBack")
    public UserDto getUser() {
        long start = System.currentTimeMillis();
        try {
            Thread.sleep(new Random().nextInt(5000));
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        ResponseEntity<UserDto> responseEntity = restTemplate.getForEntity("http://service-1/getUser/{1}", UserDto.class, 2000);
        long end = System.currentTimeMillis();
        System.out.println("cost time : " + (end - start));
        return responseEntity.getBody();
    }

    @HystrixCommand(fallbackMethod = "userFallBack")
    public UserDto getUserFallBack() {
        UserDto userDto = new UserDto();
        userDto.setUserId(9999L);
        userDto.setName("fall back");
        return userDto;
    }

    public UserDto userFallBack() {
        return new UserDto();
    }
}


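The example above uses Hystrix together with Eureka and Ribbon. Hystrix can also be used on its own with Spring Boot.

3. Integrating Hystrix with Spring Boot

Create a Spring Boot project and add the Hystrix dependencies (hystrix-metrics-event-stream feeds the Hystrix Dashboard):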

        <dependency>
            <groupId>com.netflix.hystrix</groupId>
            <artifactId>hystrix-core</artifactId>
            <version>1.5.18</version>
        </dependency>
        <!-- http://mvnrepository.com/artifact/com.netflix.hystrix/hystrix-metrics-event-stream -->
        <dependency>
            <groupId>com.netflix.hystrix</groupId>
            <artifactId>hystrix-metrics-event-stream</artifactId>
            <version>1.5.18</version>
        </dependency>
        <dependency>
            <groupId>com.netflix.hystrix</groupId>
            <artifactId>hystrix-javanica</artifactId>
            <version>1.5.18</version>
        </dependency>

Add the Hystrix configuration:

import com.netflix.hystrix.contrib.javanica.aop.aspectj.HystrixCommandAspect;
import com.netflix.hystrix.contrib.metrics.eventstream.HystrixMetricsStreamServlet;
import org.springframework.boot.web.servlet.ServletRegistrationBean;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class HystrixConfig {
    /**
     * Publishes the metrics stream consumed by the Hystrix Dashboard.
     *
     * A ServletContextInitializer that registers a Servlet in a Servlet 3.0+ container.
     */
    @Bean
    public ServletRegistrationBean hystrixMetricsStreamServlet() {
        return new ServletRegistrationBean(new HystrixMetricsStreamServlet(), "/hystrix.stream");
    }

    /**
     * AspectJ aspect that intercepts and processes methods annotated with @HystrixCommand,
     * so that they run as Hystrix commands.
     */
    @Bean
    public HystrixCommandAspect hystrixCommandAspect() {
        return new HystrixCommandAspect();
    }
}


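Call the circuit-breaker endpoint a few times, then open http://localhost:9101/hystrix.stream to see the metrics stream. To view the stream in a dashboard:

1. Download the source from GitHub: https://github.com/kennedyoliveira/standalone-hystrix-dashboard
2. Deploy it by following its wiki. The default port is 7979.
3. Open the dashboard page. When the Hystrix logo appears, the dashboard is up. One of its JS files is hosted abroad, so the page may load slowly without a proxy.
4. Enter the address http://localhost:9101/hystrix.stream, click Add Stream, then click Monitor Streams.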

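4. Usage notes

Docker uses the bulkhead pattern to isolate processes, so that containers do not affect one another. Hystrix applies the same pattern to thread pools: it creates a separate thread pool for each dependency, so that high latency in one dependency only affects calls to that dependency and does not slow down the others. Thread-pool isolation makes the application more robust, but a pool per dependency adds overhead. Netflix considered this in the design and concluded that the benefit of isolation outweighs the cost, and its published performance tests support that conclusion. For systems with very strict performance requirements, Hystrix also offers semaphores to limit the concurrency of a single dependency. Semaphores cost far less than thread pools, but they cannot enforce timeouts or support asynchronous calls, so they are best used only for dependencies that are reliable enough.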
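The semaphore variant of the bulkhead can be sketched in plain Java as follows. This is a simplified illustration and not Hystrix's own code; the class name SemaphoreBulkheadSketch and its execute method are assumptions made for the example. Note that the action runs on the caller's thread, which is why this style cannot enforce a timeout.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical class for illustration: caps concurrent calls to one dependency.
final class SemaphoreBulkheadSketch {
    private final Semaphore permits;

    SemaphoreBulkheadSketch(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    <T> T execute(Supplier<T> action, Supplier<T> fallback) {
        // tryAcquire never blocks: a full bulkhead rejects the call at once.
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            return action.get(); // runs on the caller's thread, so no timeout is possible
        } catch (RuntimeException e) {
            return fallback.get();
        } finally {
            permits.release();
        }
    }
}
```

With a limit of 1, a second call made while the first is still running is rejected and receives the fallback; once the first call returns, its permit is released and later calls go through again.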

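Since Hystrix fallbacks are so convenient, should every service get one? No. As noted above, Hystrix isolates each dependency in its own thread pool to keep the application robust, and that costs some performance, so use it only where it is needed. Some operations do not need a fallback at all, for example writes, batch jobs and asynchronous computation: if they fail, it is enough to tell the caller whether the operation succeeded or failed.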

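1. Exception handling

When you extend HystrixCommand, every exception thrown from the run method except HystrixBadRequestException triggers the fallback logic. If an exception of your own should not trigger the fallback, throw HystrixBadRequestException from run instead. With @HystrixCommand, you can set ignoreExceptions = {BusinessException.class} to ignore specific exceptions: when one of them is thrown, Hystrix wraps it in a HystrixBadRequestException and does not trigger the fallback.

Applications often define custom exceptions and handle them in one place: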

@ControllerAdvice
public class CommonExceptionHandler {
    @ExceptionHandler(value = Exception.class)
    @ResponseBody
    public String handlerException(Exception e){
        return e.toString();
    }
}


For example, throwing the custom BizException triggers the fallback method, which is clearly not the result we want:

    public String printFallBack(String msg) {
        return "print error !" + msg + " time : " + new Date();
    }

    @HystrixCommand(fallbackMethod = "printFallBack")
    public String error(String result) {
        throw new BizException(9999, result);
        //throw new HystrixBadRequestException("HystrixBadRequestException" + result);
    }


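The fix is to list the exception in ignoreExceptions on @HystrixCommand:

@HystrixCommand(fallbackMethod = "printFallBack", ignoreExceptions = {BizException.class})

The custom exception is then no longer treated by Hystrix as a failure that triggers the fallback.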
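The decision behind ignoreExceptions can be pictured with a small plain-Java sketch. It is only an approximation under stated assumptions: IgnoreExceptionsSketch and run are hypothetical names, BizException here is a stand-in for the article's class, and the sketch rethrows the ignored exception directly rather than wrapping it the way Hystrix does.

```java
import java.util.function.Supplier;

// Hypothetical illustration of "ignored exceptions skip the fallback".
final class IgnoreExceptionsSketch {

    // Stand-in for the article's BizException (assumed definition).
    static final class BizException extends RuntimeException {
        BizException(String message) {
            super(message);
        }
    }

    static <T> T run(Supplier<T> command, Supplier<T> fallback, Class<?>... ignoreExceptions) {
        try {
            return command.get();
        } catch (RuntimeException e) {
            for (Class<?> ignored : ignoreExceptions) {
                if (ignored.isInstance(e)) {
                    throw e; // business error: hand it to the caller, no fallback
                }
            }
            return fallback.get(); // any other failure is degraded to the fallback
        }
    }
}
```

With BizException.class in the ignore list, a BizException reaches the caller (and a handler such as CommonExceptionHandler), while any other runtime exception still produces the fallback value.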

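2. Groups, commands and thread pools

Hystrix uses the command group to aggregate alerts, dashboard data and other statistics. By default it also partitions thread pools by command group: commands with the same group name share one thread pool, so a group name has to be given when a Hystrix command is created, and that name decides the default pool. With the inheritance approach, the class name is used as the command name by default, and it can be overridden in the constructor. Partitioning pools by group key (GroupKey) has a drawback: if a group named UserGroup contains many commands (CommandKey) such as login and register, those commands are not isolated from each other. In that case HystrixThreadPoolKey gives finer-grained control: giving login and register different HystrixThreadPoolKey values places them in separate pools. If no HystrixThreadPoolKey is specified, the pool is still chosen by command group.

@HystrixCommand has the matching attributes groupKey, commandKey and threadPoolKey (command group, command name and thread pool respectively).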

@HystrixCommand(groupKey = "userGroup", commandKey = "login", threadPoolKey = "loginThreadPool")
@HystrixCommand(groupKey = "userGroup", commandKey = "register", threadPoolKey = "registerThreadPool")
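How the pool is chosen from these keys can be sketched in plain Java. The registry below is a hypothetical illustration (ThreadPoolKeySketch and poolFor are invented names, and the pool size of 10 simply mirrors the default coreSize); it only shows the lookup rule, not Hystrix's internals.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical registry: one pool per thread pool key, else one per group key.
final class ThreadPoolKeySketch {
    private final Map<String, ExecutorService> pools = new ConcurrentHashMap<>();

    // threadPoolKey may be null; the group key then decides the pool.
    ExecutorService poolFor(String groupKey, String threadPoolKey) {
        String key = (threadPoolKey != null) ? threadPoolKey : groupKey;
        return pools.computeIfAbsent(key, k -> Executors.newFixedThreadPool(10));
    }

    void shutdown() {
        pools.values().forEach(ExecutorService::shutdown);
    }
}
```

Two commands in userGroup without a thread pool key resolve to the same pool, whereas loginThreadPool and registerThreadPool resolve to two different pools and are isolated from each other.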

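3. Request caching

For high-concurrency scenarios, Hystrix provides a request cache that avoids executing the same command repeatedly within one request. It can be implemented either by extending HystrixCommand or with annotations.

By extending HystrixCommand: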

public class UserCommand extends HystrixCommand<User> {
    private Integer userId;

    @Override
    protected User run() throws Exception {
        System.out.println("========== get user run ==========");
        return new User(this.userId, "小明", System.currentTimeMillis());
    }

    public UserCommand(Integer userId) {
        super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("userGroup")));
        this.userId = userId;
    }

    @Override
    public String getCacheKey() {
        return String.valueOf(userId);
    }

    @Override
    public User getFallback() {
        return new User(9999, "fail", System.currentTimeMillis());
    }
}


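With annotations:

Annotation	Description	Attributes
@CacheResult	Marks the command's result as cacheable; must be used together with @HystrixCommand	cacheKeyMethod
@CacheRemove	Invalidates the command's cache; the entry to invalidate is determined by the defined key	commandKey, cacheKeyMethod
@CacheKey	Used on a command parameter to make it the cache key; if no parameter is annotated, all parameters are used. If cacheKeyMethod on @CacheResult or @CacheRemove specifies how the key is generated, @CacheKey has no effect	value

Annotation example: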

@Service
public class UserService {
    @HystrixCommand(fallbackMethod = "getFallback")
    @CacheResult(cacheKeyMethod = "getCacheKey")
    public User getUserByUserId(Integer userId) {
        System.out.println("========== UserService get user run ==========");
        return new User(userId, "小明", System.currentTimeMillis());
    }

    public String getCacheKey(Integer userId) {
        return String.valueOf(userId);
    }

    public User getFallback(Integer userId) {
        return new User(9999, "fail", System.currentTimeMillis());
    }
}


Request cache example:

    @Autowired
    private UserService userService;

    @GetMapping("/getUserCommandByUserId/{userId}")
    public String getUserCommand(@PathVariable Integer userId) {
        HystrixRequestContext.initializeContext();

        System.out.println(new UserCommand(userId).execute().toString());
        System.out.println(new UserCommand(userId).execute().toString());
        System.out.println(new UserCommand(userId).execute().toString());
        System.out.println(new UserCommand(userId).execute().toString());

        return new UserCommand(userId).execute().toString();
    }

    @GetMapping("/getUserByUserId/{userId}")
    public String getUserByUserId(@PathVariable Integer userId) {
        HystrixRequestContext.initializeContext();

        System.out.println(userService.getUserByUserId(userId));
        System.out.println(userService.getUserByUserId(userId));
        System.out.println(userService.getUserByUserId(userId));

        return userService.getUserByUserId(userId).toString();
    }


If HystrixRequestContext.initializeContext() is not called, the following error is thrown:

java.util.concurrent.ExecutionException: Observable onError
Caused by: java.lang.IllegalStateException: Request caching is not available. Maybe you need to initialize the HystrixRequestContext?

A context has to be created first. If request B is to use the cached result of request A, A and B must run in the same context.
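The role of the context can be illustrated with a plain-Java sketch of a request-scoped cache. This is an assumption-laden simplification: RequestCacheSketch and its methods are hypothetical names, and a ThreadLocal map stands in for the real HystrixRequestContext.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical request-scoped cache: results live only inside one context.
final class RequestCacheSketch {
    private static final ThreadLocal<Map<String, Object>> CONTEXT = new ThreadLocal<>();

    static void initializeContext() {
        CONTEXT.set(new HashMap<>());
    }

    static void shutdownContext() {
        CONTEXT.remove();
    }

    @SuppressWarnings("unchecked")
    static <T> T execute(String cacheKey, Supplier<T> run) {
        Map<String, Object> cache = CONTEXT.get();
        if (cache == null) {
            // Same failure mode as the error above: no context, no cache.
            throw new IllegalStateException("Request caching is not available: context not initialized");
        }
        Object cached = cache.get(cacheKey);
        if (cached == null) {
            cached = run.get(); // first call for this key: actually run the command
            if (cached != null) {
                cache.put(cacheKey, cached);
            }
        }
        return (T) cached;
    }
}
```

Within one initialized context, repeated calls with the same key run the supplier only once; after shutdownContext the cached value is gone and a new context starts empty, which is why A and B must share a context to share a result.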

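5. Hystrix configuration parameters

Execution properties:

hystrix.command.default.execution.isolation.strategy Isolation strategy, THREAD or SEMAPHORE. Default THREAD.

hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds Command execution timeout. Default 1000 ms.

hystrix.command.default.execution.timeout.enabled Whether the execution timeout is enabled. Default true.
hystrix.command.default.execution.isolation.thread.interruptOnTimeout Whether the executing thread is interrupted when a timeout occurs. Default true.
hystrix.command.default.execution.isolation.semaphore.maxConcurrentRequests Maximum number of concurrent requests. Default 10. Only effective with ExecutionIsolationStrategy.SEMAPHORE. Requests beyond the limit are rejected. In principle the semaphore size is chosen the same way as the thread pool size, but semaphores should only guard small, fast units of work (millisecond level); otherwise use THREAD.
The semaphore size should be a small fraction of the thread pool of the whole container (Tomcat).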

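Fallback properties

These properties apply to both the THREAD and SEMAPHORE strategies.

hystrix.command.default.fallback.isolation.semaphore.maxConcurrentRequests Maximum number of concurrent fallback calls. Once this limit is reached, further requests are rejected with an exception and the fallback is not invoked. Default 10.
hystrix.command.default.fallback.enabled Whether hystrixCommand.getFallback() is attempted when execution fails or the request is rejected. Default true.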

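Circuit breaker properties

hystrix.command.default.circuitBreaker.enabled Whether a circuit breaker tracks the health of the command and short-circuits requests when it is unhealthy. Default true.
hystrix.command.default.circuitBreaker.requestVolumeThreshold Minimum number of requests in a rolling window before the circuit can trip. With a value of 20, if only 19 requests arrive within one rolling window (say 10 seconds), the circuit does not open even if all 19 fail. Default 20.
hystrix.command.default.circuitBreaker.sleepWindowInMilliseconds How long the circuit stays open before a retry is allowed. With a value of 5000, requests are rejected for 5000 ms after the circuit opens; after that a trial request is let through, and the circuit closes again if it succeeds. Default 5000.
hystrix.command.default.circuitBreaker.errorThresholdPercentage Error percentage threshold. When the error rate is >= this value, the circuit opens and all requests are short-circuited to the fallback. Default 50.
hystrix.command.default.circuitBreaker.forceOpen Forces the circuit open. When set, all requests are rejected. Default false.
hystrix.command.default.circuitBreaker.forceClosed Forces the circuit closed. When set, the circuit stays closed and circuitBreaker.errorThresholdPercentage is ignored. Default false.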
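How requestVolumeThreshold, errorThresholdPercentage and sleepWindowInMilliseconds interact can be sketched as a tiny state machine. The code below is a simplified illustration, not Hystrix's implementation: CircuitBreakerSketch is a hypothetical class, the counters are plain totals rather than a rolling window, and the clock is injected so the behaviour can be followed step by step.

```java
import java.util.function.LongSupplier;

// Hypothetical, simplified circuit breaker; counters are not windowed.
final class CircuitBreakerSketch {
    private final int requestVolumeThreshold;
    private final int errorThresholdPercentage;
    private final long sleepWindowMillis;
    private final LongSupplier clock; // current time in ms, injected for testing

    private int total;
    private int errors;
    private boolean open;
    private long openedOrLastTestedAt;

    CircuitBreakerSketch(int requestVolumeThreshold, int errorThresholdPercentage,
                         long sleepWindowMillis, LongSupplier clock) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.errorThresholdPercentage = errorThresholdPercentage;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    synchronized boolean allowRequest() {
        if (!open) {
            return true;
        }
        long now = clock.getAsLong();
        if (now - openedOrLastTestedAt >= sleepWindowMillis) {
            openedOrLastTestedAt = now; // let one trial request through, then wait again
            return true;
        }
        return false;
    }

    synchronized void recordSuccess() {
        if (open) { // the trial request succeeded: close and start counting afresh
            open = false;
            total = 0;
            errors = 0;
            return;
        }
        total++;
    }

    synchronized void recordFailure() {
        total++;
        errors++;
        boolean enoughVolume = total >= requestVolumeThreshold;
        if (!open && enoughVolume && errors * 100 / total >= errorThresholdPercentage) {
            open = true;
            openedOrLastTestedAt = clock.getAsLong();
        }
    }
}
```

With the default values (20, 50, 5000), 19 consecutive failures leave the circuit closed, the 20th opens it, requests are refused for the next 5000 ms, and a successful trial request after that closes it again.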

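Metrics properties

hystrix.command.default.metrics.rollingStats.timeInMilliseconds Length of the statistical rolling window in milliseconds. Whether the circuit opens is computed from the statistics of one rolling window. The window is divided into n buckets, and each bucket holds the counts of successes, failures, timeouts and rejections. Default 10000.
hystrix.command.default.metrics.rollingStats.numBuckets Number of buckets a rolling window is divided into. With numBuckets = 10 and a rolling window of 10000 ms, each bucket covers 1 second. The values must satisfy rolling window % numBuckets == 0. Default 10.
hystrix.command.default.metrics.rollingPercentile.enabled Whether execution latencies are tracked and computed as percentiles. Default true.
hystrix.command.default.metrics.rollingPercentile.timeInMilliseconds Length of the rolling percentile window. Default 60000.
hystrix.command.default.metrics.rollingPercentile.numBuckets Number of buckets in the rolling percentile window. Same logic as above. Default 6.
hystrix.command.default.metrics.rollingPercentile.bucketSize Maximum number of executions kept per bucket. With bucket size = 100 and a bucket window of 10 s, if 500 executions happen in those 10 s, only the last 100 are kept in the bucket. Increasing the value increases memory use and sorting cost. Default 100.
hystrix.command.default.metrics.healthSnapshot.intervalInMilliseconds Interval between health snapshots (used to compute success and error rates). Default 500 ms.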
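The relationship between timeInMilliseconds and numBuckets can be made concrete with a small bucketed counter. This is a minimal sketch under stated assumptions, not Hystrix's metrics code: RollingCounterSketch is a hypothetical class that counts a single event type and takes the current time as an argument.

```java
import java.util.Arrays;

// Hypothetical rolling counter: a window split into equal-sized buckets.
final class RollingCounterSketch {
    private final long bucketMillis;
    private final long[] counts;
    private final long[] bucketStart; // start time of the data held in each slot

    RollingCounterSketch(long timeInMilliseconds, int numBuckets) {
        if (timeInMilliseconds % numBuckets != 0) {
            throw new IllegalArgumentException("window must divide evenly into buckets");
        }
        this.bucketMillis = timeInMilliseconds / numBuckets;
        this.counts = new long[numBuckets];
        this.bucketStart = new long[numBuckets];
        Arrays.fill(bucketStart, -1L);
    }

    void increment(long nowMillis) {
        long start = nowMillis - (nowMillis % bucketMillis);
        int slot = (int) ((nowMillis / bucketMillis) % counts.length);
        if (bucketStart[slot] != start) { // slot still holds an expired bucket: reuse it
            bucketStart[slot] = start;
            counts[slot] = 0;
        }
        counts[slot]++;
    }

    // Sum of all buckets that still fall inside the window ending at nowMillis.
    long sum(long nowMillis) {
        long currentStart = nowMillis - (nowMillis % bucketMillis);
        long oldestStart = currentStart - bucketMillis * (counts.length - 1);
        long total = 0;
        for (int i = 0; i < counts.length; i++) {
            if (bucketStart[i] >= oldestStart && bucketStart[i] <= currentStart) {
                total += counts[i];
            }
        }
        return total;
    }
}
```

With a 10000 ms window and 10 buckets, each bucket covers 1 second; events recorded in the first second stop contributing to the sum as soon as the window has moved 10 buckets ahead.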

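Other settings

In hystrix.command.default and hystrix.threadpool.default, the default segment stands for the default CommandKey; settings under it apply to every command that has no settings under its own key.

Request Context properties
hystrix.command.default.requestCache.enabled Default true. Requires overriding getCacheKey(); when it returns null, nothing is cached.
hystrix.command.default.requestLog.enabled Whether executions are logged to HystrixRequestLog. Default true.

Collapser properties
hystrix.collapser.default.maxRequestsInBatch Maximum number of requests in one batch; reaching it triggers the batch. Default Integer.MAX_VALUE.
hystrix.collapser.default.timerDelayInMilliseconds Delay before a batch is triggered, that is, the batch runs at its creation time plus this value. Default 10.
hystrix.collapser.default.requestCache.enabled Whether request caching is enabled for HystrixCollapser.execute() and HystrixCollapser.queue(). Default true.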

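ThreadPool properties
The default of 10 threads suits most cases (sometimes it can be set lower). If a larger pool is needed, there is a basic formula to follow:
requests per second at peak when healthy × 99th percentile latency in seconds + some breathing room
That is: peak requests per second × (99th percentile response time + buffer)
For example, with 1000 requests per second and a 99th percentile response time of 60 ms, the formula gives:
1000 × (0.060 + 0.012) = 72

The basic principle is to keep the thread pool as small as possible. Its main job is to shed load and keep resources from being blocked.
When everything is healthy, the pool usually has only 1 or 2 threads active serving requests.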
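The sizing rule can be written down as a small helper. This is only a sketch of the arithmetic in the example above, treating the breathing room as extra milliseconds of latency as the example does; ThreadPoolSizingSketch and suggestedPoolSize are hypothetical names, and integer millisecond arithmetic is used to avoid floating-point rounding.

```java
// Hypothetical helper reproducing the sizing arithmetic from the example above.
final class ThreadPoolSizingSketch {

    // ceil(peak requests per second x (p99 latency + breathing room)), latencies in ms.
    static long suggestedPoolSize(long peakRequestsPerSecond,
                                  long p99LatencyMillis,
                                  long breathingRoomMillis) {
        long busyMillisPerSecond = peakRequestsPerSecond * (p99LatencyMillis + breathingRoomMillis);
        return (busyMillisPerSecond + 999) / 1000; // round up to whole threads
    }
}
```

For the figures above, suggestedPoolSize(1000, 60, 12) evaluates to 72 threads, far above the default of 10, which is why such a pool has to be sized explicitly.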

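hystrix.threadpool.default.coreSize Maximum number of threads executing concurrently. Default 10.
hystrix.threadpool.default.maxQueueSize Maximum size of the BlockingQueue. A value of -1 uses a SynchronousQueue; a positive value uses a LinkedBlockingQueue. The setting only takes effect at initialization; the queue size cannot be changed afterwards without reinitializing the thread executor. Default -1.
hystrix.threadpool.default.queueSizeRejectionThreshold Requests are rejected once the queue reaches this size, even if maxQueueSize has not been reached. Because maxQueueSize cannot be changed dynamically, this property allows the effective limit to be adjusted at runtime. It has no effect when maxQueueSize == -1. Default 5.
hystrix.threadpool.default.keepAliveTimeMinutes Has no effect when corePoolSize and maxPoolSize are the same (the default implementation). It only matters with a custom implementation supplied through a plugin (https://github.com/Netflix/Hystrix/wiki/Plugins). Default 1.
hystrix.threadpool.default.metrics.rollingStats.timeInMilliseconds Length of the statistical window for thread pool metrics. Default 10000.
hystrix.threadpool.default.metrics.rollingStats.numBuckets Number of buckets the rolling window is divided into. Default 10.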
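The interplay of maxQueueSize and queueSizeRejectionThreshold described above can be sketched in plain Java. This is an illustration, not Hystrix's code: ThreadPoolQueueSketch and its two methods are hypothetical, and treating 0 the same as -1 is an assumption made here because a LinkedBlockingQueue cannot have zero capacity.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;

// Hypothetical helper showing how the two queue settings interact.
final class ThreadPoolQueueSketch {

    // -1 (and, by assumption, 0) means hand-off without queuing; a positive value bounds the queue.
    static BlockingQueue<Runnable> queueFor(int maxQueueSize) {
        if (maxQueueSize <= 0) {
            return new SynchronousQueue<Runnable>();
        }
        return new LinkedBlockingQueue<Runnable>(maxQueueSize);
    }

    // The threshold is only consulted when a real queue exists.
    static boolean hasQueueSpace(int maxQueueSize, int queueSizeRejectionThreshold, int currentQueueSize) {
        if (maxQueueSize <= 0) {
            return true; // no queue: the thread pool itself accepts or rejects
        }
        return currentQueueSize < queueSizeRejectionThreshold;
    }
}
```

With maxQueueSize = 10 and a threshold of 5, the sixth queued request is rejected even though the queue could hold more, and lowering or raising the threshold changes that limit without rebuilding the queue.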