[Experiment] "Failover" in Envoy

Author: 张富春 (ahfuzhang). When reposting, please credit the author and link to the original. Thanks!


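The conclusion first:

When Envoy forwards requests, always configure retry_policy.num_retries > 0 on the route. Otherwise, any request that lands on an unavailable backend fails with a 503 instead of being sent to another backend.

The steps of the experiment follow: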

1. Set up the HTTP echo backends

docker run -it --rm --name aaa -p 8081:5678 hashicorp/http-echo -text="hello from a"

docker run --rm --name bbb -p 8082:5678 hashicorp/http-echo -text="hello from b"

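This provides two backends, a and b, for observing how Envoy forwards requests.

2. Start the Envoy proxy

  • First, prepare the Envoy configuration file (envoy.yaml):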
static_resources:
  listeners:
  - name: listener_4000
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 4000  # listening port
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: backend_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: backend_cluster
                  retry_policy:
                    retry_on: connect-failure,refused-stream,5xx,reset
                    num_retries: 1
                    per_try_timeout: 2s
                    retry_back_off:
                      base_interval: 3s
                      max_interval: 60s

                response_headers_to_add:
                - header:
                    key: "X-Envoy-Flags"
                    value: "%RESPONSE_FLAGS%"
                - header:
                    key: "X-Upstream-Host"
                    value: "%UPSTREAM_HOST%"

          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          access_log:
          - name: envoy.access_loggers.stdout
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
              log_format:
                text_format: >
                  [time: %START_TIME%]
                  %DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%
                  [%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%]
                  status=%RESPONSE_CODE%
                  flags=%RESPONSE_FLAGS%
                  duration=%DURATION%ms
                  upstream_host=%UPSTREAM_HOST%
                  upstream_cluster=%UPSTREAM_CLUSTER%
                  upstream_local_address=%UPSTREAM_LOCAL_ADDRESS%
                  upstream_transport_failure=%UPSTREAM_TRANSPORT_FAILURE_REASON%
                  downstream=%DOWNSTREAM_REMOTE_ADDRESS%
                  user-agent="%REQ(USER-AGENT)%"
                  xff="%REQ(X-FORWARDED-FOR)%"
                  authority="%REQ(:AUTHORITY)%"
                  request_id="%REQ(X-REQUEST-ID)%"
                  \n


  clusters:
  - name: backend_cluster
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: backend_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: host.docker.internal
                port_value: 8081  # backend A
        - endpoint:
            address:
              socket_address:
                address: host.docker.internal
                port_value: 8082  # backend B

admin:
  access_log_path: /dev/null
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901

  • Start Envoy:
docker run --rm -it \
  -v "$(pwd)/envoy.yaml:/etc/envoy/envoy.yaml" \
  -p 4000:4000 \
  -p 9901:9901 \
  envoyproxy/envoy:v1.29.1
  • Test:
time curl -G "http://127.0.0.1:4000/?a=333" -v

The output looks like this:

> GET /?a=333 HTTP/1.1
> Host: 127.0.0.1:4000
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 200 OK
< x-app-name: http-echo
< x-app-version: 1.0.0
< date: Fri, 28 Nov 2025 03:14:45 GMT
< content-length: 13
< content-type: text/plain; charset=utf-8
< x-envoy-upstream-service-time: 3  # this line is the forwarding time (ms)
< x-envoy-flags: -
< x-upstream-host: 192.168.65.254:8081
< server: envoy
<
hello from a
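With ROUND_ROBIN and both backends up, repeated requests should alternate between a and b. A minimal Python sketch for checking this, assuming Envoy is reachable at 127.0.0.1:4000 as configured above; the helper names fetch_body, tally and main are made up for this illustration:

```python
from collections import Counter
from urllib.request import urlopen


def fetch_body(url: str, timeout: float = 5.0) -> str:
    """Send one GET request and return the body without surrounding whitespace.

    urlopen raises urllib.error.HTTPError for a 503, so a failed
    request surfaces as an exception rather than a body.
    """
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def tally(bodies):
    """Count responses per backend, keyed by the echoed body text."""
    return Counter(bodies)


def main(url: str = "http://127.0.0.1:4000/", requests: int = 10) -> None:
    # Assumption: Envoy from the steps above is listening on this URL.
    bodies = [fetch_body(url) for _ in range(requests)]
    for body, count in sorted(tally(bodies).items()):
        print(f"{count:3d}  {body}")
```

Calling main() with both backends running should print one line per backend with its request count.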

3. [Experiment 1] No retry

Change the YAML configuration to:

                  retry_policy:
                    retry_on: connect-failure,refused-stream,5xx,reset
                    num_retries: 0

Stop one backend immediately before sending the request and observe the result:

docker stop aaa && time curl -G "http://127.0.0.1:4000/?a=333" -v

The client sees the following:

> GET /?a=333 HTTP/1.1
> Host: 127.0.0.1:4000
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 503 Service Unavailable
< x-envoy-flags: UF,URX
< x-upstream-host: 192.168.65.254:8081
< content-length: 152
< content-type: text/plain
< date: Fri, 28 Nov 2025 03:19:59 GMT
< server: envoy
<
* Connection #0 to host 127.0.0.1 left intact
upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 111
  • The Envoy log shows:
[time: 2025-11-28T03:19:59.372Z] 192.168.65.1 [GET /?a=333 HTTP/1.1] status=503 flags=UF,URX duration=3ms upstream_host=192.168.65.254:8081 upstream_cluster=backend_cluster upstream_local_address=- upstream_transport_failure=delayed_connect_error:_111 downstream=192.168.65.1:64080 user-agent="curl/8.7.1" xff="-" authority="127.0.0.1:4000" request_id="e53800ed-ed82-47c5-9bc7-6b640ef5a22f"

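As shown, Envoy did not fail over: the request went to the stopped backend a (upstream_host=192.168.65.254:8081) and the client received a 503.
flags=UF,URX describes the failure:

  • UF (UpstreamConnectionFailure): the upstream connection failed. Here the TCP handshake could not complete, so no connection was established (connect error 111 is ECONNREFUSED on Linux)
  • URX (UpstreamRetryLimitExceeded): the request was rejected because the upstream retry limit was reached. With num_retries: 0, no retry is allowed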
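When many log lines need checking, the flags can be pulled out mechanically. A small Python sketch, assuming the access log uses the flags=%RESPONSE_FLAGS% format from the configuration above. The names FLAG_MEANINGS, response_flags and explain are hypothetical, and the table covers only the two flags seen here (Envoy defines many more):

```python
import re

# Meanings paraphrased from Envoy's access-log documentation.
# Only the flags observed in this experiment are listed.
FLAG_MEANINGS = {
    "UF": "UpstreamConnectionFailure: upstream connection failure",
    "URX": "UpstreamRetryLimitExceeded: upstream retry limit (HTTP) or "
           "maximum connect attempts (TCP) reached",
}


def response_flags(log_line: str) -> list[str]:
    """Extract the flags from a line that contains 'flags=...'."""
    match = re.search(r"\bflags=(\S+)", log_line)
    if match is None or match.group(1) == "-":
        return []  # "-" means no flag was set
    return match.group(1).split(",")


def explain(log_line: str) -> list[str]:
    """Return one 'FLAG: meaning' string per flag found in the line."""
    return [
        f"{flag}: {FLAG_MEANINGS.get(flag, 'not listed in this sketch')}"
        for flag in response_flags(log_line)
    ]
```

For the log line of experiment 1, explain() returns two entries, one for UF and one for URX. For a successful request (flags=-) it returns an empty list.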

4. [Experiment 2] With retry

Change the YAML configuration to:

                  retry_policy:
                    retry_on: connect-failure,refused-stream,5xx,reset
                    num_retries: 1
                    per_try_timeout: 2s
                    retry_back_off:
                      base_interval: 3s
                      max_interval: 60s

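To make the retry visible, the retry back-off is deliberately set very large. If the response comes back slowly, we know Envoy performed a failover.

Restart Envoy (and start backend a again first if it is gone: it was started with --rm, so the earlier docker stop removed it), then run the client command: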

docker stop aaa && time curl -G "http://127.0.0.1:4000/?a=333" -v

The result:

> GET /?a=333 HTTP/1.1
> Host: 127.0.0.1:4000
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 200 OK
< x-app-name: http-echo
< x-app-version: 1.0.0
< date: Fri, 28 Nov 2025 03:30:58 GMT
< content-length: 13
< content-type: text/plain; charset=utf-8
< x-envoy-upstream-service-time: 628
< x-envoy-flags: -
< x-upstream-host: 192.168.65.254:8082
< server: envoy
<
hello from b

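x-envoy-upstream-service-time: 628 shows that the latency is clearly longer than the 3 ms measured in step 2. This is consistent with Envoy first trying the stopped backend a, waiting for the retry back-off, and then succeeding on backend b.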
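The 628 ms is well below the configured base_interval of 3s. That matches Envoy's documentation, which describes a fully jittered exponential back-off: for base interval B and retry number N, the delay is drawn at random from [0, (2^N - 1) * B) and capped at max_interval. A Python sketch of that documented rule, not Envoy's actual implementation; the function names are made up for this illustration:

```python
import random


def backoff_upper_bound(retry_number: int, base_interval: float,
                        max_interval: float) -> float:
    """Upper bound (seconds) of the back-off before retry N (1-based)."""
    return min((2 ** retry_number - 1) * base_interval, max_interval)


def sample_backoff(retry_number: int, base_interval: float,
                   max_interval: float) -> float:
    """Draw one jittered delay, uniform in [0, upper bound)."""
    bound = backoff_upper_bound(retry_number, base_interval, max_interval)
    return random.random() * bound  # random.random() is in [0.0, 1.0)


def main() -> None:
    # Values from the retry_back_off block in this experiment.
    for n in range(1, 6):
        print(n, backoff_upper_bound(n, base_interval=3.0, max_interval=60.0))
```

With base_interval 3s and max_interval 60s, the first retry waits somewhere between 0 and 3 seconds, so a value like 628 ms is expected. A slow response therefore indicates a retry, but a fast one does not rule it out.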

The Envoy log:

[time: 2025-11-28T03:30:57.738Z] 192.168.65.1 [GET /?a=333 HTTP/1.1] status=200 flags=- duration=629ms upstream_host=192.168.65.254:8082 upstream_cluster=backend_cluster upstream_local_address=172.17.0.3:53022 upstream_transport_failure=- downstream=192.168.65.1:32245 user-agent="curl/8.7.1" xff="-" authority="127.0.0.1:4000" request_id="1c693834-1bc1-4f98-8075-ca8be041a4c9"

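The log looks completely normal and gives no sign that a retry happened. With this log format, the latency is the only hint that a retry took place.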
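Besides latency, Envoy's cluster statistics count retries (for example cluster.backend_cluster.upstream_rq_retry), and the admin interface on port 9901 serves them as text at /stats. A Python sketch for reading those counters, assuming the admin port is published as in the docker command above. The helper names are hypothetical, and the sample values in the usage note are illustrative, not captured from this experiment:

```python
from urllib.request import urlopen


def retry_counters(stats_text: str, cluster: str = "backend_cluster") -> dict[str, int]:
    """Pick the upstream_rq_retry* counters of one cluster out of /stats text.

    Each line of the text format looks like 'name: value'.
    """
    prefix = f"cluster.{cluster}.upstream_rq_retry"
    counters = {}
    for line in stats_text.splitlines():
        name, sep, value = line.partition(": ")
        if sep and name.startswith(prefix) and value.strip().isdigit():
            counters[name] = int(value)
    return counters


def fetch_stats(admin_url: str = "http://127.0.0.1:9901/stats") -> str:
    # Assumption: the admin listener from envoy.yaml is reachable here.
    with urlopen(admin_url, timeout=5.0) as resp:
        return resp.read().decode("utf-8")
```

For example, retry_counters(fetch_stats()) after experiment 2 would be expected to contain a non-zero cluster.backend_cluster.upstream_rq_retry, while after experiment 1 it should stay at 0.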

Have fun. 😃

posted on 2025-11-28 11:33  ahfuzhang