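[Experiment] "Failover" in envoy
Author: 张富春 (ahfuzhang). When reposting, please credit the author and link to the original. Thanks!
Conclusion first:
When envoy forwards requests, be sure to configure a retry_policy with num_retries > 0. Otherwise, any request that is routed to an unavailable backend fails with a 503 instead of being retried on another backend.
The steps of the experiment:
1. Set up the http echo backend servers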
docker run -it --rm --name aaa -p 8081:5678 hashicorp/http-echo -text="hello from a"
docker run --rm --name bbb -p 8082:5678 hashicorp/http-echo -text="hello from b"
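This provides two backends, a and b, so we can observe how envoy forwards requests. Each command runs in the foreground, so use a separate terminal for each.
2. Start the envoy proxy
- First, prepare the envoy config file (envoy.yaml)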
static_resources:
  listeners:
    - name: listener_4000
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 4000 # listening port
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: backend_service
                      domains: ["*"]
                      routes:
                        - match:
                            prefix: "/"
                          route:
                            cluster: backend_cluster
                            retry_policy:
                              retry_on: connect-failure,refused-stream,5xx,reset
                              num_retries: 1
                              per_try_timeout: 2s
                              retry_back_off:
                                base_interval: 3s
                                max_interval: 60s
                          response_headers_to_add:
                            - header:
                                key: "X-Envoy-Flags"
                                value: "%RESPONSE_FLAGS%"
                            - header:
                                key: "X-Upstream-Host"
                                value: "%UPSTREAM_HOST%"
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                access_log:
                  - name: envoy.access_loggers.stdout
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
                      log_format:
                        text_format: >
                          [time: %START_TIME%]
                          %DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%
                          [%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%]
                          status=%RESPONSE_CODE%
                          flags=%RESPONSE_FLAGS%
                          duration=%DURATION%ms
                          upstream_host=%UPSTREAM_HOST%
                          upstream_cluster=%UPSTREAM_CLUSTER%
                          upstream_local_address=%UPSTREAM_LOCAL_ADDRESS%
                          upstream_transport_failure=%UPSTREAM_TRANSPORT_FAILURE_REASON%
                          downstream=%DOWNSTREAM_REMOTE_ADDRESS%
                          user-agent="%REQ(USER-AGENT)%"
                          xff="%REQ(X-FORWARDED-FOR)%"
                          authority="%REQ(:AUTHORITY)%"
                          request_id="%REQ(X-REQUEST-ID)%"
                          \n
  clusters:
    - name: backend_cluster
      connect_timeout: 0.25s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: backend_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: host.docker.internal
                      port_value: 8081 # backend A
              - endpoint:
                  address:
                    socket_address:
                      address: host.docker.internal
                      port_value: 8082 # backend B
admin:
  access_log_path: /dev/null
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901
- Start envoy:
docker run --rm -it \
-v $(pwd)/envoy.yaml:/etc/envoy/envoy.yaml \
-p 4000:4000 \
-p 9901:9901 \
envoyproxy/envoy:v1.29.1
- Test
time curl -G "http://127.0.0.1:4000/?a=333" -v
The output looks like this:
> GET /?a=333 HTTP/1.1
> Host: 127.0.0.1:4000
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 200 OK
< x-app-name: http-echo
< x-app-version: 1.0.0
< date: Fri, 28 Nov 2025 03:14:45 GMT
< content-length: 13
< content-type: text/plain; charset=utf-8
< x-envoy-upstream-service-time: 3 # time spent on the upstream request, in milliseconds
< x-envoy-flags: -
< x-upstream-host: 192.168.65.254:8081
< server: envoy
<
hello from a
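To watch the ROUND_ROBIN policy alternate between the two backends, the single request above can be repeated in a loop. The following is a minimal sketch, not part of the original experiment. It assumes the X-Upstream-Host response header from the config above is present; tally_upstreams is a hypothetical helper name.

```shell
# tally_upstreams URL [N]: send N requests (default 10) and count the
# responses per upstream host, based on the X-Upstream-Host response header.
tally_upstreams() {
  url=$1
  n=${2:-10}
  i=0
  while [ "$i" -lt "$n" ]; do
    # -s: quiet, -o /dev/null: discard the body, -D -: print response headers
    curl -s -o /dev/null -D - "$url"
    i=$((i + 1))
  done |
    tr -d '\r' |
    awk -F': ' 'tolower($1) == "x-upstream-host" { c[$2]++ }
                END { for (h in c) print c[h], h }' |
    sort -k2
}
```

For example, tally_upstreams "http://127.0.0.1:4000/?a=333" 10 prints one line per upstream host with its request count. With both backends up, the counts should be roughly even.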
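3. [Experiment 1] No retry
Change the yaml config to:
retry_policy:
  retry_on: connect-failure,refused-stream,5xx,reset
  num_retries: 0
Restart envoy with this config, then stop one backend right before sending the request and see what happens: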
docker stop aaa && time curl -G "http://127.0.0.1:4000/?a=333" -v
The client sees the following:
> GET /?a=333 HTTP/1.1
> Host: 127.0.0.1:4000
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 503 Service Unavailable
< x-envoy-flags: UF,URX
< x-upstream-host: 192.168.65.254:8081
< content-length: 152
< content-type: text/plain
< date: Fri, 28 Nov 2025 03:19:59 GMT
< server: envoy
<
* Connection #0 to host 127.0.0.1 left intact
upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 111
- The envoy log:
[time: 2025-11-28T03:19:59.372Z] 192.168.65.1 [GET /?a=333 HTTP/1.1] status=503 flags=UF,URX duration=3ms upstream_host=192.168.65.254:8081 upstream_cluster=backend_cluster upstream_local_address=- upstream_transport_failure=delayed_connect_error:_111 downstream=192.168.65.1:64080 user-agent="curl/8.7.1" xff="-" authority="127.0.0.1:4000" request_id="e53800ed-ed82-47c5-9bc7-6b640ef5a22f"
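As you can see, envoy did not fail over to the other backend.
flags=UF,URX shows what went wrong:
- UF: Upstream connection Failure. The TCP three-way handshake failed and no connection could be established.
- URX: Upstream Retry limit eXceeded. The request was rejected because the upstream retry limit was reached (here num_retries is 0, so no retry was allowed).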
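When more than one request is sent, it helps to summarize the access log instead of reading it line by line. The sketch below is an addition to the original experiment. It assumes the log_format defined in the config above (fields status=, flags= and upstream_host=); summarize_access_log is a hypothetical helper name.

```shell
# summarize_access_log: read envoy access-log lines on stdin and count them
# by response status, response flags and upstream host.
summarize_access_log() {
  awk '/status=/ {
    status = flags = host = "-"
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^status=/)        status = substr($i, 8)   # after "status="
      if ($i ~ /^flags=/)         flags  = substr($i, 7)   # after "flags="
      if ($i ~ /^upstream_host=/) host   = substr($i, 15)  # after "upstream_host="
    }
    count[status " " flags " " host]++
  }
  END { for (k in count) print count[k], k }' | sort
}
```

Assuming the envoy output was saved to a file (envoy.log is a made-up name), summarize_access_log < envoy.log prints lines of the form "count status flags upstream_host". With num_retries: 0, ROUND_ROBIN and one backend stopped, one would expect roughly every second request to show up as 503 UF,URX against the stopped backend, since this config has no health checking to take it out of rotation.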
4. [Experiment 2] With retry
Change the yaml config to:
retry_policy:
  retry_on: connect-failure,refused-stream,5xx,reset
  num_retries: 1
  per_try_timeout: 2s
  retry_back_off:
    base_interval: 3s
    max_interval: 60s
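To make the retry visible, the retry back-off is deliberately set very large. If the response comes back slowly, we know that envoy failed over.
Restart envoy (and start backend aaa again, since experiment 1 stopped it), then run the client command: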
docker stop aaa && time curl -G "http://127.0.0.1:4000/?a=333" -v
The result:
> GET /?a=333 HTTP/1.1
> Host: 127.0.0.1:4000
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 200 OK
< x-app-name: http-echo
< x-app-version: 1.0.0
< date: Fri, 28 Nov 2025 03:30:58 GMT
< content-length: 13
< content-type: text/plain; charset=utf-8
< x-envoy-upstream-service-time: 628
< x-envoy-flags: -
< x-upstream-host: 192.168.65.254:8082
< server: envoy
<
hello from b
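x-envoy-upstream-service-time: 628 shows that the latency is noticeably longer (628 ms versus 3 ms in the first test), and the response now comes from backend b.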
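The delay is well below the configured base_interval of 3s, which is consistent with envoy's back-off being jittered: according to the router documentation, the wait before retry number N is drawn at random from the range [0, (2^N - 1) * base_interval), capped by max_interval. The sketch below only evaluates that documented upper bound. backoff_upper_ms is a hypothetical helper, and the exact behavior of a given envoy version may differ in detail for N > 1.

```shell
# backoff_upper_ms BASE_MS MAX_MS N: exclusive upper bound (in ms) of the
# jittered back-off before retry number N, using the documented range
# [0, (2^N - 1) * BASE_MS), capped at MAX_MS.
backoff_upper_ms() {
  base=$1
  max=$2
  n=$3
  pow=1
  i=0
  while [ "$i" -lt "$n" ]; do
    pow=$((pow * 2))   # pow = 2^N after the loop
    i=$((i + 1))
  done
  upper=$(( (pow - 1) * base ))
  if [ "$upper" -gt "$max" ]; then
    upper=$max
  fi
  echo "$upper"
}
```

With the values from this experiment, backoff_upper_ms 3000 60000 1 gives 3000, so the single retry waits a random time below 3 seconds, and the observed 628 ms falls inside that range.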
The envoy log:
[time: 2025-11-28T03:30:57.738Z] 192.168.65.1 [GET /?a=333 HTTP/1.1] status=200 flags=- duration=629ms upstream_host=192.168.65.254:8082 upstream_cluster=backend_cluster upstream_local_address=172.17.0.3:53022 upstream_transport_failure=- downstream=192.168.65.1:32245 user-agent="curl/8.7.1" xff="-" authority="127.0.0.1:4000" request_id="1c693834-1bc1-4f98-8075-ca8be041a4c9"
The log looks completely normal and shows no sign of a retry. With this log format, the only hint that a retry happened is the latency.
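Another place to look is the admin interface, which the config above exposes on port 9901: the cluster statistics include retry counters such as upstream_rq_retry and upstream_rq_retry_success. The sketch below was not part of the original experiment. It filters those counters out of the plain-text /stats output; retry_counters is a hypothetical helper name.

```shell
# retry_counters CLUSTER: read envoy admin /stats text on stdin and print
# only the retry-related counters of the given cluster.
retry_counters() {
  grep -E "^cluster\.$1\.upstream_rq_retry[a-z_]*: "
}
```

Usage against the running proxy would look like: curl -s http://127.0.0.1:9901/stats | retry_counters backend_cluster. A non-zero upstream_rq_retry after the request in experiment 2 would confirm that a retry took place. Envoy's access log also has a %UPSTREAM_REQUEST_ATTEMPT_COUNT% operator that could be added to log_format to record the number of attempts per request; it was not tried in this experiment.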
Have fun. 😃
