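[Experiment] When a backend behind envoy has no graceful shutdown
Author: 张富春 (ahfuzhang). When reposting, please credit the author and link to the original. Thanks!
The conclusion first:
If the envoy route is configured with a suitable retry policy and timeouts, then even when a backend server has no graceful shutdown, users do not see any 503/504 errors in the failure modes tested here.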
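I wrote a simple HTTP server to simulate the abnormal behavior of a service in a container that does not shut down gracefully.
The behavior shows up as one of the following:
- The backend cannot be reached at all; the client may get a TCP reset as soon as it connects.
  - The previous post already simulated this case, where the backend is completely destroyed. See: [Experiment] "Failover" in envoy
- The process's port is still open, but the TCP connection is closed as soon as it is established.
- The port is fine and the request is received, but the TCP connection is closed before any response is sent.
- The port is fine and the request is received; the connection stays open, but nothing is ever sent back.
The results for each kind of abnormal behavior are in the table below: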
| Configuration | One backend receives the request and keeps the TCP connection open without responding | One backend closes the TCP connection after receiving the request | One backend closes the TCP connection right after accept | Key settings |
|---|---|---|---|---|
| num_retries: 0 | 504 Gateway Timeout (x-envoy-flags: UT,URX) | 503 Service Unavailable (x-envoy-flags: UC,URX) | 503 Service Unavailable (x-envoy-flags: UC,URX) | - |
| num_retries: 1 (no timeout configured) | curl: (28) Operation timed out (the client timed out first) | 200 OK (failover succeeded) | 200 OK (failover succeeded) | - |
| num_retries: 1 (timeout configured, but incorrectly) | 504 Gateway Timeout (x-envoy-flags: UT) | 200 OK (failover succeeded) | 200 OK (failover succeeded) | The overall timeout is too short |
| num_retries: 1 (timeout configured) | 200 OK (failover succeeded) | 200 OK (failover succeeded) | 200 OK (failover succeeded) | timeout: 2.5s, num_retries: 1, per_try_timeout: 0.5s |
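Experiment details
- Write a dumb server in golang to simulate the three kinds of abnormal behavior listed in the table
- Start a second, healthy HTTP echo server
- Have envoy forward HTTP requests to these two backends
- Verify: when a request lands on the abnormal backend, can the right configuration achieve "Failover"?
golang dumb server
The source code is as follows: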
package main

import (
	"bufio"
	"bytes"
	"flag"
	"fmt"
	"io"
	"log"
	"net"
	"strings"
)

var (
	addr             = flag.String("addr", "0.0.0.0:8080", "listen address")
	closeAfterAccept = flag.Bool("close_after_accept", false, "close connection immediately after accept")
)

func main() {
	flag.Parse()
	ln, err := net.Listen("tcp", *addr)
	if err != nil {
		log.Fatalf("listen on %s: %v", *addr, err)
	}
	log.Printf("listening on %s", *addr)
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Printf("accept error: %v", err)
			continue
		}
		if *closeAfterAccept {
			conn.Close()
			continue
		}
		go handleConn(conn)
	}
}

func handleConn(conn net.Conn) {
	defer conn.Close()
	reader := bufio.NewReader(conn)
	for {
		header, err := readHTTPHeader(reader)
		if err != nil {
			if err != io.EOF {
				log.Printf("read header error: %v", err)
			}
			return
		}
		path := extractPath(header)
		switch path {
		case "/dumb":
			log.Printf("dumb request from %s, keeping connection open", conn.RemoteAddr())
			// Do nothing and wait for more data; upstream should hit a timeout.
			continue
		case "/dumb_and_close":
			log.Printf("dumb_and_close request from %s, closing connection", conn.RemoteAddr())
			return
		default:
			log.Printf("echo request for %s from %s", path, conn.RemoteAddr())
			body := header + "addr:" + *addr
			resp := fmt.Sprintf("HTTP/1.1 200 OK\r\nContent-Type: text/plain; charset=utf-8\r\nContent-Length: %d\r\nConnection: keep-alive\r\nServer: %s\r\n\r\n%s", len(body), *addr, body)
			if _, err := conn.Write([]byte(resp)); err != nil {
				log.Printf("write response error: %v", err)
				return
			}
		}
	}
}

func readHTTPHeader(r *bufio.Reader) (string, error) {
	var buf bytes.Buffer
	for {
		line, err := r.ReadBytes('\n')
		if err != nil {
			return "", err
		}
		buf.Write(line)
		if bytes.Contains(buf.Bytes(), []byte("\r\n\r\n")) {
			return buf.String(), nil
		}
	}
}

func extractPath(header string) string {
	lines := strings.Split(header, "\r\n")
	if len(lines) == 0 {
		return ""
	}
	parts := strings.Split(lines[0], " ")
	if len(parts) < 2 {
		return ""
	}
	return parts[1]
}
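- Build it with go build
- ./dumb_server -addr=0.0.0.0:8091 -close_after_accept=false
  - Simulates a backend that accepts the connection and receives the request
- ./dumb_server -addr=0.0.0.0:8091 -close_after_accept=true
  - Simulates a backend that closes the TCP connection right after it is established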
echo server
docker run --rm --name bbb -p 8082:5678 hashicorp/http-echo -text="hello from b"
envoy proxy configuration
The yaml configuration is as follows:
static_resources:
  listeners:
  - name: listener_4000
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 4000
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: backend_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: backend_cluster
                  # These are the parameters this post experiments with
                  timeout: 2.5s
                  retry_policy:
                    retry_on: connect-failure,refused-stream,5xx,reset
                    num_retries: 2
                    per_try_timeout: 0.5s
                    retry_back_off:
                      base_interval: 1s
                      max_interval: 60s
                response_headers_to_add:
                - header:
                    key: "X-Envoy-Flags"
                    value: "%RESPONSE_FLAGS%"
                - header:
                    key: "X-Upstream-Host"
                    value: "%UPSTREAM_HOST%"
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          access_log:
          - name: envoy.access_loggers.stdout
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
              log_format:
                text_format: >
                  [time: %START_TIME%]
                  %DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%
                  [%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%]
                  status=%RESPONSE_CODE%
                  flags=%RESPONSE_FLAGS%
                  duration=%DURATION%ms
                  upstream_host=%UPSTREAM_HOST%
                  upstream_cluster=%UPSTREAM_CLUSTER%
                  upstream_local_address=%UPSTREAM_LOCAL_ADDRESS%
                  upstream_transport_failure=%UPSTREAM_TRANSPORT_FAILURE_REASON%
                  downstream=%DOWNSTREAM_REMOTE_ADDRESS%
                  user-agent="%REQ(USER-AGENT)%"
                  xff="%REQ(X-FORWARDED-FOR)%"
                  authority="%REQ(:AUTHORITY)%"
                  request_id="%REQ(X-REQUEST-ID)%"
                  \n
                # retry_count=%RETRY_COUNT%
  clusters:
  - name: backend_cluster
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: backend_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: host.docker.internal
                port_value: 8091
        - endpoint:
            address:
              socket_address:
                address: host.docker.internal
                port_value: 8082
admin:
  access_log_path: /dev/null
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901
- Start envoy:
docker run --rm -it \
-v $(pwd)/envoy.yaml:/etc/envoy/envoy.yaml \
-p 4000:4000 \
-p 9901:9901 \
envoyproxy/envoy:v1.29.1
Experiment steps
1. No retry; the backend receives the GET request but never responds
- envoy configuration:
    route:
      cluster: backend_cluster
      retry_policy:
        retry_on: connect-failure,refused-stream,5xx,reset
        num_retries: 0
        per_try_timeout: 2s
        retry_back_off:
          base_interval: 3s
          max_interval: 60s
- dumb server:
  - ./dumb_server -addr=0.0.0.0:8091 -close_after_accept=false
- Simulated request:
    time curl "http://127.0.0.1:4000/dumb" -v --max-time 3
- Result:
  - 504 Gateway Timeout
  - x-envoy-flags: UT,URX
    - UT: Upstream Request Timeout
    - URX: Upstream Retry Limit Exceeded
2. No retry; the backend receives the GET request and then closes the TCP connection
- envoy configuration:
    route:
      cluster: backend_cluster
      retry_policy:
        retry_on: connect-failure,refused-stream,5xx,reset
        num_retries: 0
        per_try_timeout: 2s
        retry_back_off:
          base_interval: 3s
          max_interval: 60s
- dumb server:
  - ./dumb_server -addr=0.0.0.0:8091 -close_after_accept=false
- Simulated request:
    time curl "http://127.0.0.1:4000/dumb_and_close" -v --max-time 3
- Result:
  - 503 Service Unavailable
  - x-envoy-flags: UC,URX
    - UC: Upstream Connection Termination
    - URX: Upstream Retry Limit Exceeded
3. No retry; the backend closes the connection right after accept
- envoy configuration:
    route:
      cluster: backend_cluster
      retry_policy:
        retry_on: connect-failure,refused-stream,5xx,reset
        num_retries: 0
        per_try_timeout: 2s
        retry_back_off:
          base_interval: 3s
          max_interval: 60s
- dumb server:
  - ./dumb_server -addr=0.0.0.0:8091 -close_after_accept=true
- Simulated request:
    time curl "http://127.0.0.1:4000/anything" -v --max-time 3
- Result:
  - 503 Service Unavailable
  - x-envoy-flags: UC,URX
    - UC: Upstream Connection Termination
    - URX: Upstream Retry Limit Exceeded
4. With retry; the backend receives the GET request but never responds
- envoy configuration:
    route:
      cluster: backend_cluster
      timeout: 2.5s # timeout is the overall budget for the request, retries included; it must be greater than per_try_timeout + retry_back_off.base_interval, otherwise the retry has no time left to run
      retry_policy:
        retry_on: connect-failure,refused-stream,5xx,reset
        num_retries: 1
        per_try_timeout: 0.5s
        retry_back_off:
          base_interval: 1s
          max_interval: 60s
- dumb server:
  - ./dumb_server -addr=0.0.0.0:8091 -close_after_accept=false
- Simulated request:
    time curl "http://127.0.0.1:4000/dumb" -v --max-time 3
- Result:
  - 200 OK
The remaining experiments are combinations of the methods above.
Have fun. 😃
