[Experiment] What happens when a backend behind envoy has no graceful shutdown

Author: Zhang Fuchun (ahfuzhang). Please credit the author and link to the original when reposting. Thanks!


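The conclusion first:

If the route in envoy is configured with a proper retry and timeout policy, then in the scenarios tested here the user never sees a 503/504 error, even when a backend server does not shut down gracefully: the failed request is retried on the healthy backend.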

I implemented a simple http server to simulate the abnormal behavior of a service in a container that does not shut down gracefully.

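The behavior shows up as one of the following:

  • The connection cannot be established at all; the client may get a tcp reset as soon as it connects
  • The process's port is still open, but the tcp connection is closed right after it is established
  • The port works and the request is received, but the tcp connection is closed before any reply is sent
  • The port works and the request is received; the connection stays open, but nothing is ever sent back.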
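
The first bullet (a reset right after connecting) is not covered by the dumb server shown later in this post, whose plain Close() normally ends the connection with a FIN. A minimal sketch of one way to force a reset in Go, assuming a Linux-like TCP stack: set SO_LINGER to 0 before closing. The helper names acceptAndReset and probeReset are made up for this illustration.

```go
package main

import (
	"fmt"
	"net"
)

// acceptAndReset accepts one connection and closes it with SO_LINGER=0,
// which asks the kernel to abort the connection (TCP RST) instead of
// closing it with a normal FIN.
func acceptAndReset(ln net.Listener) error {
	conn, err := ln.Accept()
	if err != nil {
		return err
	}
	if tcp, ok := conn.(*net.TCPConn); ok {
		if err := tcp.SetLinger(0); err != nil {
			conn.Close()
			return err
		}
	}
	return conn.Close()
}

// probeReset starts a throwaway listener on a loopback port, connects to it,
// and returns the error the client gets when it tries to read
// (or a setup error if listening/dialing failed).
func probeReset() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()

	done := make(chan error, 1)
	go func() { done <- acceptAndReset(ln) }()

	c, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return err
	}
	defer c.Close()

	if err := <-done; err != nil {
		return err
	}
	_, err = c.Read(make([]byte, 1))
	return err
}

func main() {
	fmt.Println("client read error:", probeReset())
}
```

The exact error text depends on the operating system, so the sketch only prints whatever the client observed.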

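The test results for each abnormal behavior are shown in the table below:

| Configuration | One backend receives the request and keeps the TCP connection open, but returns nothing | One backend receives the request, then closes the TCP connection | One backend closes the TCP connection immediately after accept | Key configuration |
| num_retries: 0 | 504 Gateway Timeout; x-envoy-flags: UT,URX | 503 Service Unavailable; x-envoy-flags: UC,URX | 503 Service Unavailable; x-envoy-flags: UC,URX | - |
| num_retries: 1 (no timeout configured) | curl: (28) Operation timed out (client-side timeout) | 200 OK (failover succeeded) | 200 OK (failover succeeded) | - |
| num_retries: 1 (timeout configured, but incorrectly) | 504 Gateway Timeout; x-envoy-flags: UT | 200 OK (failover succeeded) | 200 OK (failover succeeded) | the total timeout is too short |
| num_retries: 1 (timeout configured) | 200 OK (failover succeeded) | 200 OK (failover succeeded) | 200 OK (failover succeeded) | timeout: 2.5s; num_retries: 1; per_try_timeout: 0.5s |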

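Experiment details

  • Write a dumb server in golang to simulate the three abnormal behaviors listed in the table
  • Start another, healthy http echo server
  • Have envoy forward http requests to the two backends above
    • Verify: when a request lands on the abnormal backend, can the right configuration achieve "failover"?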

golang dumb server

The source code is as follows:

package main

import (
	"bufio"
	"bytes"
	"flag"
	"fmt"
	"io"
	"log"
	"net"
	"strings"
)

var (
	addr             = flag.String("addr", "0.0.0.0:8080", "listen address")
	closeAfterAccept = flag.Bool("close_after_accept", false, "close connection immediately after accept")
)

func main() {
	flag.Parse()

	ln, err := net.Listen("tcp", *addr)
	if err != nil {
		log.Fatalf("listen on %s: %v", *addr, err)
	}
	log.Printf("listening on %s", *addr)

	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Printf("accept error: %v", err)
			continue
		}

		if *closeAfterAccept {
			conn.Close()
			continue
		}

		go handleConn(conn)
	}
}

func handleConn(conn net.Conn) {
	defer conn.Close()

	reader := bufio.NewReader(conn)
	for {
		header, err := readHTTPHeader(reader)
		if err != nil {
			if err != io.EOF {
				log.Printf("read header error: %v", err)
			}
			return
		}

		path := extractPath(header)
		switch path {
		case "/dumb":
			log.Printf("dumb request from %s, keeping connection open", conn.RemoteAddr())
			// Do nothing and wait for more data; upstream should hit a timeout.
			continue
		case "/dumb_and_close":
			log.Printf("dumb_and_close request from %s, closing connection", conn.RemoteAddr())
			return
		default:
			log.Printf("echo request for %s from %s", path, conn.RemoteAddr())
			body := header + "addr:" + *addr
			resp := fmt.Sprintf("HTTP/1.1 200 OK\r\nContent-Type: text/plain; charset=utf-8\r\nContent-Length: %d\r\nConnection: keep-alive\r\nServer: %s\r\n\r\n%s", len(body), *addr, body)
			if _, err := conn.Write([]byte(resp)); err != nil {
				log.Printf("write response error: %v", err)
				return
			}
		}
	}
}

func readHTTPHeader(r *bufio.Reader) (string, error) {
	var buf bytes.Buffer
	for {
		line, err := r.ReadBytes('\n')
		if err != nil {
			return "", err
		}
		buf.Write(line)
		if bytes.Contains(buf.Bytes(), []byte("\r\n\r\n")) {
			return buf.String(), nil
		}
	}
}

func extractPath(header string) string {
	lines := strings.Split(header, "\r\n")
	if len(lines) == 0 {
		return ""
	}
	parts := strings.Split(lines[0], " ")
	if len(parts) < 2 {
		return ""
	}
	return parts[1]
}

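  • Build it with go build
  • ./dumb_server -addr=0.0.0.0:8091 -close_after_accept=false
    • Simulates the cases where the request is received (the /dumb and /dumb_and_close paths)
  • ./dumb_server -addr=0.0.0.0:8091 -close_after_accept=true
    • Simulates the case where the tcp connection is closed right after it is established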

echo server

docker run --rm --name bbb -p 8082:5678 hashicorp/http-echo -text="hello from b"

envoy proxy configuration

The YAML configuration is as follows:

static_resources:
  listeners:
  - name: listener_4000
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 4000
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: backend_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: backend_cluster
                  # This post mainly tests the parameters below
                  timeout: 2.5s
                  retry_policy:
                    retry_on: connect-failure,refused-stream,5xx,reset
                    num_retries: 2
                    per_try_timeout: 0.5s
                    retry_back_off:
                      base_interval: 1s
                      max_interval: 60s

                response_headers_to_add:
                - header:
                    key: "X-Envoy-Flags"
                    value: "%RESPONSE_FLAGS%"
                - header:
                    key: "X-Upstream-Host"
                    value: "%UPSTREAM_HOST%"

          http_filters:
          - name: envoy.filters.http.router
            typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          access_log:
          - name: envoy.access_loggers.stdout
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
              log_format:
                text_format: >
                  [time: %START_TIME%]
                  %DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%
                  [%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%]
                  status=%RESPONSE_CODE%
                  flags=%RESPONSE_FLAGS%
                  duration=%DURATION%ms
                  upstream_host=%UPSTREAM_HOST%
                  upstream_cluster=%UPSTREAM_CLUSTER%
                  upstream_local_address=%UPSTREAM_LOCAL_ADDRESS%
                  upstream_transport_failure=%UPSTREAM_TRANSPORT_FAILURE_REASON%
                  downstream=%DOWNSTREAM_REMOTE_ADDRESS%
                  user-agent="%REQ(USER-AGENT)%"
                  xff="%REQ(X-FORWARDED-FOR)%"
                  authority="%REQ(:AUTHORITY)%"
                  request_id="%REQ(X-REQUEST-ID)%"
                  \n

                  
#                  retry_count=%RETRY_COUNT%

  clusters:
  - name: backend_cluster
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: backend_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: host.docker.internal
                port_value: 8091
        - endpoint:
            address:
              socket_address:
                address: host.docker.internal
                port_value: 8082

admin:
  access_log_path: /dev/null
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901
  • Start envoy:
docker run --rm -it \
  -v $(pwd)/envoy.yaml:/etc/envoy/envoy.yaml \
  -p 4000:4000 \
  -p 9901:9901 \
  envoyproxy/envoy:v1.29.1

Experiment steps

1. Without retry, the backend receives the GET request but never replies

  • envoy configuration:
                route:
                  cluster: backend_cluster
                  retry_policy:
                    retry_on: connect-failure,refused-stream,5xx,reset
                    num_retries: 0
                    per_try_timeout: 2s
                    retry_back_off:
                      base_interval: 3s
                      max_interval: 60s
  • dumb server

    • ./dumb_server -addr=0.0.0.0:8091 -close_after_accept=false
  • Simulated request:

time curl "http://127.0.0.1:4000/dumb" -v --max-time 3
  • Result:
    • 504 Gateway Timeout
    • x-envoy-flags: UT,URX
      • UT Upstream Request Timeout
      • URX Upstream Retry Limit Exceeded (with num_retries: 0, no retry is allowed)
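
To make the x-envoy-flags header easier to read while experimenting, here is a small sketch that splits the header value and attaches a meaning to each flag. The descriptions paraphrase my reading of Envoy's access-log documentation for the four flags listed; the map is deliberately incomplete, and describeFlags is a name invented for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// responseFlags covers only a handful of Envoy response flags
// (paraphrased from the access-log documentation; not a complete list).
var responseFlags = map[string]string{
	"UT":  "upstream request timeout",
	"UC":  "upstream connection termination",
	"UF":  "upstream connection failure",
	"URX": "upstream retry limit (HTTP) or maximum connect attempts (TCP) reached",
}

// describeFlags turns a header value such as "UT,URX" into readable lines.
// Flags missing from the map are reported as unknown rather than guessed.
func describeFlags(header string) []string {
	var out []string
	for _, f := range strings.Split(header, ",") {
		f = strings.TrimSpace(f)
		if f == "" || f == "-" {
			continue
		}
		desc, ok := responseFlags[f]
		if !ok {
			desc = "unknown flag"
		}
		out = append(out, f+": "+desc)
	}
	return out
}

func main() {
	for _, line := range describeFlags("UT,URX") {
		fmt.Println(line)
	}
}
```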

2. Without retry, the backend receives the GET request and then closes the tcp connection

  • envoy configuration:
                route:
                  cluster: backend_cluster
                  retry_policy:
                    retry_on: connect-failure,refused-stream,5xx,reset
                    num_retries: 0
                    per_try_timeout: 2s
                    retry_back_off:
                      base_interval: 3s
                      max_interval: 60s
  • dumb server

    • ./dumb_server -addr=0.0.0.0:8091 -close_after_accept=false
  • Simulated request:

time curl "http://127.0.0.1:4000/dumb_and_close" -v --max-time 3
  • Result:
    • 503 Service Unavailable
    • x-envoy-flags: UC,URX
      • UC Upstream Connection Termination
      • URX Upstream Retry Limit Exceeded

3. Without retry, the backend closes the connection immediately after accept

  • envoy configuration:
                route:
                  cluster: backend_cluster
                  retry_policy:
                    retry_on: connect-failure,refused-stream,5xx,reset
                    num_retries: 0
                    per_try_timeout: 2s
                    retry_back_off:
                      base_interval: 3s
                      max_interval: 60s
  • dumb server

    • ./dumb_server -addr=0.0.0.0:8091 -close_after_accept=true
  • Simulated request:

time curl "http://127.0.0.1:4000/anything" -v --max-time 3
  • Result:
    • 503 Service Unavailable
    • x-envoy-flags: UC,URX
      • UC Upstream Connection Termination
      • URX Upstream Retry Limit Exceeded

4. With retry, the backend receives the GET request but never replies

  • envoy configuration:
                route:
                  cluster: backend_cluster
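                  timeout: 2.5s  # timeout is the total time budget including retries; it must be greater than per_try_timeout + retry_back_off.base_interval, with time left over for the retried request itself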
                  retry_policy:
                    retry_on: connect-failure,refused-stream,5xx,reset
                    num_retries: 1
                    per_try_timeout: 0.5s
                    retry_back_off:
                      base_interval: 1s
                      max_interval: 60s
  • dumb server

    • ./dumb_server -addr=0.0.0.0:8091 -close_after_accept=false
  • Simulated request:

time curl "http://127.0.0.1:4000/dumb" -v --max-time 3
  • Result:
    • 200 OK
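
The timeout constraint in the configuration above can be checked with a little arithmetic. The sketch below assumes the backoff described in Envoy's router documentation as I understand it (fully jittered, with retry N waiting somewhere in [0, (2^N - 1) * base_interval), capped by max_interval) and assumes every failed attempt uses its full per_try_timeout. The function name worstCaseBeforeLastTry is invented for this example.

```go
package main

import (
	"fmt"
	"time"
)

// worstCaseBeforeLastTry returns the longest time that can pass before the
// final attempt starts, assuming every earlier attempt runs into
// per_try_timeout and every backoff takes its maximum value.
func worstCaseBeforeLastTry(perTry, base, maxInterval time.Duration, retries int) time.Duration {
	var total time.Duration
	for n := 1; n <= retries; n++ {
		// Upper bound of the jittered backoff before retry n (assumption, see lead-in).
		backoff := base * time.Duration((1<<n)-1)
		if backoff > maxInterval {
			backoff = maxInterval
		}
		total += perTry + backoff
	}
	return total
}

func main() {
	// Values from step 4: timeout 2.5s, per_try_timeout 0.5s, base_interval 1s, max_interval 60s.
	timeout := 2500 * time.Millisecond
	used := worstCaseBeforeLastTry(500*time.Millisecond, time.Second, 60*time.Second, 1)
	fmt.Printf("worst case before the retry starts: %v, left for the retry: %v\n", used, timeout-used)
}
```

With the values from step 4, at most 1.5s is spent before the retry starts, so the 2.5s total still leaves about 1s for the retried request. Under the same assumption, the same inputs with num_retries: 2 give a worst case of 5s, which would no longer fit into a 2.5s timeout.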

The other experiments are combinations of the methods above.
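
For running such combinations repeatedly, a small Go client can stand in for the curl commands. This is only a sketch: it assumes envoy listens on 127.0.0.1:4000 as configured above, uses a 3-second client timeout to mirror curl --max-time 3, and counts failed calls (for example client-side timeouts) under status 0. The names countStatuses and httpGet are invented for this example.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// countStatuses calls get n times and tallies the returned status codes.
// Calls that fail outright (e.g. client timeout) are counted under key 0.
func countStatuses(get func(url string) (int, error), url string, n int) map[int]int {
	counts := make(map[int]int)
	for i := 0; i < n; i++ {
		code, err := get(url)
		if err != nil {
			code = 0
		}
		counts[code]++
	}
	return counts
}

// httpGet adapts an http.Client to the getter signature used above.
func httpGet(client *http.Client) func(string) (int, error) {
	return func(url string) (int, error) {
		resp, err := client.Get(url)
		if err != nil {
			return 0, err
		}
		defer resp.Body.Close()
		return resp.StatusCode, nil
	}
}

func main() {
	client := &http.Client{Timeout: 3 * time.Second}
	for _, path := range []string{"/dumb", "/dumb_and_close"} {
		counts := countStatuses(httpGet(client), "http://127.0.0.1:4000"+path, 10)
		fmt.Println(path, counts)
	}
}
```

Because the cluster uses ROUND_ROBIN, roughly half of the requests should first land on the dumb server, so a tally containing only 200 suggests that failover worked for that path.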
Have fun. 😃

posted on 2025-11-28 16:00 by ahfuzhang