Notes on handling a Kubernetes cluster failure

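Yesterday, adding a control-plane node to a high-availability cluster left etcd unable to start and brought the cluster down. This post records how the failure was handled.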

The Kubernetes version is 1.24. Before the join, the cluster had only one control-plane node, with hostname kube-master0; the control-plane node being added has hostname kube-master1.

The command used to join the control-plane node was as follows; for details see https://q.cnblogs.com/q/139137/

kubeadm join k8s-api:6443 \
  --token ****** \
  --discovery-token-ca-cert-hash ****** \
  --control-plane \
  --certificate-key *****

The failure occurred at the stage where etcd joins the cluster:

[etcd] Announced new etcd member joining to the existing etcd cluster
[etcd] Creating static Pod manifest for "etcd"
[etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s
[kubelet-check] Initial timeout of 40s passed.

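After the failure, neither etcd nor api-server on kube-master0 could start properly.

I started etcd manually with the command below.

  • The etcd port numbers were changed to start with 3, to avoid conflicts with the existing etcd ports
  • nerdctl is a CLI tool for containerd (compatible with the docker command-line syntax)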
nerdctl run --network host -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd -it registry.aliyuncs.com/google_containers/etcd:3.5.3-0 etcd \
   --advertise-client-urls=https://10.0.9.171:3379 \
   --cert-file=/etc/kubernetes/pki/etcd/server.crt \
   --client-cert-auth=true \
   --data-dir=/var/lib/etcd \
   --experimental-initial-corrupt-check=true \
   --initial-advertise-peer-urls=https://10.0.9.171:3380 \
   --initial-cluster=kube-master0=https://10.0.9.171:3380 \
   --key-file=/etc/kubernetes/pki/etcd/server.key \
   --listen-client-urls=https://127.0.0.1:3379,https://10.0.9.171:3379 \
   --listen-metrics-urls=http://127.0.0.1:3381 \
   --listen-peer-urls=https://10.0.9.171:3380 \
   --name=kube-master0 \
   --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
   --peer-client-cert-auth=true \
   --peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
   --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
   --snapshot-count=10000 \
   --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

After that, the etcd member information can be viewed with etcdctl:

nerdctl exec -it b670f6396b5a etcdctl --endpoints 127.0.0.1:3379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key member list -w table

The member list output shows only one member, kube-master0:

+------------------+---------+--------------+-------------------------+-------------------------+------------+
|        ID        | STATUS  |     NAME     |       PEER ADDRS        |      CLIENT ADDRS       | IS LEARNER |
+------------------+---------+--------------+-------------------------+-------------------------+------------+
| 1a4da1e7353311e6 | started | kube-master0 | https://10.0.9.171:3380 | https://10.0.9.171:3379 |      false |
+------------------+---------+--------------+-------------------------+-------------------------+------------+

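I had assumed the problem was caused by the other control-plane node (kube-master1) having joined the etcd cluster. If so, it could be removed with the member remove command. But with only kube-master0 in the list, why would etcd fail to start?

Looking for clues in the logs, I found log files whose names start with etcd- in /var/log/containers/, and one line in them was an important lead: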

2022-05-19T22:09:55.110249318+08:00 stderr F {"level":"info","ts":"2022-05-19T14:09:55.110Z","caller":"rafthttp/transport.go:317","msg":"added remote peer","local-member-id":"896d19d1d0a08f49","remote-peer-id":"ac17da10883377fc","remote-peer-urls":["https://10.0.9.215:2380"]}

10.0.9.215 is the IP address of kube-master1. What is puzzling is that member list does not contain this IP, so why was this peer added?
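To pull every such peer address out of the logs at once, a small filter like the following can help. It is only a sketch: the function name remote_peer_urls is made up for illustration, and it assumes the JSON log layout shown in the line above plus standard grep, sed, tr, and sort.

```shell
# Hypothetical helper: print the URLs from etcd "added remote peer" log lines.
# Reads log lines on stdin; assumes the JSON layout of the log line quoted above.
remote_peer_urls() {
  grep '"msg":"added remote peer"' \
    | grep -o '"remote-peer-urls":\[[^]]*]' \
    | sed -e 's/^"remote-peer-urls":\[//' -e 's/]$//' -e 's/"//g' \
    | tr ',' '\n' \
    | sort -u
}
```

It could be fed with something like cat /var/log/containers/etcd-* | remote_peer_urls, which lists each distinct peer URL that etcd tried to add.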

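Back to etcd: I looked further with etcdctl get /registry --prefix --keys-only, and the result was empty. This etcd held no data for the k8s cluster, which was odd. (A likely explanation: the nerdctl run command above mounts only /etc/kubernetes/pki/etcd, not the host's /var/lib/etcd, so the manually started etcd ran on an empty data directory inside the container and formed a fresh single-member cluster. That would also explain why the member ID in member list differs from the local-member-id in the log.)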

Continuing to look for clues, I read the startup log of the etcd container carefully and noticed the following parameter:

"force-new-cluster":false

The documentation on the etcd website explains what this parameter is for:

start etcd with the --force-new-cluster option and pointing to the backup directory. This will initialize a new, single-member cluster with the default advertised peer URLs, but preserve the entire contents of the etcd data store.

That looked promising right away, so I tried setting force-new-cluster to true.
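A plausible reason this helps, offered as an assumption because the post does not confirm it: etcd needs a majority (quorum) of its members to make progress. If the member add for kube-master1 was committed on kube-master0 but the new etcd on kube-master1 never came up, the cluster had 2 members, needed 2 to agree, and had only 1 running. --force-new-cluster resets membership to a single member, so the quorum drops back to 1. The arithmetic, using a hypothetical quorum helper:

```shell
# Hypothetical helper: majority size for an etcd cluster with N voting members.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Show quorum and the number of member failures that can be tolerated.
for n in 1 2 3; do
  echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(( n - $(quorum "$n") ))"
done
```

A 2-member cluster tolerates no failures at all, which is why the step from 1 to 2 members is the fragile moment when growing a single-node etcd.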

Open etcd.yaml:

vi /etc/kubernetes/manifests/etcd.yaml
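Before changing the manifest, it may be worth copying both the manifest and the etcd data directory somewhere safe while etcd is not running, since --force-new-cluster rewrites cluster membership. The helper below is a sketch: the name backup_etcd_state and the backup location are assumptions, not something from the original incident. The backup should live outside /etc/kubernetes/manifests, because kubelet treats files in that directory as static Pod manifests.

```shell
# Hypothetical helper: copy the etcd manifest and data directory into a backup directory.
# Usage: backup_etcd_state <manifest-file> <data-dir> <backup-dir>
backup_etcd_state() {
  manifest=$1
  data_dir=$2
  backup_dir=$3
  mkdir -p "$backup_dir" || return 1
  # -p keeps mode and timestamps of the manifest.
  cp -p "$manifest" "$backup_dir/" || return 1
  # -a copies the data directory recursively, preserving attributes.
  cp -a "$data_dir" "$backup_dir/" || return 1
  echo "$backup_dir"
}
```

For example, backup_etcd_state /etc/kubernetes/manifests/etcd.yaml /var/lib/etcd /root/etcd-backup, where /root/etcd-backup is an arbitrary illustrative path.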

Add the flag to command:

spec:
  containers:
  - command:
    - etcd
    # ... 
    - --force-new-cluster

Restart kubelet:

systemctl restart kubelet

Then the miracle happened: etcd started successfully almost immediately, and the cluster quickly returned to normal!

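Wrap-up: remove the force-new-cluster flag that was just added, so that etcd does not reset its membership again on a later restart.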
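The cleanup can be done by hand in vi, or with a small filter like this sketch. The helper name strip_force_new_cluster is hypothetical, and it assumes the flag sits on its own list-item line, as in the command snippet above. It writes through a temporary file outside the manifests directory instead of relying on sed -i, whose syntax differs between platforms.

```shell
# Hypothetical helper: remove the "- --force-new-cluster" list item from a manifest file.
# Usage: strip_force_new_cluster <manifest-file>
strip_force_new_cluster() {
  manifest=$1
  [ -f "$manifest" ] || return 1
  tmp=$(mktemp) || return 1
  # Keep every line except the flag's own list item (optionally written as "=true").
  grep -v -E '^[[:space:]]*-[[:space:]]+--force-new-cluster(=true)?[[:space:]]*$' "$manifest" > "$tmp"
  # grep exits 0 or 1 on success; 2 or more means it could not read the file.
  if [ $? -gt 1 ]; then
    rm -f "$tmp"
    return 1
  fi
  # Overwrite in place so the manifest keeps its owner and permissions.
  cat "$tmp" > "$manifest"
  rm -f "$tmp"
}
```

Running strip_force_new_cluster /etc/kubernetes/manifests/etcd.yaml leaves all other flags untouched; kubelet watches the manifests directory and should recreate the etcd Pod from the updated file.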

posted @ 2022-05-20 15:17  dudu