Fixing inconsistent resource listings in a k8s high-availability cluster
Problem: when querying cluster information, the results differ between runs — the same kubectl command returns different output:
[root@k8s-master01 src]# kubectl get pod -A
NAMESPACE     NAME                               READY   STATUS    RESTARTS        AGE
kube-system   cilium-5j5cr                       1/1     Running   0               26m
kube-system   cilium-7svvx                       1/1     Running   3 (7m59s ago)   17m
kube-system   cilium-dqg4k                       1/1     Running   3 (5m9s ago)    17m
kube-system   cilium-envoy-9qxnr                 1/1     Running   0               26m
kube-system   cilium-envoy-blnws                 1/1     Running   0               26m
kube-system   cilium-envoy-jvc6p                 1/1     Running   0               17m
kube-system   cilium-envoy-rjcwk                 1/1     Running   0               26m
kube-system   cilium-envoy-wwqx6                 1/1     Running   0               26m
kube-system   cilium-kd5gq                       1/1     Running   1 (11m ago)     17m
kube-system   cilium-nlwb9                       1/1     Running   2 (8m15s ago)   17m
kube-system   cilium-operator-766b6d6444-s5m4l   1/1     Running   4 (91s ago)     26m
kube-system   cilium-operator-766b6d6444-tds5l   1/1     Running   6 (2m35s ago)   26m
[root@k8s-master01 src]# kubectl get pod -A
NAMESPACE     NAME                               READY   STATUS        RESTARTS        AGE
kube-system   cilium-5j5cr                       0/1     Pending       0               27m
kube-system   cilium-7svvx                       0/1     Pending       0               17m
kube-system   cilium-dqg4k                       0/1     Pending       0               17m
kube-system   cilium-envoy-9qxnr                 1/1     Running       0               27m
kube-system   cilium-envoy-blnws                 0/1     Running       0               27m
kube-system   cilium-envoy-jvc6p                 0/1     Pending       0               17m
kube-system   cilium-envoy-rjcwk                 1/1     Running       0               27m
kube-system   cilium-envoy-wwqx6                 0/1     Pending       0               27m
kube-system   cilium-kd5gq                       1/1     Running       1 (11m ago)     17m
kube-system   cilium-lwqbs                       0/1     Terminating   0               27m
kube-system   cilium-nlwb9                       1/1     Running       2 (8m33s ago)   17m
kube-system   cilium-operator-766b6d6444-s5m4l   1/1     Running       4 (109s ago)    27m
kube-system   cilium-operator-766b6d6444-tds5l   0/1     Pending       0               27m
Analysis: in a Kubernetes cluster all resource state is stored in the etcd cluster, so this behaviour clearly points to inconsistent etcd data. Because both correct and incorrect results appear, some etcd members must still hold correct data while others hold bad data.
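Before shutting anything down, one way to corroborate this is to compare what each etcd endpoint reports about its own state; a member whose raft term/index or DB size clearly lags the others is the likely suspect. A minimal check, assuming the same endpoints and certificate paths used in the commands later in this post:

# Compare raft term/index and DB size across the three members; an outlier points at the bad member.
etcdctl \
  --endpoints=https://192.168.110.21:2379,https://192.168.110.22:2379,https://192.168.110.23:2379 \
  --cacert=/data/etcd/ssl/ca.pem \
  --cert=/data/etcd/ssl/etcd.pem \
  --key=/data/etcd/ssl/etcd-key.pem \
  endpoint status -w table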
Locating the faulty node: find the etcd member that holds the bad data.
1. Stop etcd on one of the nodes.
2. With it stopped, run kubectl get pod repeatedly and check whether the inconsistent output still appears.
3. If the bad data still shows up, start the etcd you just stopped, stop etcd on another node, and verify again, and so on (a shell sketch of this elimination loop follows the list).
4. Throughout the process, keep at least two etcd members running so the cluster retains quorum.
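A rough shell sketch of that elimination loop, run from one of the masters; the node IPs, root SSH access, and the etcd systemd unit name are assumptions taken from the rest of this post:

# Stop one etcd member at a time and watch whether the kubectl output stabilises.
# Never have more than one of the three members stopped at any moment.
for node in 192.168.110.21 192.168.110.22 192.168.110.23; do
  echo "=== stopping etcd on ${node} ==="
  ssh root@${node} "systemctl stop etcd"
  sleep 5
  kubectl get pod -A    # run a few times; stable output means the stopped member held the bad data
  kubectl get pod -A
  ssh root@${node} "systemctl start etcd"
  sleep 10
done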
Recovery: rebuild the faulty etcd member.
1. Stop the etcd service on the faulty node identified above. (Run on the faulty node)
systemctl stop etcd
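A quick sanity check that the service really stopped before it is removed from the cluster (assuming etcd runs as a systemd unit, as elsewhere in this post):

systemctl is-active etcd            # expect "inactive" or "failed"
ss -lntp | grep -E '2379|2380'      # no output means the client/peer ports are no longer listening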
2. Confirm which member is down. (Run on a healthy node)
[root@k8s-master01 ~]# etcdctl \
> --endpoints=https://192.168.110.21:2379,https://192.168.110.22:2379,https://192.168.110.23:2379 \
> --cacert=/data/etcd/ssl/ca.pem \
> --cert=/data/etcd/ssl/etcd.pem \
> --key=/data/etcd/ssl/etcd-key.pem \
> endpoint health
{"level":"warn","ts":"2025-05-19T09:05:16.851688+0800","logger":"client","caller":"v3@v3.5.15/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000494000/192.168.110.23:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 192.168.110.23:2379: connect: connection refused\""}
https://192.168.110.21:2379 is healthy: successfully committed proposal: took = 23.456088ms
https://192.168.110.22:2379 is healthy: successfully committed proposal: took = 100.589203ms
https://192.168.110.23:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
3. Look up the member ID of the faulty etcd. (Run on a healthy node)
[root@k8s-master01 ~]# etcdctl --endpoints=https://192.168.110.21:2379,https://192.168.110.22:2379,https://192.168.110.23:2379 --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/etcd.pem --key=/data/etcd/ssl/etcd-key.pem member list
1ef655531ea85a4e, started, k8s-master02, https://192.168.110.22:2380, https://192.168.110.22:2379, false
2eb7522118e4c4da, started, k8s-master03, https://192.168.110.23:2380, https://192.168.110.23:2379, false
b18dd96c7b422fe9, started, k8s-master01, https://192.168.110.21:2380, https://192.168.110.21:2379, false
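If you prefer not to copy the ID by hand, it can be pulled from the same output; the node name below (k8s-master03) is the faulty member from the member list above:

# Extract the member ID of the faulty node from the "member list" output.
FAULTY_ID=$(etcdctl --endpoints=https://192.168.110.21:2379,https://192.168.110.22:2379,https://192.168.110.23:2379 \
  --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/etcd.pem --key=/data/etcd/ssl/etcd-key.pem \
  member list | awk -F', ' '$3 == "k8s-master03" {print $1}')
echo "${FAULTY_ID}"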
4. Remove the faulty member from the cluster. (Run on a healthy node)
[root@k8s-master01 ~]# etcdctl --endpoints=https://192.168.110.21:2379,https://192.168.110.22:2379,https://192.168.110.23:2379 --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/etcd.pem --key=/data/etcd/ssl/etcd-key.pem member remove 2eb7522118e4c4da
Member 2eb7522118e4c4da removed from cluster 967917bceb4fea7d
[root@k8s-master01 ~]# etcdctl --endpoints=https://192.168.110.21:2379,https://192.168.110.22:2379,https://192.168.110.23:2379 --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/etcd.pem --key=/data/etcd/ssl/etcd-key.pem member list
1ef655531ea85a4e, started, k8s-master02, https://192.168.110.22:2380, https://192.168.110.22:2379, false
b18dd96c7b422fe9, started, k8s-master01, https://192.168.110.21:2380, https://192.168.110.21:2379, false
5. Clear the faulty member's data (etcd data directory: /data/etcd/data). (Run on the faulty node)
rm -rf /data/etcd/data/*
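If you would rather keep the diverged data around for later inspection, an alternative is to move the directory aside instead of deleting it outright (the backup path is only an example):

# Move the old data out of the way and recreate an empty data directory.
mv /data/etcd/data /data/etcd/data.bak-$(date +%F-%H%M%S)
mkdir -p /data/etcd/data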
6. Edit the etcd configuration file. (Run on the faulty node; for the original configuration see the ETCD集群部署 post)
The value of the initial-cluster-state parameter must be changed to: existing
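For reference, the lines of the etcd YAML configuration that matter for rejoining look roughly like this; the member name, data directory, and peer URLs are taken from the values used elsewhere in this post and may differ from your actual file:

# Relevant lines only; see the ETCD集群部署 post for the full configuration.
name: 'k8s-master03'
data-dir: '/data/etcd/data'
initial-cluster: 'k8s-master01=https://192.168.110.21:2380,k8s-master02=https://192.168.110.22:2380,k8s-master03=https://192.168.110.23:2380'
initial-cluster-state: 'existing'    # was 'new' when the cluster was first bootstrapped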
7. Add the member back into the etcd cluster. (Run on a healthy node)
[root@k8s-master01 ~]# etcdctl --endpoints=https://192.168.110.21:2379,https://192.168.110.22:2379,https://192.168.110.23:2379 --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/etcd.pem --key=/data/etcd/ssl/etcd-key.pem member add k8s-master03 --peer-urls=https://192.168.110.23:2380
Member 1028f9549c81c247 added to cluster 967917bceb4fea7d
8. Start etcd. (Run on the faulty node)
systemctl start etcd
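Once etcd is back up, it is worth confirming that the member has rejoined and that the original symptom is gone; a quick check, reusing the endpoints and certificates from the steps above:

# All three endpoints should now report healthy.
etcdctl --endpoints=https://192.168.110.21:2379,https://192.168.110.22:2379,https://192.168.110.23:2379 \
  --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/etcd.pem --key=/data/etcd/ssl/etcd-key.pem \
  endpoint health
# Re-run the original query several times; the output should no longer change between runs.
kubectl get pod -A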
