Handling inconsistent resource information when querying a Kubernetes HA cluster

Problem: when querying cluster resources, repeated queries return different results.

[root@k8s-master01 src]# kubectl get pod -A
NAMESPACE     NAME                               READY   STATUS    RESTARTS        AGE
kube-system   cilium-5j5cr                       1/1     Running   0               26m
kube-system   cilium-7svvx                       1/1     Running   3 (7m59s ago)   17m
kube-system   cilium-dqg4k                       1/1     Running   3 (5m9s ago)    17m
kube-system   cilium-envoy-9qxnr                 1/1     Running   0               26m
kube-system   cilium-envoy-blnws                 1/1     Running   0               26m
kube-system   cilium-envoy-jvc6p                 1/1     Running   0               17m
kube-system   cilium-envoy-rjcwk                 1/1     Running   0               26m
kube-system   cilium-envoy-wwqx6                 1/1     Running   0               26m
kube-system   cilium-kd5gq                       1/1     Running   1 (11m ago)     17m
kube-system   cilium-nlwb9                       1/1     Running   2 (8m15s ago)   17m
kube-system   cilium-operator-766b6d6444-s5m4l   1/1     Running   4 (91s ago)     26m
kube-system   cilium-operator-766b6d6444-tds5l   1/1     Running   6 (2m35s ago)   26m
[root@k8s-master01 src]# kubectl get pod -A
NAMESPACE     NAME                               READY   STATUS        RESTARTS        AGE
kube-system   cilium-5j5cr                       0/1     Pending       0               27m
kube-system   cilium-7svvx                       0/1     Pending       0               17m
kube-system   cilium-dqg4k                       0/1     Pending       0               17m
kube-system   cilium-envoy-9qxnr                 1/1     Running       0               27m
kube-system   cilium-envoy-blnws                 0/1     Running       0               27m
kube-system   cilium-envoy-jvc6p                 0/1     Pending       0               17m
kube-system   cilium-envoy-rjcwk                 1/1     Running       0               27m
kube-system   cilium-envoy-wwqx6                 0/1     Pending       0               27m
kube-system   cilium-kd5gq                       1/1     Running       1 (11m ago)     17m
kube-system   cilium-lwqbs                       0/1     Terminating   0               27m
kube-system   cilium-nlwb9                       1/1     Running       2 (8m33s ago)   17m
kube-system   cilium-operator-766b6d6444-s5m4l   1/1     Running       4 (109s ago)    27m
kube-system   cilium-operator-766b6d6444-tds5l   0/1     Pending       0               27m

 

Analysis: in a Kubernetes cluster, resource state is stored in the etcd cluster, so this symptom strongly suggests inconsistent etcd data. Since both normal and abnormal results appear, some etcd members must hold correct data while at least one member holds corrupted data, and successive queries end up being served from different members.
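One way to confirm this directly is to compare each member's view of the data. A minimal sketch, reusing the endpoints and certificate paths that appear in the recovery commands below (adapt them to your environment):

etcdctl --endpoints=https://192.168.110.21:2379,https://192.168.110.22:2379,https://192.168.110.23:2379 --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/etcd.pem --key=/data/etcd/ssl/etcd-key.pem endpoint status -w table

A member whose revision or DB size clearly diverges from the others is the most likely holder of the bad data.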

 

Locating the faulty node: find the etcd member that holds the abnormal data.

  1. Stop etcd on one of the nodes.

  2. With that etcd stopped, run kubectl get pod a few times and check whether abnormal data still appears.

  3. If abnormal data still appears, start the etcd you just stopped, stop etcd on another node, and verify again, and so on until stopping one particular node makes the output consistent (a sketch of one round of this check follows the list).

  4. Throughout the process, keep at least two etcd members running so the cluster retains quorum.
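A minimal sketch of one round of that check, run on each etcd node in turn (assuming etcd runs as a systemd unit named etcd, as in the recovery steps below):

systemctl stop etcd        # temporarily take this member out of the cluster
kubectl get pod -A         # run several times from a master node;
                           # if the output is now consistent every time, this is the faulty member
systemctl start etcd       # otherwise bring the member back and move on to the next node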

 

Recovery steps:

  1. Stop the etcd service on the node identified as faulty. (run on the faulty node)

systemctl stop etcd
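Before touching the cluster membership, it may be worth confirming the service really is down, for example:

systemctl is-active etcd     # expect "inactive" (or "failed")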

  2. Check endpoint health and confirm the stopped node is reported unhealthy. (run on a healthy node)

[root@k8s-master01 ~]# etcdctl \
> --endpoints=https://192.168.110.21:2379,https://192.168.110.22:2379,https://192.168.110.23:2379 \
> --cacert=/data/etcd/ssl/ca.pem  \
> --cert=/data/etcd/ssl/etcd.pem  \
> --key=/data/etcd/ssl/etcd-key.pem \
> endpoint health
{"level":"warn","ts":"2025-05-19T09:05:16.851688+0800","logger":"client","caller":"v3@v3.5.15/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000494000/192.168.110.23:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 192.168.110.23:2379: connect: connection refused\""}
https://192.168.110.21:2379 is healthy: successfully committed proposal: took = 23.456088ms
https://192.168.110.22:2379 is healthy: successfully committed proposal: took = 100.589203ms
https://192.168.110.23:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster

  3. Find the member ID of the faulty etcd node. (run on a healthy node)

[root@k8s-master01 ~]# etcdctl --endpoints=https://192.168.110.21:2379,https://192.168.110.22:2379,https://192.168.110.23:2379 --cacert=/data/etcd/ssl/ca.pem  --cert=/data/etcd/ssl/etcd.pem  --key=/data/etcd/ssl/etcd-key.pem member list
1ef655531ea85a4e, started, k8s-master02, https://192.168.110.22:2380, https://192.168.110.22:2379, false
2eb7522118e4c4da, started, k8s-master03, https://192.168.110.23:2380, https://192.168.110.23:2379, false
b18dd96c7b422fe9, started, k8s-master01, https://192.168.110.21:2380, https://192.168.110.21:2379, false

  4. Remove the faulty etcd member from the cluster. (run on a healthy node)

[root@k8s-master01 ~]# etcdctl --endpoints=https://192.168.110.21:2379,https://192.168.110.22:2379,https://192.168.110.23:2379 --cacert=/data/etcd/ssl/ca.pem  --cert=/data/etcd/ssl/etcd.pem  --key=/data/etcd/ssl/etcd-key.pem member remove 2eb7522118e4c4da
Member 2eb7522118e4c4da removed from cluster 967917bceb4fea7d

[root@k8s-master01 ~]# etcdctl --endpoints=https://192.168.110.21:2379,https://192.168.110.22:2379,https://192.168.110.23:2379 --cacert=/data/etcd/ssl/ca.pem  --cert=/data/etcd/ssl/etcd.pem  --key=/data/etcd/ssl/etcd-key.pem member list
1ef655531ea85a4e, started, k8s-master02, https://192.168.110.22:2380, https://192.168.110.22:2379, false
b18dd96c7b422fe9, started, k8s-master01, https://192.168.110.21:2380, https://192.168.110.21:2379, false

  5. Clear the stale etcd data (etcd data directory: /data/etcd/data). (run on the faulty node)

rm -rf /data/etcd/data/*
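As a more cautious alternative (my own suggestion, not part of the original procedure), the data directory can be moved aside instead of deleted, so the corrupted data remains available for inspection:

mv /data/etcd/data /data/etcd/data.bak.$(date +%Y%m%d)   # keep a copy of the bad data
mkdir -p /data/etcd/data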

  6. Modify the etcd configuration file. (run on the faulty node) (for the original configuration, see the ETCD cluster deployment post)

  Set the initial-cluster-state parameter to: existing
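A minimal sketch of the relevant lines, assuming a YAML-style etcd configuration file; the member names and peer URLs are taken from the member list output above, and everything else should stay as it was in the original deployment:

name: 'k8s-master03'
initial-cluster: 'k8s-master01=https://192.168.110.21:2380,k8s-master02=https://192.168.110.22:2380,k8s-master03=https://192.168.110.23:2380'
initial-cluster-state: 'existing'    # was 'new' when the cluster was first bootstrapped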

  7. Re-add the node to the etcd cluster. (run on a healthy node)

[root@k8s-master01 ~]# etcdctl --endpoints=https://192.168.110.21:2379,https://192.168.110.22:2379,https://192.168.110.23:2379 --cacert=/data/etcd/ssl/ca.pem  --cert=/data/etcd/ssl/etcd.pem  --key=/data/etcd/ssl/etcd-key.pem member  add k8s-master03 --peer-urls=https://192.168.110.23:2380 
Member 1028f9549c81c247 added to cluster 967917bceb4fea7d

  8. Start etcd. (run on the faulty node)

systemctl start etcd
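Finally, re-run the earlier checks from a healthy node to confirm the rejoined member is healthy and the inconsistency is gone:

etcdctl --endpoints=https://192.168.110.21:2379,https://192.168.110.22:2379,https://192.168.110.23:2379 --cacert=/data/etcd/ssl/ca.pem --cert=/data/etcd/ssl/etcd.pem --key=/data/etcd/ssl/etcd-key.pem endpoint health
kubectl get pod -A     # run a few times; the output should now be identical on every run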

  
