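k8s cluster fails to start etcd and apiserver after a reboot

The cluster was installed with kubeadm.

Root cause: a hardware fault forced a hard reboot of the server (k8s was stopped abruptly while it was running), which corrupted the etcd data.

Incident review:

After the server came back up, log in to the k8s-master node and check the node status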

[root@k8s-master ~]# kubectl get nodes

The connection to the server 192.168.110.21:6443 was refused - did you specify the right host or port?

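[root@k8s-master ~]# journalctl -xefu kubelet   # view the detailed kubelet log

The log shows that the master node fails to reach the cluster on port 6443.

Checking the port shows that nothing is listening on it.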
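The post does not show how the port was checked. A minimal sketch, assuming `ss` from iproute2 is available on the node; the helper name `port_is_listening` is made up for this example:

```shell
# Succeed if the given TCP port appears as a listening local port.
# Expects `ss -lnt` style output on stdin (column 4 is Local Address:Port).
port_is_listening() {
  local port="$1"
  # Skip the header line, then match ":<port>" at the end of column 4.
  awk -v p=":${port}" 'NR > 1 && $4 ~ (p "$") { found = 1 } END { exit !found }'
}

# Usage on the master node (6443 is the apiserver port from the error above):
#   ss -lnt | port_is_listening 6443 && echo "listening" || echo "not listening"
```

The function only parses text, so it can also be run against saved `ss` output from another machine.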

Check the state of the containers that docker is running

[root@k8s-master ~]# docker ps -a
CONTAINER ID   IMAGE                                               COMMAND                  CREATED              STATUS                      PORTS     NAMES
a97f61ac0346   0369cf4303ff                                        "etcd --advertise-cl…"   42 seconds ago       Exited (2) 39 seconds ago             k8s_etcd_etcd-k8s-master_kube-system_31ea958b9cdd36ecc498b04d203ffbca_52
927d08edee07   b05d611c1af9                                        "kube-apiserver --ad…"   About a minute ago   Exited (1) 48 seconds ago             k8s_kube-apiserver_kube-apiserver-k8s-master_kube-system_5a71cb38b70e77a89f0bd4798e0c1566_50
820c42fad8e2   b93ab2ec4475                                        "kube-scheduler --au…"   6 minutes ago        Up 6 minutes                          k8s_kube-scheduler_kube-scheduler-k8s-master_kube-system_16980849e13ae2e69581aae0f2d57229_8
472f82f8fcc2   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 6 minutes ago        Up 6 minutes                          k8s_POD_kube-scheduler-k8s-master_kube-system_16980849e13ae2e69581aae0f2d57229_8
2fcef5bf0dd5   560dd11d4550                                        "kube-controller-man…"   6 minutes ago        Up 6 minutes                          k8s_kube-controller-manager_kube-controller-manager-k8s-master_kube-system_aa4a74d86725caefdd42727928a38cf0_7
f168ec42027f   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 6 minutes ago        Up 6 minutes                          k8s_POD_kube-controller-manager-k8s-master_kube-system_aa4a74d86725caefdd42727928a38cf0_7
5d84e3920068   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 6 minutes ago        Up 6 minutes                          k8s_POD_kube-apiserver-k8s-master_kube-system_5a71cb38b70e77a89f0bd4798e0c1566_6
1e1b9d311a19   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 6 minutes ago        Up 6 minutes                          k8s_POD_etcd-k8s-master_kube-system_31ea958b9cdd36ecc498b04d203ffbca_6
1c9b0bea920f   560dd11d4550                                        "kube-controller-man…"   32 minutes ago       Exited (2) 7 minutes ago              k8s_kube-controller-manager_kube-controller-manager-k8s-master_kube-system_aa4a74d86725caefdd42727928a38cf0_6
68a887023f35   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 32 minutes ago       Exited (0) 7 minutes ago              k8s_POD_kube-controller-manager-k8s-master_kube-system_aa4a74d86725caefdd42727928a38cf0_6
4049a51c920c   b93ab2ec4475                                        "kube-scheduler --au…"   32 minutes ago       Exited (2) 7 minutes ago              k8s_kube-scheduler_kube-scheduler-k8s-master_kube-system_16980849e13ae2e69581aae0f2d57229_7
5a10f55dea26   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 32 minutes ago       Exited (0) 7 minutes ago              k8s_POD_kube-scheduler-k8s-master_kube-system_16980849e13ae2e69581aae0f2d57229_7
c37306821358   bfe3a36ebd25                                        "/coredns -conf /etc…"   5 days ago           Exited (255) 5 hours ago              k8s_coredns_coredns-7f89b7bc75-wjwkw_kube-system_ab0ed17b-ffa1-47e1-9088-e5161624487f_0
dee526fc0c1a   9a154323fbf7                                        "/usr/bin/kube-contr…"   5 days ago           Exited (255) 5 hours ago              k8s_calico-kube-controllers_calico-kube-controllers-6949477b58-88m7w_kube-system_870adbdc-3533-4500-8296-fb567fc73fb3_0
01c6be8fa09b   bfe3a36ebd25                                        "/coredns -conf /etc…"   5 days ago           Exited (255) 5 hours ago              k8s_coredns_coredns-7f89b7bc75-wlmfm_kube-system_e922a0cd-00b2-42b8-9f47-5762987e9cf4_0
a6f4df56d5f1   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 5 days ago           Exited (0) 4 hours ago                k8s_POD_coredns-7f89b7bc75-wjwkw_kube-system_ab0ed17b-ffa1-47e1-9088-e5161624487f_0
9abb84c13213   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 5 days ago           Exited (0) 4 hours ago                k8s_POD_coredns-7f89b7bc75-wlmfm_kube-system_e922a0cd-00b2-42b8-9f47-5762987e9cf4_0
ac03b10e7b9b   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 5 days ago           Exited (0) 4 hours ago                k8s_POD_calico-kube-controllers-6949477b58-88m7w_kube-system_870adbdc-3533-4500-8296-fb567fc73fb3_0
43a86e8452ed   8b8b1eb786b5                                        "catalina.sh run"        10 days ago          Exited (255) 5 hours ago              k8s_tomcat-pod-java_demo-pod-k8s-master_default_11e914d3386adf4e1a88ac082cd9fa1b_0
7b9a2e86f0cb   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 10 days ago          Exited (0) 4 hours ago                k8s_POD_demo-pod-k8s-master_default_11e914d3386adf4e1a88ac082cd9fa1b_1
f3a63fb2582a   5a7c4970fbc2                                        "start_runit"            10 days ago          Exited (255) 5 hours ago              k8s_calico-node_calico-node-p4nd4_kube-system_830c5064-bd72-49e0-ab1e-fb8aaeb1f034_0
97e13c6f2c28   2a22066e9588                                        "/usr/local/bin/flex…"   10 days ago          Exited (128) 10 days ago              k8s_flexvol-driver_calico-node-p4nd4_kube-system_830c5064-bd72-49e0-ab1e-fb8aaeb1f034_0
2afc79d79d47   727de170e4ce                                        "/opt/cni/bin/install"   10 days ago          Exited (128) 10 days ago              k8s_install-cni_calico-node-p4nd4_kube-system_830c5064-bd72-49e0-ab1e-fb8aaeb1f034_0
be9aae69c09c   727de170e4ce                                        "/opt/cni/bin/calico…"   10 days ago          Exited (128) 10 days ago              k8s_upgrade-ipam_calico-node-p4nd4_kube-system_830c5064-bd72-49e0-ab1e-fb8aaeb1f034_0
7611f9c86ef5   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 10 days ago          Exited (0) 4 hours ago                k8s_POD_calico-node-p4nd4_kube-system_830c5064-bd72-49e0-ab1e-fb8aaeb1f034_0
385ed288bd08   9a1ebfd8124d                                        "/usr/local/bin/kube…"   10 days ago          Exited (255) 5 hours ago              k8s_kube-proxy_kube-proxy-dg9vg_kube-system_5ac97ded-bfe7-4105-9c46-97f581229e2c_0
475a79728800   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 10 days ago          Exited (0) 4 hours ago                k8s_POD_kube-proxy-dg9vg_kube-system_5ac97ded-bfe7-4105-9c46-97f581229e2c_0

Only a few components are running; the rest are in the Exited state.

 

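Solution / steps

Check the key component containers of the k8s cluster one by one

1. Check the kube-apiserver container. docker ps -a shows that the apiserver is currently in the Exited state.

Check the startup log of the apiserver container. It reports Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused. 2379 is the etcd port, so the apiserver cannot start because it cannot connect to etcd.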

 

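2. Next, check the etcd startup log. It reports mvcc: cannot unmarshal event: proto: wrong wireType = 0 for field Key. According to what I found, this error is caused by etcd data corruption after an unclean shutdown (unexpected power loss, pulling the plug). This node had indeed been shut down abnormally, which is why etcd cannot start, so fixing the corrupted data fixes the problem.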
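Steps 1 and 2 can be sketched as two small helpers. This is only an illustration: the function names and the labels they print are made up, and the `k8s_kube-apiserver` and `k8s_etcd` prefixes are taken from the container names in the docker ps -a output above.

```shell
# Print the ID of the newest container whose name matches the given string.
# `docker ps` lists the most recently created containers first.
latest_container_id() {
  docker ps -a --filter "name=$1" --format '{{.ID}}' | head -n 1
}

# Read a container log on stdin and name which of the two errors
# quoted in this post it contains (hypothetical labels).
classify_log() {
  local log
  log="$(cat)"
  case "$log" in
    *"mvcc: cannot unmarshal event"*)
      echo "etcd-data-corrupt" ;;
    *"127.0.0.1:2379: connect: connection refused"*)
      echo "etcd-unreachable" ;;
    *)
      echo "unknown" ;;
  esac
}

# Usage on the master node:
#   docker logs --tail 50 "$(latest_container_id k8s_kube-apiserver)" 2>&1 | classify_log
#   docker logs --tail 50 "$(latest_container_id k8s_etcd)" 2>&1 | classify_log
```

Container logs go to stderr as well as stdout, hence the `2>&1` before the pipe.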

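3. Following the guidance, stop the etcd service on the faulty node and delete the corrupted etcd data. In this case etcd is already not running. Back up the data before deleting it, then start the etcd service.

The container data is under /var/lib (the etcd data directory here is /var/lib/etcd); proceed as shown in the figure below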
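The commands from the figure are not reproduced here. A minimal sketch of the backup-then-clear step, assuming the data directory is /var/lib/etcd and that no etcd process is writing to it at that moment; the function name and the `.bak-<timestamp>` suffix are made up for this example:

```shell
# Move the etcd data directory aside as a timestamped backup and
# leave an empty directory in its place. Prints the backup path.
backup_etcd_data() {
  local data_dir="${1:-/var/lib/etcd}"   # assumed default location
  local backup_dir
  backup_dir="${data_dir}.bak-$(date +%Y%m%d%H%M%S)"

  [ -d "$data_dir" ] || { echo "no such directory: $data_dir" >&2; return 1; }

  # Moving instead of deleting keeps the damaged data for later inspection.
  mv "$data_dir" "$backup_dir" || return 1
  mkdir -m 700 "$data_dir" || return 1
  echo "$backup_dir"
}

# Usage on the master node:
#   backup_etcd_data /var/lib/etcd
```

Moving the directory covers both "back up" and "delete" in one step, and the old data can be removed once the cluster is confirmed healthy.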

 

4. Finally, start etcd first and then the api-server. After that, kubectl get nodes shows the node status normally and the problem is resolved.
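The post does not say how the two services were started. On a kubeadm cluster etcd and kube-apiserver normally run as static pods that kubelet restarts on its own, so one possible approach, offered here as an assumption rather than the author's exact procedure, is to restart kubelet and poll until the API answers. The `wait_for` helper is made up for this example:

```shell
# Retry a command until it succeeds: wait_for <tries> <delay-seconds> <command...>
wait_for() {
  local tries="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$tries" ]; do
    # Discard output; only the exit status matters here.
    "$@" >/dev/null 2>&1 && return 0
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage on the master node:
#   systemctl restart kubelet
#   wait_for 30 5 kubectl get nodes && kubectl get nodes
```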

 

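Check the cluster nodes

Note: this method loses all of your pods, because deleting the etcd data discards the stored cluster state (the node AGE of 37s in the output below shows the nodes registering afresh). Deploying an etcd cluster is recommended.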

[root@k8s-master ~]# kubectl get nodes
NAME         STATUS   ROLES    AGE   VERSION
k8s-master   Ready    <none>   37s   v1.20.6
k8s-node1    Ready    <none>   37s   v1.20.6
k8s-node2    Ready    <none>   37s   v1.20.6
[root@k8s-master ~]# kubectl get nodes -n kube-system -o wide
NAME         STATUS   ROLES    AGE     VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION           CONTAINER-RUNTIME
k8s-master   Ready    <none>   2m17s   v1.20.6   192.168.110.21   <none>        CentOS Linux 7 (Core)   3.10.0-1062.el7.x86_64   docker://20.10.6
k8s-node1    Ready    <none>   2m17s   v1.20.6   192.168.110.22   <none>        CentOS Linux 7 (Core)   3.10.0-1062.el7.x86_64   docker://20.10.6
k8s-node2    Ready    <none>   2m17s   v1.20.6   192.168.110.23   <none>        CentOS Linux 7 (Core)   3.10.0-1062.el7.x86_64   docker://20.10.6
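Since recovering this way discards the cluster state, a regular snapshot is what makes a real restore possible. A minimal sketch using `etcdctl snapshot save`, assuming etcdctl is installed on the master and reusing the endpoint and certificate paths from the restore command in this post; the function name and the default /wks directory with an `etcd-<date>.db` file name are illustrative:

```shell
# Save an etcd snapshot to <dest_dir>/etcd-YYYYMMDD.db and print the file path.
snapshot_etcd() {
  local dest_dir="${1:-/wks}"   # assumed backup directory
  local file
  file="${dest_dir}/etcd-$(date +%Y%m%d).db"

  # etcdctl's own progress output goes to stderr so stdout carries only the path.
  ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    snapshot save "$file" >&2 || return 1
  echo "$file"
}

# Usage, for example from a daily cron job:
#   snapshot_etcd /wks
```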

 

If you have a backup file

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key \
   --data-dir=/var/lib/etcd  snapshot restore /wks/etcd-20220905.db
 
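Notes:
If any API server is running in the cluster, do not attempt to restore etcd instances. Instead, restore etcd as follows:
  • Stop all API server instances
  • Restore the state in all etcd instances
  • Restart all API server instances
It is also recommended to restart all components (for example kube-scheduler, kube-controller-manager, kubelet) so that they do not rely on stale data.
 --data-dir  # the target path to restore into. If you are unsure which path it is, the original note suggests looking in kube-apiserver.conf; on a kubeadm cluster the etcd --data-dir flag is set in /etc/kubernetes/manifests/etcd.yaml.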
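The stop, restore, restart sequence above can be sketched for a single-master kubeadm node as follows. Everything not shown in the post is an assumption: that etcdctl is installed on the host, that kube-apiserver and etcd are static pods under /etc/kubernetes/manifests, and the made-up names `restore_etcd`, `STOP_WAIT` and the `.parked` directory.

```shell
# restore_etcd <snapshot> [manifests_dir] [data_dir]
restore_etcd() {
  local snapshot="$1"
  local manifests="${2:-/etc/kubernetes/manifests}"
  local data_dir="${3:-/var/lib/etcd}"
  local parked="${manifests}.parked"   # hypothetical holding directory

  [ -f "$snapshot" ] || { echo "snapshot not found: $snapshot" >&2; return 1; }

  # 1. Stop the API server and etcd: kubelet stops a static pod
  #    once its manifest is removed from the manifests directory.
  mkdir -p "$parked" || return 1
  mv "$manifests/kube-apiserver.yaml" "$manifests/etcd.yaml" "$parked/" || return 1
  sleep "${STOP_WAIT:-20}"             # crude wait for the containers to exit

  # 2. Keep the old data as a backup so the restore writes into a fresh directory.
  if [ -e "$data_dir" ]; then
    mv "$data_dir" "${data_dir}.bak-$(date +%Y%m%d%H%M%S)" || return 1
  fi

  # 3. Restore the snapshot into the data directory.
  ETCDCTL_API=3 etcdctl snapshot restore "$snapshot" --data-dir="$data_dir" || return 1

  # 4. Bring etcd and the API server back by returning their manifests.
  mv "$parked/etcd.yaml" "$parked/kube-apiserver.yaml" "$manifests/"
}

# Usage on the master node:
#   restore_etcd /wks/etcd-20220905.db
```

If the restore step fails, the manifests stay in the parked directory so nothing restarts against a half-restored data directory; they have to be moved back by hand.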
posted @ 2022-08-08 18:49  Armored-forces