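Kubernetes cluster fails to start etcd and kube-apiserver after a reboot
The cluster was installed with kubeadm.
Root cause: a hardware failure forced a hard restart of the server (k8s was stopped abruptly while it was running), which corrupted the etcd data.
Incident review:
After booting the server, log in to the k8s-master node and check the node status.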
[root@k8s-master ~]# kubectl get nodes
The connection to the server 192.168.110.21:6443 was refused - did you specify the right host or port?
[root@k8s-master ~]# journalctl -xefu kubelet    # view the detailed kubelet log output

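The log shows that the master node cannot reach the cluster API on port 6443.

Checking the port confirms that nothing is listening on it.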
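The port check can be sketched as follows. This is a minimal example, assuming `ss` from iproute2 is available on the node; the helper name `port_in_listing` is made up for illustration and simply scans `ss -lnt` output for a listener on the given port.

```shell
# port_in_listing PORT
# Reads `ss -lnt` style output on stdin and succeeds if some socket
# is listening on PORT. (Hypothetical helper, not part of any tool.)
port_in_listing() {
  local port="$1"
  # Column 4 of `ss -lnt` is "Local Address:Port", e.g. *:6443 or [::]:6443.
  # Compare the end of that column against ":PORT" and skip the header row.
  awk -v p=":$port" '
    NR > 1 && substr($4, length($4) - length(p) + 1) == p { found = 1 }
    END { exit !found }
  '
}

# Usage on the master node:
#   ss -lnt | port_in_listing 6443 && echo "6443 is listening" || echo "6443 is NOT listening"
```

Matching on the full `:PORT` suffix avoids false positives such as port 443 matching 6443.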
Check the state of the containers that Docker is running:
[root@k8s-master ~]# docker ps -a
CONTAINER ID   IMAGE                                               COMMAND                  CREATED              STATUS                       PORTS   NAMES
a97f61ac0346   0369cf4303ff                                        "etcd --advertise-cl…"   42 seconds ago       Exited (2) 39 seconds ago            k8s_etcd_etcd-k8s-master_kube-system_31ea958b9cdd36ecc498b04d203ffbca_52
927d08edee07   b05d611c1af9                                        "kube-apiserver --ad…"   About a minute ago   Exited (1) 48 seconds ago            k8s_kube-apiserver_kube-apiserver-k8s-master_kube-system_5a71cb38b70e77a89f0bd4798e0c1566_50
820c42fad8e2   b93ab2ec4475                                        "kube-scheduler --au…"   6 minutes ago        Up 6 minutes                         k8s_kube-scheduler_kube-scheduler-k8s-master_kube-system_16980849e13ae2e69581aae0f2d57229_8
472f82f8fcc2   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 6 minutes ago        Up 6 minutes                         k8s_POD_kube-scheduler-k8s-master_kube-system_16980849e13ae2e69581aae0f2d57229_8
2fcef5bf0dd5   560dd11d4550                                        "kube-controller-man…"   6 minutes ago        Up 6 minutes                         k8s_kube-controller-manager_kube-controller-manager-k8s-master_kube-system_aa4a74d86725caefdd42727928a38cf0_7
f168ec42027f   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 6 minutes ago        Up 6 minutes                         k8s_POD_kube-controller-manager-k8s-master_kube-system_aa4a74d86725caefdd42727928a38cf0_7
5d84e3920068   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 6 minutes ago        Up 6 minutes                         k8s_POD_kube-apiserver-k8s-master_kube-system_5a71cb38b70e77a89f0bd4798e0c1566_6
1e1b9d311a19   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 6 minutes ago        Up 6 minutes                         k8s_POD_etcd-k8s-master_kube-system_31ea958b9cdd36ecc498b04d203ffbca_6
1c9b0bea920f   560dd11d4550                                        "kube-controller-man…"   32 minutes ago       Exited (2) 7 minutes ago             k8s_kube-controller-manager_kube-controller-manager-k8s-master_kube-system_aa4a74d86725caefdd42727928a38cf0_6
68a887023f35   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 32 minutes ago       Exited (0) 7 minutes ago             k8s_POD_kube-controller-manager-k8s-master_kube-system_aa4a74d86725caefdd42727928a38cf0_6
4049a51c920c   b93ab2ec4475                                        "kube-scheduler --au…"   32 minutes ago       Exited (2) 7 minutes ago             k8s_kube-scheduler_kube-scheduler-k8s-master_kube-system_16980849e13ae2e69581aae0f2d57229_7
5a10f55dea26   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 32 minutes ago       Exited (0) 7 minutes ago             k8s_POD_kube-scheduler-k8s-master_kube-system_16980849e13ae2e69581aae0f2d57229_7
c37306821358   bfe3a36ebd25                                        "/coredns -conf /etc…"   5 days ago           Exited (255) 5 hours ago             k8s_coredns_coredns-7f89b7bc75-wjwkw_kube-system_ab0ed17b-ffa1-47e1-9088-e5161624487f_0
dee526fc0c1a   9a154323fbf7                                        "/usr/bin/kube-contr…"   5 days ago           Exited (255) 5 hours ago             k8s_calico-kube-controllers_calico-kube-controllers-6949477b58-88m7w_kube-system_870adbdc-3533-4500-8296-fb567fc73fb3_0
01c6be8fa09b   bfe3a36ebd25                                        "/coredns -conf /etc…"   5 days ago           Exited (255) 5 hours ago             k8s_coredns_coredns-7f89b7bc75-wlmfm_kube-system_e922a0cd-00b2-42b8-9f47-5762987e9cf4_0
a6f4df56d5f1   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 5 days ago           Exited (0) 4 hours ago               k8s_POD_coredns-7f89b7bc75-wjwkw_kube-system_ab0ed17b-ffa1-47e1-9088-e5161624487f_0
9abb84c13213   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 5 days ago           Exited (0) 4 hours ago               k8s_POD_coredns-7f89b7bc75-wlmfm_kube-system_e922a0cd-00b2-42b8-9f47-5762987e9cf4_0
ac03b10e7b9b   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 5 days ago           Exited (0) 4 hours ago               k8s_POD_calico-kube-controllers-6949477b58-88m7w_kube-system_870adbdc-3533-4500-8296-fb567fc73fb3_0
43a86e8452ed   8b8b1eb786b5                                        "catalina.sh run"        10 days ago          Exited (255) 5 hours ago             k8s_tomcat-pod-java_demo-pod-k8s-master_default_11e914d3386adf4e1a88ac082cd9fa1b_0
7b9a2e86f0cb   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 10 days ago          Exited (0) 4 hours ago               k8s_POD_demo-pod-k8s-master_default_11e914d3386adf4e1a88ac082cd9fa1b_1
f3a63fb2582a   5a7c4970fbc2                                        "start_runit"            10 days ago          Exited (255) 5 hours ago             k8s_calico-node_calico-node-p4nd4_kube-system_830c5064-bd72-49e0-ab1e-fb8aaeb1f034_0
97e13c6f2c28   2a22066e9588                                        "/usr/local/bin/flex…"   10 days ago          Exited (128) 10 days ago             k8s_flexvol-driver_calico-node-p4nd4_kube-system_830c5064-bd72-49e0-ab1e-fb8aaeb1f034_0
2afc79d79d47   727de170e4ce                                        "/opt/cni/bin/install"   10 days ago          Exited (128) 10 days ago             k8s_install-cni_calico-node-p4nd4_kube-system_830c5064-bd72-49e0-ab1e-fb8aaeb1f034_0
be9aae69c09c   727de170e4ce                                        "/opt/cni/bin/calico…"   10 days ago          Exited (128) 10 days ago             k8s_upgrade-ipam_calico-node-p4nd4_kube-system_830c5064-bd72-49e0-ab1e-fb8aaeb1f034_0
7611f9c86ef5   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 10 days ago          Exited (0) 4 hours ago               k8s_POD_calico-node-p4nd4_kube-system_830c5064-bd72-49e0-ab1e-fb8aaeb1f034_0
385ed288bd08   9a1ebfd8124d                                        "/usr/local/bin/kube…"   10 days ago          Exited (255) 5 hours ago             k8s_kube-proxy_kube-proxy-dg9vg_kube-system_5ac97ded-bfe7-4105-9c46-97f581229e2c_0
475a79728800   registry.aliyuncs.com/google_containers/pause:3.2   "/pause"                 10 days ago          Exited (0) 4 hours ago               k8s_POD_kube-proxy-dg9vg_kube-system_5ac97ded-bfe7-4105-9c46-97f581229e2c_0
Only a few components are running; everything else is in the Exited state.
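With this many containers, it can help to reduce the listing to the newest container of each control-plane component. The following is a small sketch, not a standard tool: `latest_status` is a made-up helper that relies on `docker ps` listing the newest containers first and on the `k8s_<container>_<pod>_...` naming visible in the output above.

```shell
# latest_status COMPONENT
# Reads "name<TAB>status" lines on stdin (newest container first) and
# prints the status of the first container belonging to COMPONENT.
latest_status() {
  local component="$1"
  awk -F'\t' -v prefix="k8s_${component}_" '
    index($1, prefix) == 1 { print $2; exit }
  '
}

# Usage on the master node:
#   for c in etcd kube-apiserver kube-scheduler kube-controller-manager; do
#     printf '%-26s %s\n' "$c" \
#       "$(docker ps -a --format '{{.Names}}\t{{.Status}}' | latest_status "$c")"
#   done
```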
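Solution / steps
Check the cluster's key component containers one by one.
1. Check the kube-apiserver container. The docker ps -a output shows that the apiserver is currently in the Exited state.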

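Look at the startup log of the apiserver container. It reports the error Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused. Port 2379 is the etcd client port, so the apiserver cannot start because it cannot connect to etcd.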
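One way to pull that log is sketched below. The helper name `component_logs` is hypothetical; it only wraps `docker ps -a -q --filter name=...` and `docker logs --tail`, and assumes the kubelet-style container names shown earlier.

```shell
# component_logs COMPONENT [LINES]
# Prints the last LINES (default 50) log lines of the most recent
# container for COMPONENT, including containers that already exited.
component_logs() {
  local component="$1" lines="${2:-50}" id
  # docker ps prints the newest container first, so take the first ID.
  id=$(docker ps -a -q --filter "name=k8s_${component}_" | head -n 1)
  if [ -z "$id" ]; then
    echo "no container found for ${component}" >&2
    return 1
  fi
  docker logs --tail "$lines" "$id" 2>&1
}

# Usage:
#   component_logs kube-apiserver
#   component_logs etcd 100
```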
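2. Next, look at the etcd startup log. It reports the error mvcc: cannot unmarshal event: proto: wrong wireType = 0 for field Key. According to what I found, this error means the etcd data was corrupted by an unclean shutdown (sudden power loss, pulling the plug). This node had indeed been shut down abnormally, so etcd cannot start, and fixing that should resolve the whole problem.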
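The two error signatures seen so far can be told apart mechanically. This is only an illustrative sketch: `classify_log` and its output labels are invented here, and it knows nothing beyond the two messages quoted in steps 1 and 2.

```shell
# classify_log
# Reads log text on stdin and prints a rough diagnosis label.
classify_log() {
  local text
  text=$(cat)
  case "$text" in
    *"mvcc: cannot unmarshal event"*)
      echo "etcd-data-corruption" ;;   # etcd itself cannot read its data
    *"127.0.0.1:2379: connect: connection refused"*)
      echo "etcd-unreachable" ;;       # a client (e.g. apiserver) cannot reach etcd
    *)
      echo "unknown" ;;
  esac
}

# Usage (with any command that prints a container log):
#   docker logs <container-id> 2>&1 | classify_log
```

The corruption check comes first because it points at the underlying cause, while the connection error is only a symptom.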

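3. Following the guidance, stop the etcd service on the failed node and delete the corrupted etcd data. In this case etcd is already down. Back up the data before deleting it, and start etcd again afterwards.
The container data is under the /var/lib directory; proceed as shown in the figure below.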
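As a rough sketch of the backup-then-delete step, assuming the kubeadm default data directory /var/lib/etcd with the data itself in its member subdirectory: the helper name `backup_and_clear_etcd` and the backup path in the usage line are made up for illustration. It refuses to delete anything unless the copy succeeded.

```shell
# backup_and_clear_etcd DATA_DIR BACKUP_DIR
# Copies DATA_DIR to BACKUP_DIR, then removes DATA_DIR/member so that
# etcd starts with an empty data set. All existing cluster state is lost.
backup_and_clear_etcd() {
  local data_dir="$1" backup_dir="$2"
  if [ ! -d "$data_dir" ]; then
    echo "data directory not found: $data_dir" >&2
    return 1
  fi
  if [ -e "$backup_dir" ]; then
    echo "backup target already exists: $backup_dir" >&2
    return 1
  fi
  # Preserve ownership, modes and timestamps in the backup copy.
  cp -a "$data_dir" "$backup_dir" || return 1
  # Only reached when the backup succeeded.
  rm -rf "${data_dir:?}/member"
}

# Usage on the failed node (paths are assumptions, adjust as needed):
#   backup_and_clear_etcd /var/lib/etcd "/var/lib/etcd.bak-$(date +%Y%m%d)"
```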

4. Finally, start etcd first and then the api-server. After that, kubectl get nodes shows the node status normally and the problem is resolved.
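The apiserver may need a little time before it answers again, so a simple retry loop can stand in for rerunning kubectl by hand. This is a generic sketch; `wait_for` is a hypothetical helper, and the retry count and delay in the usage line are arbitrary.

```shell
# wait_for TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most TRIES times, sleeping
# DELAY_SECONDS between attempts. Returns 1 if it never succeeds.
wait_for() {
  local tries="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage: poll for up to about 150 seconds, then show the nodes.
#   wait_for 30 5 kubectl get nodes && kubectl get nodes
```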


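View the cluster nodes
Note that this method wipes everything stored in etcd, so all of your pods are lost. Deploying a multi-member etcd cluster is recommended.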
[root@k8s-master ~]# kubectl get nodes
NAME         STATUS   ROLES    AGE   VERSION
k8s-master   Ready    <none>   37s   v1.20.6
k8s-node1    Ready    <none>   37s   v1.20.6
k8s-node2    Ready    <none>   37s   v1.20.6
[root@k8s-master ~]# kubectl get nodes -n kube-system -o wide
NAME         STATUS   ROLES    AGE     VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION           CONTAINER-RUNTIME
k8s-master   Ready    <none>   2m17s   v1.20.6   192.168.110.21   <none>        CentOS Linux 7 (Core)   3.10.0-1062.el7.x86_64   docker://20.10.6
k8s-node1    Ready    <none>   2m17s   v1.20.6   192.168.110.22   <none>        CentOS Linux 7 (Core)   3.10.0-1062.el7.x86_64   docker://20.10.6
k8s-node2    Ready    <none>   2m17s   v1.20.6   192.168.110.23   <none>        CentOS Linux 7 (Core)   3.10.0-1062.el7.x86_64   docker://20.10.6
If you have a backup file, restore from it instead:
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key \
  --data-dir=/var/lib/etcd snapshot restore /wks/etcd-20220905.db
- Stop all API server instances
- Restore the state in all etcd instances
- Restart all API server instances
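On a single-master kubeadm cluster, the three steps above might look like the sketch below. Several things here are assumptions rather than facts from this incident: that etcd and the apiserver run as static pods whose manifests live in /etc/kubernetes/manifests, that etcdctl is installed on the host, and that waiting a fixed number of seconds is enough for the kubelet to stop the pods. The function name, the `.parked` directory and the `SETTLE_SECONDS` variable are invented for illustration. The old data directory is moved aside rather than deleted, since snapshot restore expects the target directory not to exist.

```shell
# restore_etcd_snapshot SNAPSHOT [DATA_DIR] [MANIFEST_DIR]
restore_etcd_snapshot() {
  local snapshot="$1"
  local data_dir="${2:-/var/lib/etcd}"
  local manifests="${3:-/etc/kubernetes/manifests}"
  local parked="${manifests}.parked"

  if [ ! -f "$snapshot" ]; then
    echo "snapshot not found: $snapshot" >&2
    return 1
  fi

  # 1. Stop the apiserver and etcd: the kubelet removes a static pod
  #    once its manifest disappears from the manifest directory.
  mkdir -p "$parked" || return 1
  mv "$manifests/kube-apiserver.yaml" "$manifests/etcd.yaml" "$parked/" || return 1
  sleep "${SETTLE_SECONDS:-20}"   # crude wait; confirm with `docker ps` if unsure

  # 2. Restore the snapshot into a fresh data directory, keeping the old one.
  if [ -e "$data_dir" ]; then
    mv "$data_dir" "${data_dir}.old-$(date +%Y%m%d%H%M%S)" || return 1
  fi
  ETCDCTL_API=3 etcdctl snapshot restore "$snapshot" --data-dir="$data_dir" || return 1

  # 3. Bring etcd and the apiserver back.
  mv "$parked/etcd.yaml" "$parked/kube-apiserver.yaml" "$manifests/"
}

# Usage, with the snapshot path from the command above:
#   restore_etcd_snapshot /wks/etcd-20220905.db
```

If the restore step fails, the manifests stay in the `.parked` directory, so nothing restarts against a half-restored data directory.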
