
Kubernetes Cluster Failure Recovery

Preliminary note: this k8s cluster is a test cluster.

Error 1:

Troubleshooting:

Check the kube-apiserver service status:
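Since kubeadm runs kube-apiserver as a static pod under the kubelet rather than as a systemd service, a minimal way to inspect it, assuming the docker/cri-dockerd runtime used in this cluster, is:

[root@master ~]# systemctl status kubelet
[root@master ~]# docker ps -a --filter name=kube-apiserver
[root@master ~]# journalctl -u kubelet --no-pager | tail -n 50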

From this we can see that the host has two container runtimes in play, containerd and cri-dockerd (the latter fronting Docker), so two CRI sockets are involved: unix:///run/containerd/containerd.sock and unix:///var/run/cri-dockerd.sock
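This matters because recent kubeadm releases (1.24 and later) refuse to auto-detect the runtime when more than one CRI endpoint is present, which is why every kubeadm command below passes an explicit --cri-socket. A quick way to confirm both sockets exist:

[root@master ~]# ls -l /run/containerd/containerd.sock /var/run/cri-dockerd.sock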

Check the etcd service status:
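The same kind of check works for etcd, assuming it also runs as a static pod via cri-dockerd (substitute the container ID reported by docker ps):

[root@master ~]# docker ps -a --filter name=etcd
[root@master ~]# docker logs --tail 50 <etcd-container-id>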

etcd's data files were corrupted, which normally calls for a data restore; but this is a lab environment and no etcd backups had ever been taken, so the only option left was to reset the cluster.
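For reference, the backup that would have avoided this reset is a single etcdctl call. A sketch assuming a kubeadm stacked etcd with certificates under /etc/kubernetes/pki/etcd (the snapshot path is illustrative):

[root@master ~]# ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
>         --cacert=/etc/kubernetes/pki/etcd/ca.crt \
>         --cert=/etc/kubernetes/pki/etcd/server.crt \
>         --key=/etc/kubernetes/pki/etcd/server.key \
>         snapshot save /var/backups/etcd-snapshot.db

A snapshot taken this way can later be restored into a fresh data directory with etcdctl snapshot restore --data-dir /var/lib/etcd-restored.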

To reset, the following has to be run on every machine:
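A plausible teardown sequence, reconstructed from the reset mentioned later and the dual CRI sockets (kubeadm reset itself reminds you that CNI config and iptables rules must be cleaned up manually):

# kubeadm reset -f --cri-socket unix:///var/run/cri-dockerd.sock
# rm -rf /etc/cni/net.d $HOME/.kube/config
# iptables -F && iptables -t nat -F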

 

 

Then initialize the cluster.

Run on the master node:

[root@master ~]# kubeadm init --ignore-preflight-errors=SystemVerification --cri-socket unix:///var/run/cri-dockerd.sock
I0409 07:02:10.310074   11794 version.go:256] remote version is much newer: v1.29.3; falling back to: stable-1.28
[init] Using Kubernetes version: v1.28.8
[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables contents are not set to 1
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher

If you hit this error:

[ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables contents are not set to 1
Fix:
# echo "1" > /proc/sys/net/bridge/bridge-nf-call-iptables
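Writing into /proc only lasts until reboot; the standard persistent form (the file name k8s.conf is just a convention) is:

# modprobe br_netfilter
# cat <<EOF > /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF
# sysctl --system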

 

When kubeadm prints "Your Kubernetes control-plane has initialized successfully!" at the end of its output, the init has succeeded. Next, join node1 and node2 to the cluster:
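The join command (token plus CA cert hash) is printed at the end of the kubeadm init output; if it gets lost, it can be regenerated on the master at any time:

[root@master ~]# kubeadm token create --print-join-command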

 

 

[root@master ~]# mkdir -p $HOME/.kube
[root@master ~]#   sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
[root@master ~]#   sudo chown $(id -u):$(id -g) $HOME/.kube/config
[root@master ~]# kubectl get nodes
NAME     STATUS   ROLES           AGE   VERSION
master   Ready    control-plane   82s   v1.28.2
node1    Ready    <none>          17s   v1.28.2
node2    Ready    <none>          6s    v1.28.2

But after running kubectl apply -f kube-flannel.yml, flannel kept erroring:

 

Error registering network: failed to acquire lease: node "master" pod cidr not assigned

The error means the node has no pod CIDR assigned, so I reset again and re-ran init with the CIDR flags added:

[root@master ~]# kubeadm init --ignore-preflight-errors=SystemVerification --cri-socket unix:///var/run/cri-dockerd.sock --service-cidr=10.96.0.0/16 --pod-network-cidr=10.244.0.0/16
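The value 10.244.0.0/16 is not arbitrary: commonly distributed versions of kube-flannel.yml hard-code that network in the kube-flannel-cfg ConfigMap, so --pod-network-cidr has to match it (or the manifest has to be edited). The relevant fragment looks like:

  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }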

Then join node1 and node2 again:

[root@node1 ~]# kubeadm join 192.168.77.100:6443 --token caidkf.z5ygdrotujz09y1z \
>         --discovery-token-ca-cert-hash sha256:a753ca9b794a43912b3bfca5e52a788ca222e672a3630879b585f2eb841fc65e --cri-socket unix:///var/run/cri-dockerd.sock

[root@node2 ~]# kubeadm join 192.168.77.100:6443 --token caidkf.z5ygdrotujz09y1z \
>         --discovery-token-ca-cert-hash sha256:a753ca9b794a43912b3bfca5e52a788ca222e672a3630879b585f2eb841fc65e --cri-socket unix:///var/run/cri-dockerd.sock

Then set up the kubeconfig on the master node:

[root@master ~]# rm -rf  /root/.kube/*
[root@master ~]# mkdir -p $HOME/.kube
[root@master ~]#   sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
[root@master ~]#   sudo chown $(id -u):$(id -g) $HOME/.kube/config

The cluster's node status is now as follows:

[root@master ~]# kubectl get nodes
NAME     STATUS   ROLES           AGE   VERSION
master   Ready    control-plane   59s   v1.28.2
node1    Ready    <none>          22s   v1.28.2
node2    Ready    <none>          27s   v1.28.2

 

If the following error appears anywhere along the way:

[root@master ~]# kubectl get nodes
Unable to connect to the server: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")

Fix:

Remove the stale kubeconfig first, then copy the new one (after a re-init the cluster has a new CA, so credentials in the old $HOME/.kube/config no longer validate):

[root@master ~]# rm -rf  /root/.kube/*
[root@master ~]# mkdir -p $HOME/.kube
[root@master ~]#   sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
[root@master ~]#   sudo chown $(id -u):$(id -g) $HOME/.kube/config
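A quick alternative that skips the copy entirely (standard kubectl behavior) is to point KUBECONFIG straight at the freshly generated admin config:

[root@master ~]# export KUBECONFIG=/etc/kubernetes/admin.conf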

 

References:

https://zhuanlan.zhihu.com/p/646238661

https://blog.csdn.net/qq_40460909/article/details/114707380

https://blog.csdn.net/qq_21127151/article/details/120929170

 
