狂自私

Expired K8s cluster certificates kept the api-server from starting

Environment:

Welcome to Ubuntu 20.04.3 LTS (GNU/Linux 5.4.0-215-generic x86_64)
K8s v1.23.4

Symptom

A developer asked me today what was going on here:

bzkj@master-1:~$ kubectl get all
The connection to the server master-1:6443 was refused - did you specify the right host or port?

Troubleshooting

Port 6443 is the Kubernetes API server port, so let's check the state of the api-server:

bzkj@master-1:~$ sudo crictl pods --namespace kube-system | grep kube-apiserver
I0522 11:05:44.122834 1480897 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/run/containerd/containerd.sock" URL="unix:///run/containerd/containerd.sock"
42ac3c004d87b       10 days ago         Ready               kube-apiserver-master-1                    kube-system         12                  (default)

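It says Ready, so it looks fine (although crictl pods reports the state of the pod sandbox, not of the containers inside it). Let's check the log output: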

bzkj@master-1:~$ sudo journalctl -u kubelet -f | grep kube-apiserver
May 22 11:06:08 master-1 kubelet[738488]: E0522 11:06:08.486887  738488 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-apiserver\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-master-1_kube-system(36ea8f7ae6c2f0edb761b0d54141bbd7)\"" pod="kube-system/kube-apiserver-master-1" podUID=36ea8f7ae6c2f0edb761b0d54141bbd7
May 22 11:06:20 master-1 kubelet[738488]: E0522 11:06:20.486171  738488 pod_workers.go:919] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-apiserver\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-master-1_kube-system(36ea8f7ae6c2f0edb761b0d54141bbd7)\"" pod="kube-system/kube-apiserver-master-1" podUID=36ea8f7ae6c2f0edb761b0d54141bbd7

Oh, damn. The container is actually in CrashLoopBackOff. Let's look at the apiserver pod's logs:

bzkj@master-1:~$ POD_ID=$(sudo crictl pods --namespace kube-system | grep kube-apiserver | awk '{print $1}')
I0522 11:07:48.201061 1481075 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/run/containerd/containerd.sock" URL="unix:///run/containerd/containerd.sock"
bzkj@master-1:~$ sudo crictl logs $POD_ID
I0522 11:07:51.248795 1481080 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/run/containerd/containerd.sock" URL="unix:///run/containerd/containerd.sock"
E0522 11:07:51.300633 1481080 remote_runtime.go:415] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"42ac3c004d87b\": not found" containerID="42ac3c004d87b"
FATA[0000] rpc error: code = NotFound desc = an error occurred when try to find container "42ac3c004d87b": not found

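Damn, nothing to see. Why can't it find 42ac3c004d87b? Because that is the pod sandbox ID, while crictl logs expects a container ID. On top of that, the apiserver container keeps restarting, so its container ID changes with every restart. So let's go for the log on the host instead, and be quicker about it this time: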

bzkj@master-1:~$ CONTAINER_ID=$(sudo crictl ps -a kube-system | grep kube-apiserver | awk '{print $1}')
bzkj@master-1:~$ sudo crictl logs --previous $CONTAINER_ID
I0522 11:20:16.104289 1482432 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/run/containerd/containerd.sock" URL="unix:///run/containerd/containerd.sock"
FATA[0000] failed to try resolving symlinks in path "/var/log/pods/kube-system_kube-apiserver-master-1_36ea8f7ae6c2f0edb761b0d54141bbd7/kube-apiserver/3056.log": lstat /var/log/pods/kube-system_kube-apiserver-master-1_36ea8f7ae6c2f0edb761b0d54141bbd7/kube-apiserver/3056.log: no such file or directory

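Good, the error at least reveals the log file path. The file name tracks the container's restart count, which had already moved on to 3058 by the time I looked: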

bzkj@master-1:/var/log/pods$ sudo cat  kube-system_kube-apiserver-master-1_36ea8f7ae6c2f0edb761b0d54141bbd7/kube-apiserver/3058.log
2025-05-22T11:21:38.799137216+08:00 stderr F I0522 03:21:38.798980       1 server.go:565] external host was not specified, using 192.168.50.40
2025-05-22T11:21:38.799810875+08:00 stderr F I0522 03:21:38.799702       1 server.go:172] Version: v1.23.4
2025-05-22T11:21:39.28439832+08:00 stderr F I0522 03:21:39.284290       1 shared_informer.go:240] Waiting for caches to sync for node_authorizer
2025-05-22T11:21:39.285390374+08:00 stderr F I0522 03:21:39.285320       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
2025-05-22T11:21:39.285400631+08:00 stderr F I0522 03:21:39.285335       1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
2025-05-22T11:21:39.286395845+08:00 stderr F I0522 03:21:39.286350       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
2025-05-22T11:21:39.286401034+08:00 stderr F I0522 03:21:39.286360       1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
2025-05-22T11:21:39.292322232+08:00 stderr F W0522 03:21:39.292243       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2025-05-22T03:21:39Z is after 2025-04-29T09:26:22Z". Reconnecting...
2025-05-22T11:21:40.28973601+08:00 stderr F W0522 03:21:40.289361       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2025-05-22T03:21:40Z is after 2025-04-29T09:26:22Z". Reconnecting...
2025-05-22T11:21:40.296484042+08:00 stderr F W0522 03:21:40.296228       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2025-05-22T03:21:40Z is after 2025-04-29T09:26:22Z". Reconnecting...
2025-05-22T11:21:41.298408532+08:00 stderr F W0522 03:21:41.298005       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2025-05-22T03:21:41Z is after 2025-04-29T09:26:22Z". Reconnecting...
2025-05-22T11:21:42.121598484+08:00 stderr F W0522 03:21:42.121377       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2025-05-22T03:21:42Z is after 2025-04-29T09:26:22Z". Reconnecting...
2025-05-22T11:21:42.811164825+08:00 stderr F W0522 03:21:42.810663       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2025-05-22T03:21:42Z is after 2025-04-29T09:26:22Z". Reconnecting...
2025-05-22T11:21:44.714735338+08:00 stderr F W0522 03:21:44.714184       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2025-05-22T03:21:44Z is after 2025-04-29T09:26:22Z". Reconnecting...
2025-05-22T11:21:45.598608891+08:00 stderr F W0522 03:21:45.598075       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2025-05-22T03:21:45Z is after 2025-04-29T09:26:22Z". Reconnecting...
2025-05-22T11:21:48.692639849+08:00 stderr F W0522 03:21:48.692385       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2025-05-22T03:21:48Z is after 2025-04-29T09:26:22Z". Reconnecting...
2025-05-22T11:21:49.973040636+08:00 stderr F W0522 03:21:49.972921       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2025-05-22T03:21:49Z is after 2025-04-29T09:26:22Z". Reconnecting...
2025-05-22T11:21:55.393220026+08:00 stderr F W0522 03:21:55.393069       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2025-05-22T03:21:55Z is after 2025-04-29T09:26:22Z". Reconnecting...
2025-05-22T11:21:56.983728916+08:00 stderr F W0522 03:21:56.983605       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2025-05-22T03:21:56Z is after 2025-04-29T09:26:22Z". Reconnecting...
2025-05-22T11:21:59.290967676+08:00 stderr F E0522 03:21:59.289319       1 run.go:74] "command failed" err="context deadline exceeded"
bzkj@master-1:/var/log/pods$
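
Because the log file name changes with every restart, guessing it by hand is a race. A minimal sketch that picks the highest-numbered log in a container's log directory. The function name newest_container_log is made up for illustration, and the directory layout is assumed to match the path in the error message above:

```shell
# Print the path of the highest-numbered <N>.log file in a directory.
# Assumption: the kubelet names container logs <restartCount>.log, as the
# 3056.log / 3058.log files above suggest.
newest_container_log() (
  dir=$1
  # Keep only names of the form <digits>.log, sort numerically, take the last.
  ls "$dir" 2>/dev/null | grep -E '^[0-9]+\.log$' | sort -n | tail -n 1 |
    while read -r name; do printf '%s/%s\n' "$dir" "$name"; done
)

# Usage on the node (directory copied from the crictl error message):
# sudo cat "$(newest_container_log /var/log/pods/kube-system_kube-apiserver-master-1_36ea8f7ae6c2f0edb761b0d54141bbd7/kube-apiserver)"
```

If the directory is empty or unreadable the function prints nothing, so the caller can tell that no log was found.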

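That settles it: the certificates have expired. The apiserver cannot complete the TLS handshake with etcd on 127.0.0.1:2379 because the certificate expired on 2025-04-29T09:26:22Z, and after about 20 seconds of retries it gives up with "context deadline exceeded".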
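
The expiry date is buried in a wall of identical handshake warnings. A small filter, sketched here under the assumption that the log lines look like the ones above, reduces them to the distinct expiry timestamps (expired_since is a hypothetical helper name):

```shell
# Read log text on stdin and print each distinct expiry timestamp that
# follows the words "is after" in x509 errors, once.
expired_since() {
  grep -o 'is after [0-9T:Z-]*' | sort -u | sed 's/^is after //'
}

# Usage on the node:
# sudo cat /var/log/pods/.../kube-apiserver/3058.log | expired_since
```

The same date can be cross-checked against a certificate file, for example with sudo openssl x509 -noout -enddate -in /etc/kubernetes/pki/etcd/server.crt (kubeadm's default location for the etcd serving certificate, which is an assumption about this cluster).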

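Retrospective

I simply wasn't familiar with this. Otherwise the first thing to run would have been the following command, which shows the problem directly: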

bzkj@master-1:~$ sudo kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[check-expiration] Error reading configuration from the Cluster. Falling back to default configuration

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Apr 29, 2025 09:26 UTC   <invalid>       ca                      no
apiserver                  Apr 29, 2025 09:26 UTC   <invalid>       ca                      no
apiserver-etcd-client      Apr 29, 2025 09:26 UTC   <invalid>       etcd-ca                 no
apiserver-kubelet-client   Apr 29, 2025 09:26 UTC   <invalid>       ca                      no
controller-manager.conf    Apr 29, 2025 09:26 UTC   <invalid>       ca                      no
etcd-healthcheck-client    Apr 29, 2025 09:26 UTC   <invalid>       etcd-ca                 no
etcd-peer                  Apr 29, 2025 09:26 UTC   <invalid>       etcd-ca                 no
etcd-server                Apr 29, 2025 09:26 UTC   <invalid>       etcd-ca                 no
front-proxy-client         Apr 29, 2025 09:26 UTC   <invalid>       front-proxy-ca          no
scheduler.conf             Apr 29, 2025 09:26 UTC   <invalid>       ca                      no

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Apr 27, 2034 09:26 UTC   8y              no
etcd-ca                 Apr 27, 2034 09:26 UTC   8y              no
front-proxy-ca          Apr 27, 2034 09:26 UTC   8y              no
bzkj@master-1:~$

Look: every certificate shows <invalid> in the RESIDUAL TIME column. Only the three CAs are still valid (8y).
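
To turn that table into something a script or a cron job can act on, the rows marked <invalid> can be filtered out. A minimal sketch, assuming the column layout printed above (invalid_certs is a made-up name):

```shell
# Read `kubeadm certs check-expiration` output on stdin and print the name
# (first column) of every row whose RESIDUAL TIME is <invalid>.
invalid_certs() {
  awk '/<invalid>/ { print $1 }'
}

# Usage on the node:
# sudo kubeadm certs check-expiration | invalid_certs
```

An empty result means no certificate is reported as expired. It does not warn about certificates that are merely close to expiry.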

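Renewing the certificates

1. Back up the current certificates and configuration

Before changing anything, back up the existing certificates and configuration files, just in case.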

sudo cp -r /etc/kubernetes /etc/kubernetes.bak
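
A plain cp -r into a fixed name nests the copy inside the old backup if the command is run a second time, and it does not preserve file modes on the private keys. A hedged variant with a timestamped target and cp -a (backup_dir is a hypothetical helper, not something kubeadm provides):

```shell
# Copy a directory to <dir>.bak.<timestamp>, preserving modes, ownership and
# timestamps (cp -a), and print the backup path.
backup_dir() (
  src=${1%/}                                   # drop a trailing slash
  dest="$src.bak.$(date +%Y%m%d-%H%M%S)"
  # Refuse to copy into an existing backup.
  [ -e "$dest" ] && { echo "backup already exists: $dest" >&2; exit 1; }
  cp -a "$src" "$dest" && printf '%s\n' "$dest"
)

# Equivalent one-liner on the node (root is needed to read the private keys):
# sudo cp -a /etc/kubernetes "/etc/kubernetes.bak.$(date +%Y%m%d-%H%M%S)"
```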

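2. Renew the certificates with kubeadm

kubeadm can renew the certificates automatically, but by default the renewed certificates are valid for only one year. If you want a longer validity (for example 10 years), you have to handle the validity period yourself.

3. Renew automatically (default validity: 1 year)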

sudo kubeadm certs renew all

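This renews all certificates, including the API server's certificate.

4. Restart the kubelet

Once the certificates are renewed, restart the kubelet so the change takes effect: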

sudo systemctl restart kubelet

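After the restart, the Kubernetes components should load the new certificates. Keep in mind that kubeadm's output after the renewal asks for kube-apiserver, kube-controller-manager, kube-scheduler and etcd to be restarted. Restarting the kubelet does not by itself guarantee that static Pod containers that are already running get restarted.

5. Verify the cluster state

Confirm that the API server has started properly with: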

kubectl get nodes

If the command returns without errors, the certificate renewal has most likely succeeded.

Problems

Unauthorized

bzkj@master-1:/etc/kubernetes$ sudo kubeadm certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed

Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.
bzkj@master-1:/etc/kubernetes$ sudo systemctl restart kubelet
bzkj@master-1:/etc/kubernetes$ kubectl get nodes
The connection to the server master-1:6443 was refused - did you specify the right host or port?
bzkj@master-1:/etc/kubernetes$ kubectl get nodes
error: You must be logged in to the server (Unauthorized)

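Judging from the output, the certificates were renewed, yet kubectl still cannot use the API server: first the connection is refused, then it returns Unauthorized. This usually comes down to:

  1. The API server has not restarted yet, so the new certificates are not in effect (connection refused).
  2. The credentials in the kubeconfig file need updating (Unauthorized).

Step 1: Restart the Kubernetes control plane components

After the renewal I restarted the kubelet, but the control plane components (kube-apiserver, kube-controller-manager, kube-scheduler and so on) have not necessarily restarted, so they may still be running with the old certificates.

Because the cluster was deployed with kubeadm, the API server most likely runs as a static Pod whose manifest lives in /etc/kubernetes/manifests. The kubelet watches that directory and recreates a static Pod when its manifest changes, but it does not react when only a certificate file changes. You can confirm that the static Pod manifests are there with: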

ls /etc/kubernetes/manifests/

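You should see files such as kube-apiserver.yaml. Once that is confirmed, you can try restarting the kubelet again and then watch whether the API server container comes back up: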

sudo systemctl restart kubelet

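Deleting the kube-apiserver Pod with kubectl is sometimes suggested as another way to force a restart, but it does not help here: it needs a working API server, and for a static Pod it only removes the mirror Pod object without restarting the container. The kubeadm documentation instead recommends temporarily moving the manifest out of /etc/kubernetes/manifests, waiting about 20 seconds, and moving it back. For reference, the kubectl command is: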

kubectl delete pod -n kube-system -l component=kube-apiserver
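
For static Pods, one documented way to force a restart is to move the manifest out of the directory the kubelet watches and put it back once the kubelet has stopped the Pod. A minimal sketch, assuming an empty holding directory on the same filesystem and outside the manifest directory. The function name, the holding directory and the 20 second wait are illustrative choices:

```shell
# Move every *.yaml manifest out of the manifest directory, wait, then move
# the manifests back so the kubelet recreates the static Pods.
# Arguments: manifest dir, empty holding dir, seconds to wait.
bounce_static_pods() (
  manifests=$1 hold=$2 wait_s=$3
  mkdir -p "$hold" || exit 1
  for f in "$manifests"/*.yaml; do
    [ -e "$f" ] || continue          # no manifests matched the glob
    mv "$f" "$hold"/ || exit 1
  done
  sleep "$wait_s"                    # give the kubelet time to stop the Pods
  for f in "$hold"/*.yaml; do
    [ -e "$f" ] || continue
    mv "$f" "$manifests"/ || exit 1
  done
)

# Usage from a root shell on the control-plane node:
# bounce_static_pods /etc/kubernetes/manifests /etc/kubernetes/manifests.hold 20
```

This stops etcd and the whole control plane on that node for the duration of the wait, which is acceptable here only because the API server is already down.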

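Step 2: Make sure the new certificates are loaded

The command below follows the kubelet log rather than the kube-apiserver log itself, but it shows whether the kube-apiserver container is still failing to start: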

journalctl -u kubelet -f

Look for certificate-related errors in the log, or confirm that the API server has started successfully.
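
Rather than watching the log by eye, the API server can be polled until it answers. A generic retry helper as a sketch. The readiness URL in the usage comment assumes the default kubeadm setup, where /readyz can be queried without credentials:

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# Arguments: max attempts, delay in seconds, then the command and its args.
retry() {
  attempts=$1 delay=$2
  shift 2
  i=1
  while ! "$@"; do
    [ "$i" -ge "$attempts" ] && return 1   # give up after the last attempt
    i=$((i + 1))
    sleep "$delay"
  done
}

# Usage on the node: poll for up to 30 x 5 seconds.
# retry 30 5 curl -skf -o /dev/null https://master-1:6443/readyz && echo "api-server is up"
```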

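Step 3: Check the credentials in the kubeconfig file

Even once the API server has restarted and works normally, kubectl may still run into authentication problems. The kubeconfig file that kubectl uses embeds a client certificate, so it has to be updated to match the renewed one.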
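
To check whether a kubeconfig still carries the old client certificate, the embedded certificate can be extracted and inspected. A minimal sketch, assuming the certificate is stored inline as client-certificate-data (as in kubeadm's admin.conf) and that the file has a single user entry. kubeconfig_client_cert is a made-up name:

```shell
# Print the PEM client certificate embedded in a kubeconfig file.
# Only the first client-certificate-data entry is used.
kubeconfig_client_cert() {
  awk '$1 == "client-certificate-data:" { print $2; exit }' "$1" | base64 -d
}

# Usage: show the expiry date of the certificate kubectl is presenting.
# kubeconfig_client_cert ~/.kube/config | openssl x509 -noout -enddate
```

If the date printed for ~/.kube/config is still the old expiry while /etc/kubernetes/admin.conf shows a new one, the Unauthorized error is explained by the stale copy.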

You can update the credentials in the kubeconfig file by hand, or regenerate the file and replace it:

  1. Back up the existing kubeconfig file (usually ~/.kube/config):

    mv ~/.kube/config ~/.kube/config.bak
    
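  2. Regenerate the kubeconfig file
    Run the following command to regenerate the admin kubeconfig. Since kubeadm certs renew all has already renewed the certificate embedded in admin.conf (see its output above), this is normally only needed if that file is missing or damaged: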

    sudo kubeadm init phase kubeconfig admin
    

    This produces the admin.conf file at /etc/kubernetes/admin.conf, which you can then copy to ~/.kube/config.

  3. Copy the new kubeconfig file to the current user's ~/.kube/config

    sudo cp /etc/kubernetes/admin.conf ~/.kube/config
    

    Make sure the current user owns the file and can read it:

    sudo chown $(id -u):$(id -g) ~/.kube/config
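
The backup, copy and permission steps can be folded into one helper. This is a sketch with a hypothetical function name. It keeps the previous file as a .bak copy instead of moving it away, and restricts the new file to its owner because it contains a private key:

```shell
# Replace a kubeconfig with a fresh copy of another one, keeping the previous
# file as <dest>.bak and making the new file readable by its owner only.
# Arguments: source kubeconfig, destination kubeconfig.
install_kubeconfig() (
  src=$1 dest=$2
  mkdir -p "$(dirname "$dest")" || exit 1
  # Keep the old file, if any, next to the new one.
  [ -e "$dest" ] && { cp -p "$dest" "$dest.bak" || exit 1; }
  cp "$src" "$dest" && chmod 600 "$dest"
)

# Usage: admin.conf is readable by root only, so call this from a root shell
# with the user's path as the destination, then run the chown above.
# install_kubeconfig /etc/kubernetes/admin.conf /home/bzkj/.kube/config
```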
    

Step 4: Verify that the cluster has recovered

After the steps above, check again that kubectl can connect to the cluster:

kubectl get nodes

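If the API server has started properly and the kubeconfig file has been updated, kubectl should connect and list the cluster nodes.


Summary

  1. Make sure the Kubernetes control plane components (such as kube-apiserver) have restarted and loaded the new certificates.
  2. Update the kubeconfig file so that its credentials are valid.
  3. Use kubectl get nodes to verify the cluster state.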

Posted on 2025-05-22 13:20 by 狂自私