K8S - Notes on a bizarre problem: the Deployment creates no ReplicaSet, let alone a Pod
The fix first. It comes from this issue: https://github.com/kelseyhightower/kubernetes-the-hard-way/issues/354
"Moving the manifest file out of /etc/kubernetes/manifests, waiting a bit to check process has stopped, and putting it back works."
In other words: move the static pod manifests out of /etc/kubernetes/manifests/, wait a moment until the processes have stopped, then move them back.
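Concretely, that means bouncing the control-plane static pods on the affected node. A minimal sketch, assuming a kubeadm-style cluster with the manifests under /etc/kubernetes/manifests (the backup directory name is arbitrary; the full transcript of the real fix is at the end of this post):

# Run on the affected control-plane node (here: master-1)
mkdir ~/manifests-backup
sudo mv /etc/kubernetes/manifests/* ~/manifests-backup/
# Wait until kubelet has stopped the static pods; kube-apiserver goes down too,
# so kubectl will briefly report "connection refused"
sleep 60
sudo mv ~/manifests-backup/* /etc/kubernetes/manifests/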
Getting there, however, was anything but straightforward.
Background
A developer came to me saying the service had stopped responding. I said I'd take a look (inner monologue: how would I know, I'm not the one who deployed it!!!!).
The troubleshooting journey
Network
I tested the given URL with curl: indeed no response, just a timeout. My first guess was a network problem, so I went through the DNS configuration on the soft router and the Ingress and Service objects on the K8s side. All of it checked out fine.
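For the record, the checks boiled down to something like this (the URL is a placeholder, not the real one):

# Reproduce the symptom: no response, just a timeout
curl -v --max-time 10 http://example.internal/api/health

# Name resolution on the client side
dig example.internal

# Ingress / Service / Endpoints wiring on the K8s side
kubectl -n application get ingress
kubectl -n application get svc,endpoints | grep lpc-admin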
Pod
So it had to be the pod itself. I exec'd into the pod and, damn, there were no commands available at all, so I couldn't even check whether anything was listening on the port. I tried the predefined port anyway: no "connection refused", but still a timeout, so the port itself looked fine. Next suspect: container errors. Sure enough, the logs complained it couldn't reach Kafka on node-2. Wait, why had Kafka stopped?
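With no tooling inside the image, the in-pod port probe has to lean on shell built-ins; roughly like this (port 8080 comes from the Deployment spec shown further down, and /dev/tcp only works if the shell is actually bash, which is an assumption here):

# Shell into the pod (the image ships almost nothing, so built-ins are all we have)
kubectl -n application exec -it lpc-admin-management-service-6fc4ff5c6c-tqjf4 -- sh

# Inside the container: /dev/tcp is a bash feature, so this only works if sh is bash
(echo > /dev/tcp/127.0.0.1/8080) && echo "port 8080 is listening"

# Back outside, the container logs are what finally pointed at Kafka
kubectl -n application logs lpc-admin-management-service-6fc4ff5c6c-tqjf4 --tail=100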
Just as I was about to bring Kafka back up, the developer told me this pod was running an old image, not the latest, and the new version no longer depends on Kafka at all.
Fine, let's just update the image then. But when I opened the Deployment, something was off: the Deployment already referenced the latest image tag, yet it didn't match the image the pod was actually running.
What the hell, something spooky was going on.
Docker image mismatch
bzkj@master-1:~$ kubectl -n application describe pod lpc-admin-management-service-6fc4ff5c6c-tqjf4
Name:         lpc-admin-management-service-6fc4ff5c6c-tqjf4
Namespace:    application
Priority:     0
Node:         worker-1/192.168.50.30
Start Time:   Mon, 21 Apr 2025 08:33:14 +0800
Labels:       app=lpc-admin-management-service
              pod-template-hash=6fc4ff5c6c
Annotations:  cni.projectcalico.org/containerID: ca1d394599d4708367c9074dd05630999e1cd0847b4476ee9ffed4ce95945e6e
              cni.projectcalico.org/podIP: 10.244.226.79/32
              cni.projectcalico.org/podIPs: 10.244.226.79/32
Status:       Running
IP:           10.244.226.79
IPs:
  IP:  10.244.226.79
Controlled By:  ReplicaSet/lpc-admin-management-service-6fc4ff5c6c
Containers:
  lpc-admin-management-service:
    Container ID:   containerd://7fd7ac06dd7b2080882305f503c6b81bade11f5c5ae5868ada244905483ae25e
    Image:          harbor.bozhi.tech/luzhou-production-control/lpc-admin-management-service:TEST_20250418_282
bzkj@master-1:~$ kubectl -n application get deployments lpc-admin-management-service -o yaml | grep -i Image
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"name":"lpc-admin-management-service","namespace":"application"},"spec":{"replicas":1,"selector":{"matchLabels":{"app":"lpc-admin-management-service"}},"template":{"metadata":{"labels":{"app":"lpc-admin-management-service"}},"spec":{"containers":[{"image":"harbor.bozhi.tech/luzhou-production-control/lpc-admin-management-service:TEST_20250528_287","imagePullPolicy":"Always","name":"lpc-admin-management-service","ports":[{"containerPort":8080}],"volumeMounts":[{"mountPath":"/files","name":"nfs-storage"}]}],"volumes":[{"name":"nfs-storage","persistentVolumeClaim":{"claimName":"luzhou-production-control-pvc"}}]}}}}
      - image: harbor.bozhi.tech/luzhou-production-control/lpc-admin-management-service:TEST_20250528_287
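A quicker way to put the two tags side by side, same objects as above:

# Image the Deployment asks for (TEST_20250528_287)
kubectl -n application get deployment lpc-admin-management-service \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

# Image the running pod actually uses (TEST_20250418_282)
kubectl -n application get pod lpc-admin-management-service-6fc4ff5c6c-tqjf4 \
  -o jsonpath='{.spec.containers[0].image}{"\n"}'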
Very strange, but with no explanation yet I decided to attack the symptom directly first. I asked where the Docker image lived; the developer said he had no idea, they build everything with Jenkins. Hahaha... and as a clueless ops guy, I don't know Jenkins at all.
I found it eventually, though. I logged into Jenkins, dug through the build history logs, and located the docker commands it had executed. Since Jenkins had already cleaned the image up, I rebuilt it with the same commands and pushed it to Harbor. As a side note, they had also lost the Harbor password, which left me speechless.
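Roughly what the rebuild, the push, and the manual node-side import mentioned next looked like; the build context and tar file name are placeholders reconstructed from the Jenkins job, and the node runs containerd (per the Container ID shown above):

# Rebuild the image Jenkins had already cleaned up, then push it to Harbor
docker build -t harbor.bozhi.tech/luzhou-production-control/lpc-admin-management-service:TEST_20250528_287 .
docker push harbor.bozhi.tech/luzhou-production-control/lpc-admin-management-service:TEST_20250528_287

# Manual import on the node (containerd keeps Kubernetes images in the k8s.io namespace)
docker save harbor.bozhi.tech/luzhou-production-control/lpc-admin-management-service:TEST_20250528_287 -o lpc-admin.tar
sudo ctr -n k8s.io images import lpc-admin.tar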
Then I manually imported the image onto the K8s node and force-restarted the Deployment:
bzkj@master-1:~$ kubectl -n application rollout restart deployment/lpc-admin-management-service
deployment.apps/lpc-admin-management-service restarted
bzkj@master-1:~$ kubectl -n application get pod
NAME                                            READY   STATUS             RESTARTS       AGE
lpc-admin-management-service-6fc4ff5c6c-tqjf4   1/1     Running            22 (19d ago)   37d
rpa-data-collect-meishan-765bd77d54-2cbk7       0/1     ImagePullBackOff   13 (37d ago)   256d
bzkj@master-1:~$
Looked fine, except, no! No! Nothing had actually been updated.
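What a successful restart should produce is a new ReplicaSet with a fresh pod-template-hash; a quick way to confirm whether that ever happened:

kubectl -n application rollout status deployment/lpc-admin-management-service --timeout=60s
kubectl -n application get rs -l app=lpc-admin-management-service
# Still only the old 6fc4ff5c6c ReplicaSet and its old pod: the restart never took effect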
Fine, let's just delete the pod directly:
bzkj@master-1:~$ kubectl -n application delete pod lpc-admin-management-service-6fc4ff5c6c-tqjf4
pod "lpc-admin-management-service-6fc4ff5c6c-tqjf4" deleted
bzkj@master-1:~$ kubectl -n application get deployment
NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
lpc-admin-management-service   1/1     1            1           385d
rpa-data-collect-meishan       0/1     1            0           322d
bzkj@master-1:~$ kubectl -n application get pod
NAME                                        READY   STATUS             RESTARTS       AGE
rpa-data-collect-meishan-765bd77d54-2cbk7   0/1     ImagePullBackOff   13 (37d ago)   256d
bzkj@master-1:~$ kubectl -n application get pod
NAME                                        READY   STATUS             RESTARTS       AGE
rpa-data-collect-meishan-765bd77d54-2cbk7   0/1     ImagePullBackOff   13 (37d ago)   256d
bzkj@master-1:~$ kubectl -n application get pod
NAME                                        READY   STATUS             RESTARTS       AGE
rpa-data-collect-meishan-765bd77d54-2cbk7   0/1     ImagePullBackOff   13 (37d ago)   256d
bzkj@master-1:~$ kubectl -n application get pod
NAME                                        READY   STATUS             RESTARTS       AGE
rpa-data-collect-meishan-765bd77d54-2cbk7   0/1     ImagePullBackOff   13 (37d ago)   256d
bzkj@master-1:~$
Holy crap, the pod was simply gone and no new pod was ever created.
A problem with the Deployment?
Check the ReplicaSets:
bzkj@master-1:~$ kubectl -n application get rs | grep lpc-admin
lpc-admin-management-service-589fb58f5b   0   0   0   40d
lpc-admin-management-service-5994b968d4   0   0   0   40d
lpc-admin-management-service-59cd87ff54   0   0   0   40d
lpc-admin-management-service-67c8b65dcc   0   0   0   40d
lpc-admin-management-service-6f9fbd864f   0   0   0   40d
lpc-admin-management-service-6fc4ff5c6c   1   1   1   39d
lpc-admin-management-service-74b5d7d74    0   0   0   40d
lpc-admin-management-service-79fccbc4c5   0   0   0   40d
lpc-admin-management-service-7bbb6d58ff   0   0   0   40d
lpc-admin-management-service-85f66c88b    0   0   0   47d
lpc-admin-management-service-9d8df5ccc    0   0   0   40d
bzkj@master-1:~$
Odd: why is the active one, lpc-admin-management-service-6fc4ff5c6c, not the newest ReplicaSet?
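As an aside, those zero-replica ReplicaSets are just the Deployment's rollout history (kept per revisionHistoryLimit, 10 by default), and each one records its revision number in an annotation, so the ordering can be checked directly:

kubectl -n application rollout history deployment/lpc-admin-management-service

# Map a ReplicaSet back to its revision number
kubectl -n application get rs lpc-admin-management-service-6fc4ff5c6c \
  -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}{"\n"}'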
Whatever, delete all the redundant ReplicaSets:
bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-589fb58f5b
kubectl -n application delete rs lpc-admin-management-service-67c8b65dcc
kubectl -n application delete rs lpc-admin-management-service-6f9fbd864f
kubectl -n application delete rs lpc-admin-management-service-74b5d7d74
kubectl -n application delete rs lpc-admin-management-service-79fccbc4c5
kubectl -n application delete rs lpc-admin-management-service-7bbb6d58ff
kubectl -n application delete rs lpc-admin-management-service-85f66c88b
kubectl -n application delete rs lpc-admin-management-service-9d8df5ccc
replicaset.apps "lpc-admin-management-service-589fb58f5b" deleted
bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-5994b968d4
replicaset.apps "lpc-admin-management-service-5994b968d4" deleted
bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-59cd87ff54
replicaset.apps "lpc-admin-management-service-59cd87ff54" deleted
bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-67c8b65dcc
replicaset.apps "lpc-admin-management-service-67c8b65dcc" deleted
bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-6f9fbd864f
replicaset.apps "lpc-admin-management-service-6f9fbd864f" deleted
bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-74b5d7d74
replicaset.apps "lpc-admin-management-service-74b5d7d74" deleted
bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-79fccbc4c5
replicaset.apps "lpc-admin-management-service-79fccbc4c5" deleted
bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-7bbb6d58ff
replicaset.apps "lpc-admin-management-service-7bbb6d58ff" deleted
bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-85f66c88b
replicaset.apps "lpc-admin-management-service-85f66c88b" deleted
bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-9d8df5ccc
replicaset.apps "lpc-admin-management-service-9d8df5ccc" deleted
bzkj@master-1:~$ kubectl -n application get rs
NAME                                      DESIRED   CURRENT   READY   AGE
lpc-admin-management-service-6fc4ff5c6c   1         1         1       39d
rpa-data-collect-meishan-545fcc6696       0         0         0       312d
rpa-data-collect-meishan-54fdc7745c       0         0         0       314d
rpa-data-collect-meishan-587b66d649       0         0         0       312d
rpa-data-collect-meishan-64bf4fc5bb       0         0         0       312d
rpa-data-collect-meishan-66f645fc7b       0         0         0       313d
rpa-data-collect-meishan-68b6f69b87       0         0         0       314d
rpa-data-collect-meishan-765bd77d54       1         1         0       261d
rpa-data-collect-meishan-77ff69b9ff       0         0         0       264d
rpa-data-collect-meishan-866d4594b9       0         0         0       313d
rpa-data-collect-meishan-cb596d659        0         0         0       312d
rpa-data-collect-meishan-d97596dbb        0         0         0       313d
bzkj@master-1:~$ kubectl -n application get pod
NAME                                        READY   STATUS    RESTARTS       AGE
rpa-data-collect-meishan-765bd77d54-2cbk7   1/1     Running   14 (37d ago)   256d
bzkj@master-1:~$ kubectl -n application get pod
NAME                                        READY   STATUS    RESTARTS       AGE
rpa-data-collect-meishan-765bd77d54-2cbk7   1/1     Running   14 (37d ago)   256d
bzkj@master-1:~$ kubectl -n application get pod
NAME                                        READY   STATUS    RESTARTS       AGE
rpa-data-collect-meishan-765bd77d54-2cbk7   1/1     Running   14 (37d ago)   256d
bzkj@master-1:~$
Still nothing. What if I recreate the Deployment from scratch?
bzkj@master-1:~$ kubectl -n application get deployment lpc-admin-management-service -o yaml > lpc-admin-management-service.yml
bzkj@master-1:~$ cp lpc-admin-management-service.yml lpc-admin-management-service_new.tml
bzkj@master-1:~$ mv lpc-admin-management-service_new.tml lpc-admin-management-service_new.yml
bzkj@master-1:~$ vim lpc-admin-management-service_new.yml
bzkj@master-1:~$ kubectl -n application delete deployment lpc-admin-management-service
deployment.apps "lpc-admin-management-service" deleted
bzkj@master-1:~$ kubectl -n application get deployment
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
rpa-data-collect-meishan   0/1     1            0           322d
bzkj@master-1:~$ kubectl -n application apply -f lpc-admin-management-service_new.yml
deployment.apps/lpc-admin-management-service created
bzkj@master-1:~$ kubectl -n application get deployment
NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
lpc-admin-management-service   0/1     0            0           3s
rpa-data-collect-meishan       0/1     1            0           322d
bzkj@master-1:~$ kubectl -n application get pod
NAME                                        READY   STATUS    RESTARTS       AGE
rpa-data-collect-meishan-765bd77d54-2cbk7   1/1     Running   14 (37d ago)   256d
bzkj@master-1:~$ kubectl -n application get pod
NAME                                        READY   STATUS    RESTARTS       AGE
rpa-data-collect-meishan-765bd77d54-2cbk7   1/1     Running   14 (37d ago)   256d
bzkj@master-1:~$
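A side note on the export step: a YAML dumped with kubectl get -o yaml drags along server-populated fields (status, resourceVersion, uid, creationTimestamp). Here they were trimmed by hand in vim; a scripted cleanup could look like this, assuming yq v4 is available, which is an assumption on my part:

kubectl -n application get deployment lpc-admin-management-service -o yaml > lpc-admin-management-service_new.yml
# Strip the server-side fields before re-applying
yq -i 'del(.status) | del(.metadata.resourceVersion) | del(.metadata.uid) | del(.metadata.creationTimestamp) | del(.metadata.generation)' lpc-admin-management-service_new.yml
kubectl -n application apply -f lpc-admin-management-service_new.yml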
What?!! I was floored: even that didn't work. So it had to be a control-plane problem.
Control plane
bzkj@master-1:~$ kubectl get pods -n kube-system -l component=kube-controller-manager
NAME                               READY   STATUS    RESTARTS       AGE
kube-controller-manager-master-1   1/1     Running   21 (16d ago)   393d
bzkj@master-1:~$ kubectl get pods -n kube-system -l component=kube-controller-manager
NAME                               READY   STATUS    RESTARTS       AGE
kube-controller-manager-master-1   1/1     Running   21 (16d ago)   393d
bzkj@master-1:~$ kubectl -n kube-system logs kube-controller-manager-master-1
E0527 00:47:08.098716       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0527 00:47:12.442945       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0527 00:47:16.099276       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0527 00:47:19.772645       1 leaderelection.go:330] error retrieving resource l
......
E0528 05:46:37.449440       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0528 05:46:41.470394       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0528 05:46:44.264729       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0528 05:46:46.767760       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0528 05:46:48.976908       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0528 05:46:51.604018       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0528 05:46:53.842804       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0528 05:46:56.282280       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
Damn, the control plane had been throwing errors the whole time.
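The issue linked below mentions kube-scheduler as well, so it is worth checking it the same way; it authenticates with its own kubeconfig (/etc/kubernetes/scheduler.conf on kubeadm clusters), and the static pod name below follows the usual component-nodename convention, which I am assuming here:

kubectl -n kube-system logs --tail=20 kube-controller-manager-master-1
kubectl -n kube-system logs --tail=20 kube-scheduler-master-1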
Root cause found:
The Kubernetes control plane (specifically kube-controller-manager) had lost access to the API server because its credentials were no longer valid, so it could not do its job. A Deployment that creates neither ReplicaSets nor Pods is exactly the consequence of that.
Then it hit me: I had renewed the cluster certificates the week before!!
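That lined up: on a kubeadm cluster, renewing the certificates also rewrites the client credentials embedded in files like /etc/kubernetes/controller-manager.conf, but the running static pods keep using whatever they loaded at startup until they are restarted. A way to check the certificate state on the control-plane node (the exact kubeadm subcommand depends on the version, so treat this as a sketch):

# Expiry of all kubeadm-managed certificates, including the ones embedded
# in controller-manager.conf and scheduler.conf
sudo kubeadm certs check-expiration

# Or inspect the controller-manager's client certificate directly
sudo grep 'client-certificate-data' /etc/kubernetes/controller-manager.conf \
  | awk '{print $2}' | base64 -d | openssl x509 -noout -dates -subject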
A quick Google search turned up a matching issue on the official repo:
Problems with authorization of kube-controller-manager and kube-scheduler #354

Following the method described there, I solved the problem:
bzkj@master-1:~$ mkdir 20250528
bzkj@master-1:~$ cd 20250528
bzkj@master-1:~/20250528$ sudo mv /etc/kubernetes/manifests/* .
bzkj@master-1:~/20250528$ ll
total 24
drwxrwxr-x  2 bzkj bzkj 4096 May 28 14:09 ./
drwxr-xr-x 11 bzkj bzkj 4096 May 28 14:09 ../
-rw------- 1 root root 2248 Apr 29  2024 etcd.yaml
-rw------- 1 root root 4038 Apr 30  2024 kube-apiserver.yaml
-rw------- 1 root root 3544 Apr 29  2024 kube-controller-manager.yaml
-rw------- 1 root root 1464 Apr 29  2024 kube-scheduler.yaml
bzkj@master-1:~/20250528$ kubectl -n kube-system logs --tail=20 kube-controller-manager-master-1
The connection to the server master-1:6443 was refused - did you specify the right host or port?
bzkj@master-1:~/20250528$ sudo mv /etc/kubernetes/manifests/* .^C
bzkj@master-1:~/20250528$ sudo mv ./* /etc/kubernetes/manifests/
bzkj@master-1:~/20250528$ ls /etc/kubernetes/manifests/
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml
bzkj@master-1:~/20250528$ kubectl -n kube-system logs --tail=20 kube-controller-manager-master-1
I0528 06:12:18.955451       1 event.go:294] "Event occurred" object="application/rpa-data-collect-meishan-57998bfdb6" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: rpa-data-collect-meishan-57998bfdb6-ltnjv"
I0528 06:12:18.972110       1 event.go:294] "Event occurred" object="application/lpc-admin-management-service-79cb448cb8" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: lpc-admin-management-service-79cb448cb8-ts7xl"
I0528 06:12:19.035894       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:19.077489       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:19.659745       1 shared_informer.go:247] Caches are synced for garbage collector
I0528 06:12:19.659762       1 garbagecollector.go:155] Garbage collector: all resource monitors have synced. Proceeding to collect garbage
I0528 06:12:19.685275       1 shared_informer.go:247] Caches are synced for garbage collector
I0528 06:12:20.803481       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:21.452319       1 event.go:294] "Event occurred" object="application/rpa-data-collect-meishan" kind="Deployment" apiVersion="apps/v1" type="Normal" reason="ScalingReplicaSet" message="Scaled down replica set rpa-data-collect-meishan-765bd77d54 to 0"
I0528 06:12:21.487730       1 event.go:294] "Event occurred" object="application/rpa-data-collect-meishan-765bd77d54" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulDelete" message="Deleted pod: rpa-data-collect-meishan-765bd77d54-2cbk7"
I0528 06:12:26.047795       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:26.929053       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:27.935853       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:35.957399       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:35.983207       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:35.992803       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:35.993572       1 event.go:294] "Event occurred" object="kubesphere-system/openpitrix-import-job" kind="Job" apiVersion="batch/v1" type="Normal" reason="Completed" message="Job completed"
I0528 06:12:36.004219       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:36.029378       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:36.111154       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
bzkj@master-1:~/20250528$ kubectl get rs -n application
NAME                                      DESIRED   CURRENT   READY   AGE
lpc-admin-management-service-79cb448cb8   1         1         1       63s
rpa-data-collect-meishan-545fcc6696       0         0         0       312d
rpa-data-collect-meishan-57998bfdb6       1         1         1       63s
rpa-data-collect-meishan-587b66d649       0         0         0       312d
rpa-data-collect-meishan-64bf4fc5bb       0         0         0       312d
rpa-data-collect-meishan-66f645fc7b       0         0         0       313d
rpa-data-collect-meishan-68b6f69b87       0         0         0       314d
rpa-data-collect-meishan-765bd77d54       0         0         0       261d
rpa-data-collect-meishan-77ff69b9ff       0         0         0       264d
rpa-data-collect-meishan-866d4594b9       0         0         0       313d
rpa-data-collect-meishan-cb596d659        0         0         0       312d
rpa-data-collect-meishan-d97596dbb        0         0         0       313d
bzkj@master-1:~/20250528$ kubectl get pod -n application
NAME                                            READY   STATUS    RESTARTS   AGE
lpc-admin-management-service-79cb448cb8-ts7xl   1/1     Running   0          70s
rpa-data-collect-meishan-57998bfdb6-ltnjv       1/1     Running   0          70s
bzkj@master-1:~/20250528$
And with that, the problem was solved.