K8s: a bizarre problem where a Deployment creates no ReplicaSet, let alone Pods

The fix first, from this issue: https://github.com/kelseyhightower/kubernetes-the-hard-way/issues/354

Moving the manifest file out of /etc/kubernetes/manifests, waiting a bit to check process has stopped, and putting it back works.


Getting there was anything but straightforward.

Background

A developer came to me saying this service had stopped responding. I said I'd take a look (inner monologue: how would I know, I'm not the one who deployed it!!!).

The road to a fix

Network

I tested the given URL with curl: no response, just a timeout. My first thought was the network, so I checked the DNS config on the soft router, plus the Ingress, Service, and so on in K8s. All fine.
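For what it's worth, curl hangs can be made to fail fast with explicit timeouts. A small probe sketch (the URL below is a placeholder, not the real service address):

```shell
# probe URL: print the HTTP status code, failing fast instead of hanging.
# --connect-timeout bounds the TCP/TLS handshake; --max-time bounds the whole request.
probe() {
  curl -sS -o /dev/null -w '%{http_code}\n' --connect-timeout 5 --max-time 10 "$1"
}
# probe https://service.example.internal/health   # placeholder URL
```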

pod

So the pod itself, then. I exec'd into it and, damn it, the image has no tools at all, so I couldn't even check whether the port was being listened on. I tried the predefined port: no connection refused, but still a timeout, so the port itself was fine. Then the container must be erroring. Checked the logs: it couldn't reach Kafka on node-2. What the hell, why is Kafka down?
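An aside for images this bare: bash's /dev/tcp pseudo-device can stand in for a missing curl or netstat when probing a port. A sketch, assuming bash and coreutils' `timeout` exist in the image (they may not); the IP and port are this pod's from the describe output:

```shell
# probe_tcp HOST PORT: succeed if a TCP connection opens within 3 seconds.
# Uses only bash's /dev/tcp redirection, so no extra tools are needed;
# `timeout` (coreutils) guards against a filtered port that never answers.
probe_tcp() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}
# probe_tcp 10.244.226.79 8080 && echo open || echo "closed or filtered"
```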

Just as I was about to bring Kafka back up, the developer told me this pod was running an old image version; the latest release no longer depends on Kafka.

Fine, update the image then. But opening the Deployment, something was off: the image there was already the latest one, just not the same as what the pod was running.

What the hell is going on here.

Docker image mismatch

 

bzkj@master-1:~$ kubectl  -n application describe pod  lpc-admin-management-service-6fc4ff5c6c-tqjf4
Name:         lpc-admin-management-service-6fc4ff5c6c-tqjf4
Namespace:    application
Priority:     0
Node:         worker-1/192.168.50.30
Start Time:   Mon, 21 Apr 2025 08:33:14 +0800
Labels:       app=lpc-admin-management-service
              pod-template-hash=6fc4ff5c6c
Annotations:  cni.projectcalico.org/containerID: ca1d394599d4708367c9074dd05630999e1cd0847b4476ee9ffed4ce95945e6e
              cni.projectcalico.org/podIP: 10.244.226.79/32
              cni.projectcalico.org/podIPs: 10.244.226.79/32
Status:       Running
IP:           10.244.226.79
IPs:
  IP:           10.244.226.79
Controlled By:  ReplicaSet/lpc-admin-management-service-6fc4ff5c6c
Containers:
  lpc-admin-management-service:
    Container ID:   containerd://7fd7ac06dd7b2080882305f503c6b81bade11f5c5ae5868ada244905483ae25e
    Image:          harbor.bozhi.tech/luzhou-production-control/lpc-admin-management-service:TEST_20250418_282
bzkj@master-1:~$ kubectl  -n application get  deployments lpc-admin-management-service -o yaml |grep -i Image
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"name":"lpc-admin-management-service","namespace":"application"},"spec":{"replicas":1,"selector":{"matchLabels":{"app":"lpc-admin-management-service"}},"template":{"metadata":{"labels":{"app":"lpc-admin-management-service"}},"spec":{"containers":[{"image":"harbor.bozhi.tech/luzhou-production-control/lpc-admin-management-service:TEST_20250528_287","imagePullPolicy":"Always","name":"lpc-admin-management-service","ports":[{"containerPort":8080}],"volumeMounts":[{"mountPath":"/files","name":"nfs-storage"}]}],"volumes":[{"name":"nfs-storage","persistentVolumeClaim":{"claimName":"luzhou-production-control-pvc"}}]}}}}
      - image: harbor.bozhi.tech/luzhou-production-control/lpc-admin-management-service:TEST_20250528_287
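This kind of drift is easy to check mechanically. A sketch: the comparison helper is mine, and against a live cluster the two image refs would come from standard jsonpath queries (shown in the comments); the tags are the two from this incident:

```shell
# check_drift DEPLOY_IMG POD_IMG: report whether the two image refs match.
check_drift() {
  if [ "$1" = "$2" ]; then echo "in sync"; else echo "MISMATCH: $1 vs $2"; fi
}
# On the cluster, the two refs come from, e.g.:
#   kubectl -n application get deployment lpc-admin-management-service \
#     -o jsonpath='{.spec.template.spec.containers[0].image}'
#   kubectl -n application get pod lpc-admin-management-service-6fc4ff5c6c-tqjf4 \
#     -o jsonpath='{.spec.containers[0].image}'
check_drift \
  harbor.bozhi.tech/luzhou-production-control/lpc-admin-management-service:TEST_20250528_287 \
  harbor.bozhi.tech/luzhou-production-control/lpc-admin-management-service:TEST_20250418_282
# prints a MISMATCH line for this incident's two tags
```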

Bizarre. With the cause unclear, I tried the direct fix first. I asked where the docker image lived; the developer said he didn't know, they build with Jenkins. Great... and as a clueless ops guy, I don't know Jenkins.

But I found it in the end. I logged into Jenkins, dug through the build history logs, and found the docker commands it ran. Since Jenkins had auto-deleted the image, I rebuilt it with the same commands and pushed it to Harbor. Incidentally, they had lost the Harbor password, which left me speechless.

Then I manually imported the image onto the node and force-restarted the deployment:

bzkj@master-1:~$ kubectl -n application rollout restart deployment/lpc-admin-management-service
deployment.apps/lpc-admin-management-service restarted
bzkj@master-1:~$ kubectl  -n application get  pod
NAME                                            READY   STATUS             RESTARTS       AGE
lpc-admin-management-service-6fc4ff5c6c-tqjf4   1/1     Running            22 (19d ago)   37d
rpa-data-collect-meishan-765bd77d54-2cbk7       0/1     ImagePullBackOff   13 (37d ago)   256d
bzkj@master-1:~$

Looks fine. Except, no! No! It didn't update at all.

So I deleted the pod directly:

bzkj@master-1:~$ kubectl -n application delete pod lpc-admin-management-service-6fc4ff5c6c-tqjf4
pod "lpc-admin-management-service-6fc4ff5c6c-tqjf4" deleted
bzkj@master-1:~$ kubectl  -n application get  deployment
NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
lpc-admin-management-service   1/1     1            1           385d
rpa-data-collect-meishan       0/1     1            0           322d
bzkj@master-1:~$ kubectl  -n application get  pod
NAME                                        READY   STATUS             RESTARTS       AGE
rpa-data-collect-meishan-765bd77d54-2cbk7   0/1     ImagePullBackOff   13 (37d ago)   256d
bzkj@master-1:~$ kubectl  -n application get  pod
NAME                                        READY   STATUS             RESTARTS       AGE
rpa-data-collect-meishan-765bd77d54-2cbk7   0/1     ImagePullBackOff   13 (37d ago)   256d
bzkj@master-1:~$ kubectl  -n application get  pod
NAME                                        READY   STATUS             RESTARTS       AGE
rpa-data-collect-meishan-765bd77d54-2cbk7   0/1     ImagePullBackOff   13 (37d ago)   256d
bzkj@master-1:~$ kubectl  -n application get  pod
NAME                                        READY   STATUS             RESTARTS       AGE
rpa-data-collect-meishan-765bd77d54-2cbk7   0/1     ImagePullBackOff   13 (37d ago)   256d
bzkj@master-1:~$

What the hell: the pod is simply gone, and no new pod gets created.

A problem with the deployment?

Check the ReplicaSets:

bzkj@master-1:~$ kubectl -n application get rs | grep lpc-admin
lpc-admin-management-service-589fb58f5b   0         0         0       40d
lpc-admin-management-service-5994b968d4   0         0         0       40d
lpc-admin-management-service-59cd87ff54   0         0         0       40d
lpc-admin-management-service-67c8b65dcc   0         0         0       40d
lpc-admin-management-service-6f9fbd864f   0         0         0       40d
lpc-admin-management-service-6fc4ff5c6c   1         1         1       39d
lpc-admin-management-service-74b5d7d74    0         0         0       40d
lpc-admin-management-service-79fccbc4c5   0         0         0       40d
lpc-admin-management-service-7bbb6d58ff   0         0         0       40d
lpc-admin-management-service-85f66c88b    0         0         0       47d
lpc-admin-management-service-9d8df5ccc    0         0         0       40d
bzkj@master-1:~$

Odd. Why is [lpc-admin-management-service-6fc4ff5c6c] not the newest one?

Never mind; delete all the redundant ReplicaSets:

bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-589fb58f5b
kubectl -n application delete rs lpc-admin-management-service-67c8b65dcc
kubectl -n application delete rs lpc-admin-management-service-6f9fbd864f
kubectl -n application delete rs lpc-admin-management-service-74b5d7d74
kubectl -n application delete rs lpc-admin-management-service-79fccbc4c5
kubectl -n application delete rs lpc-admin-management-service-7bbb6d58ff
kubectl -n application delete rs lpc-admin-management-service-85f66c88b
kubectl -n application delete rs lpc-admin-management-service-9d8df5ccc
replicaset.apps "lpc-admin-management-service-589fb58f5b" deleted
bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-5994b968d4
replicaset.apps "lpc-admin-management-service-5994b968d4" deleted
bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-59cd87ff54
replicaset.apps "lpc-admin-management-service-59cd87ff54" deleted
bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-67c8b65dcc
replicaset.apps "lpc-admin-management-service-67c8b65dcc" deleted
bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-6f9fbd864f
replicaset.apps "lpc-admin-management-service-6f9fbd864f" deleted
bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-74b5d7d74
replicaset.apps "lpc-admin-management-service-74b5d7d74" deleted
bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-79fccbc4c5
replicaset.apps "lpc-admin-management-service-79fccbc4c5" deleted
bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-7bbb6d58ff
replicaset.apps "lpc-admin-management-service-7bbb6d58ff" deleted
bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-85f66c88b
replicaset.apps "lpc-admin-management-service-85f66c88b" deleted
bzkj@master-1:~$ kubectl -n application delete rs lpc-admin-management-service-9d8df5ccc
replicaset.apps "lpc-admin-management-service-9d8df5ccc" deleted
bzkj@master-1:~$ kubectl -n application get rs
NAME                                      DESIRED   CURRENT   READY   AGE
lpc-admin-management-service-6fc4ff5c6c   1         1         1       39d
rpa-data-collect-meishan-545fcc6696       0         0         0       312d
rpa-data-collect-meishan-54fdc7745c       0         0         0       314d
rpa-data-collect-meishan-587b66d649       0         0         0       312d
rpa-data-collect-meishan-64bf4fc5bb       0         0         0       312d
rpa-data-collect-meishan-66f645fc7b       0         0         0       313d
rpa-data-collect-meishan-68b6f69b87       0         0         0       314d
rpa-data-collect-meishan-765bd77d54       1         1         0       261d
rpa-data-collect-meishan-77ff69b9ff       0         0         0       264d
rpa-data-collect-meishan-866d4594b9       0         0         0       313d
rpa-data-collect-meishan-cb596d659        0         0         0       312d
rpa-data-collect-meishan-d97596dbb        0         0         0       313d
bzkj@master-1:~$ kubectl -n application get pod
NAME                                        READY   STATUS    RESTARTS       AGE
rpa-data-collect-meishan-765bd77d54-2cbk7   1/1     Running   14 (37d ago)   256d
bzkj@master-1:~$ kubectl -n application get pod
NAME                                        READY   STATUS    RESTARTS       AGE
rpa-data-collect-meishan-765bd77d54-2cbk7   1/1     Running   14 (37d ago)   256d
bzkj@master-1:~$ kubectl -n application get pod
NAME                                        READY   STATUS    RESTARTS       AGE
rpa-data-collect-meishan-765bd77d54-2cbk7   1/1     Running   14 (37d ago)   256d
bzkj@master-1:~$ 

Still no good. What if I recreate the deployment from scratch?

bzkj@master-1:~$ kubectl -n application get deployment lpc-admin-management-service -o yaml > lpc-admin-management-service.yml
bzkj@master-1:~$ cp lpc-admin-management-service.yml lpc-admin-management-service_new.tml
bzkj@master-1:~$ mv lpc-admin-management-service_new.tml lpc-admin-management-service_new.yml
bzkj@master-1:~$ vim lpc-admin-management-service_new.yml
bzkj@master-1:~$ kubectl -n application delete deployment lpc-admin-management-service
deployment.apps "lpc-admin-management-service" deleted
bzkj@master-1:~$ kubectl -n application get deployment
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
rpa-data-collect-meishan   0/1     1            0           322d
bzkj@master-1:~$ kubectl -n application apply -f lpc-admin-management-service_new.yml
deployment.apps/lpc-admin-management-service created
bzkj@master-1:~$ kubectl -n application get deployment
NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
lpc-admin-management-service   0/1     0            0           3s
rpa-data-collect-meishan       0/1     1            0           322d
bzkj@master-1:~$ kubectl -n application get pod
NAME                                        READY   STATUS    RESTARTS       AGE
rpa-data-collect-meishan-765bd77d54-2cbk7   1/1     Running   14 (37d ago)   256d
bzkj@master-1:~$ kubectl -n application get pod
NAME                                        READY   STATUS    RESTARTS       AGE
rpa-data-collect-meishan-765bd77d54-2cbk7   1/1     Running   14 (37d ago)   256d
bzkj@master-1:~$

What?!! Color me shocked: that doesn't work either. So it must be the control plane.

Control plane

bzkj@master-1:~$ kubectl get pods -n kube-system -l component=kube-controller-manager
NAME                               READY   STATUS    RESTARTS       AGE
kube-controller-manager-master-1   1/1     Running   21 (16d ago)   393d
bzkj@master-1:~$ kubectl get pods -n kube-system -l component=kube-controller-manager
NAME                               READY   STATUS    RESTARTS       AGE
kube-controller-manager-master-1   1/1     Running   21 (16d ago)   393d
bzkj@master-1:~$ kubectl -n kube-system logs kube-controller-manager-master-1
E0527 00:47:08.098716       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0527 00:47:12.442945       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0527 00:47:16.099276       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0527 00:47:19.772645       1 leaderelection.go:330] error retrieving resource l
......
E0528 05:46:37.449440       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0528 05:46:41.470394       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0528 05:46:44.264729       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0528 05:46:46.767760       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0528 05:46:48.976908       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0528 05:46:51.604018       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0528 05:46:53.842804       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized
E0528 05:46:56.282280       1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Unauthorized

Damn. The control plane has been throwing errors this whole time.

Root cause found:

The Kubernetes control plane (kube-controller-manager in particular) had lost its access to the API Server: its credentials were no longer valid, so it could not do its work. A Deployment failing to create ReplicaSets and Pods is exactly the consequence.

Then it hit me: I had renewed the cluster certificates last week!!
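Note for next time: after renewing certificates it is cheap to confirm what is actually on disk. A sketch, assuming a kubeadm layout under /etc/kubernetes/pki (and `kubeadm certs check-expiration` needs a reasonably recent kubeadm):

```shell
# cert_expiry FILE: print the notAfter date of an x509 certificate.
cert_expiry() {
  openssl x509 -noout -enddate -in "$1"
}
# On a kubeadm control-plane node:
#   cert_expiry /etc/kubernetes/pki/apiserver.crt
# kubeadm also has a built-in summary:
#   sudo kubeadm certs check-expiration
```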

A quick Google turned up a matching issue upstream:

Problems with authorization of kube-controller-manager and kube-scheduler #354

I followed the method there and it fixed the problem:

bzkj@master-1:~$ mkdir 20250528
bzkj@master-1:~$ cd 20250528
bzkj@master-1:~/20250528$ sudo mv /etc/kubernetes/manifests/* .
bzkj@master-1:~/20250528$ ll
total 24
drwxrwxr-x  2 bzkj bzkj 4096 May 28 14:09 ./
drwxr-xr-x 11 bzkj bzkj 4096 May 28 14:09 ../
-rw-------  1 root root 2248 Apr 29  2024 etcd.yaml
-rw-------  1 root root 4038 Apr 30  2024 kube-apiserver.yaml
-rw-------  1 root root 3544 Apr 29  2024 kube-controller-manager.yaml
-rw-------  1 root root 1464 Apr 29  2024 kube-scheduler.yaml
bzkj@master-1:~/20250528$ kubectl -n kube-system logs --tail=20 kube-controller-manager-master-1
The connection to the server master-1:6443 was refused - did you specify the right host or port?
bzkj@master-1:~/20250528$ sudo mv /etc/kubernetes/manifests/* .^C
bzkj@master-1:~/20250528$ sudo mv ./* /etc/kubernetes/manifests/
bzkj@master-1:~/20250528$ ls /etc/kubernetes/manifests/
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml
bzkj@master-1:~/20250528$ kubectl -n kube-system logs --tail=20 kube-controller-manager-master-1
I0528 06:12:18.955451       1 event.go:294] "Event occurred" object="application/rpa-data-collect-meishan-57998bfdb6" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: rpa-data-collect-meishan-57998bfdb6-ltnjv"
I0528 06:12:18.972110       1 event.go:294] "Event occurred" object="application/lpc-admin-management-service-79cb448cb8" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: lpc-admin-management-service-79cb448cb8-ts7xl"
I0528 06:12:19.035894       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:19.077489       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:19.659745       1 shared_informer.go:247] Caches are synced for garbage collector
I0528 06:12:19.659762       1 garbagecollector.go:155] Garbage collector: all resource monitors have synced. Proceeding to collect garbage
I0528 06:12:19.685275       1 shared_informer.go:247] Caches are synced for garbage collector
I0528 06:12:20.803481       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:21.452319       1 event.go:294] "Event occurred" object="application/rpa-data-collect-meishan" kind="Deployment" apiVersion="apps/v1" type="Normal" reason="ScalingReplicaSet" message="Scaled down replica set rpa-data-collect-meishan-765bd77d54 to 0"
I0528 06:12:21.487730       1 event.go:294] "Event occurred" object="application/rpa-data-collect-meishan-765bd77d54" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulDelete" message="Deleted pod: rpa-data-collect-meishan-765bd77d54-2cbk7"
I0528 06:12:26.047795       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:26.929053       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:27.935853       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:35.957399       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:35.983207       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:35.992803       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:35.993572       1 event.go:294] "Event occurred" object="kubesphere-system/openpitrix-import-job" kind="Job" apiVersion="batch/v1" type="Normal" reason="Completed" message="Job completed"
I0528 06:12:36.004219       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:36.029378       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
I0528 06:12:36.111154       1 job_controller.go:453] enqueueing job kubesphere-system/openpitrix-import-job
bzkj@master-1:~/20250528$ kubectl get rs -n application
NAME                                      DESIRED   CURRENT   READY   AGE
lpc-admin-management-service-79cb448cb8   1         1         1       63s
rpa-data-collect-meishan-545fcc6696       0         0         0       312d
rpa-data-collect-meishan-57998bfdb6       1         1         1       63s
rpa-data-collect-meishan-587b66d649       0         0         0       312d
rpa-data-collect-meishan-64bf4fc5bb       0         0         0       312d
rpa-data-collect-meishan-66f645fc7b       0         0         0       313d
rpa-data-collect-meishan-68b6f69b87       0         0         0       314d
rpa-data-collect-meishan-765bd77d54       0         0         0       261d
rpa-data-collect-meishan-77ff69b9ff       0         0         0       264d
rpa-data-collect-meishan-866d4594b9       0         0         0       313d
rpa-data-collect-meishan-cb596d659        0         0         0       312d
rpa-data-collect-meishan-d97596dbb        0         0         0       313d
bzkj@master-1:~/20250528$ kubectl get pod -n application
NAME                                            READY   STATUS    RESTARTS   AGE
lpc-admin-management-service-79cb448cb8-ts7xl   1/1     Running   0          70s
rpa-data-collect-meishan-57998bfdb6-ltnjv       1/1     Running   0          70s
bzkj@master-1:~/20250528$

And with that, the problem was solved.
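For reference, the manual shuffle above can be wrapped in a small script. This is a sketch only: the manifests path assumes a kubeadm install, and the 60-second default is a guess at how long the kubelet needs to tear the static pods down:

```shell
# restart_static_pods MANIFEST_DIR BACKUP_DIR [WAIT_SECONDS]
# Moves static-pod manifests out so the kubelet stops the pods, waits,
# then moves them back so the kubelet recreates them with fresh credentials.
restart_static_pods() {
  local dir="$1" backup="$2" wait="${3:-60}"
  mkdir -p "$backup"
  mv "$dir"/*.yaml "$backup"/
  sleep "$wait"
  mv "$backup"/*.yaml "$dir"/
}
# On a kubeadm control-plane node (run as root):
#   restart_static_pods /etc/kubernetes/manifests /root/manifests-backup 60
```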

 

posted on 2025-05-28 15:40 by 狂自私