k8s: deleting a namespace fails and it stays stuck in Terminating
* A namespace that could not be deleted because of human error
Background
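Here is the background. I have a K8S cluster that I use for testing, and I found that namespaces could no longer be deleted normally: they stayed in the Terminating state, and force deletion did not help either. I then manually created another namespace named "test-b", and it could not be deleted normally either. So I started investigating. In the end it turned out to be a self-inflicted blunder with no technical depth at all. The result is not important; what matters is that I want to share the process.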
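Investigation
- A normal namespace deletion blocks indefinitely and can only be interrupted with Ctrl+C
[root@k8s-b-master ~]# kubectl delete ns dev
namespace "dev" deleted
- The status stays at Terminating
[root@k8s-b-master ~]# kubectl get ns
dev   Terminating   18h
- Force deletion also blocks indefinitely and also has to be interrupted with Ctrl+C
[root@k8s-b-master ~]# kubectl delete namespace dev --grace-period=0 --force
- Looking at the detailed information turns up the following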
[root@k8s-b-master ~]# kubectl describe ns dev
Name:         test-b
Labels:       kubernetes.io/metadata.name=dev
Annotations:  <none>
Status:       Terminating
Conditions:
  Type                                         Status  LastTransitionTime               Reason                  Message
  ----                                         ------  ------------------               ------                  -------
  NamespaceDeletionDiscoveryFailure            True    Fri, 05 May 2023 14:06:52 +0800  DiscoveryFailed         Discovery failed for some groups, 1 failing: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
  NamespaceDeletionGroupVersionParsingFailure  False   Fri, 05 May 2023 14:06:52 +0800  ParsedGroupVersions     All legacy kube types successfully parsed
  NamespaceDeletionContentFailure              False   Fri, 05 May 2023 14:06:52 +0800  ContentDeleted          All content successfully deleted, may be waiting on finalization
  NamespaceContentRemaining                    False   Fri, 05 May 2023 14:06:52 +0800  ContentRemoved          All content successfully removed
  NamespaceFinalizersRemaining                 False   Fri, 05 May 2023 14:06:52 +0800  ContentHasNoFinalizers  All content-preserving finalizers finished

No resource quota.

No LimitRange resource.
The problem is here: DiscoveryFailed: Discovery failed for some groups, 1 failing: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
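The conditions that report a problem can also be pulled out directly instead of reading the whole describe output. The following is only a sketch: the function name failing_ns_conditions is made up for illustration, and it assumes kubectl is on the PATH and pointed at the affected cluster.

```shell
# Hypothetical helper (not part of the original troubleshooting session):
# print every namespace condition whose status is "True". For the namespace
# deletion conditions shown above, "True" means something is still in the way.
failing_ns_conditions() {
  # $1 = namespace name
  kubectl get namespace "$1" \
    -o jsonpath='{range .status.conditions[?(@.status=="True")]}{.type}: {.message}{"\n"}{end}'
}

# Usage (needs access to the cluster):
#   failing_ns_conditions dev
```

For the namespace above this would be expected to list only the NamespaceDeletionDiscoveryFailure condition, since all the others have status False.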
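Could something be wrong with the API Server? I kept digging.
- Check whether the API Server is running normally
[root@k8s-b-master ~]# kubectl get componentstatus
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS    MESSAGE                         ERROR
controller-manager   Healthy   ok
etcd-0               Healthy   {"health":"true","reason":""}
scheduler            Healthy   ok
Judging from the output, controller-manager, scheduler and etcd are all Healthy. The fact that the command was answered at all also shows that the API Server itself is responding.
- Next, check the API Server logs for errors or anomalies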
# Get the name of the API Server Pod:
[root@k8s-b-master ~]# kubectl get pods -n kube-system -l component=kube-apiserver -o name
pod/kube-apiserver-k8s-b-master
# View the API Server logs:
[root@k8s-b-master ~]# kubectl logs -n kube-system kube-apiserver-k8s-b-master
...
...
W0506 01:00:12.965627 1 handler_proxy.go:105] no RequestInfo found in the context
E0506 01:00:12.965711 1 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable , Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
I0506 01:00:12.965722 1 controller.go:129] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
W0506 01:00:12.968678 1 handler_proxy.go:105] no RequestInfo found in the context
E0506 01:00:12.968709 1 controller.go:113] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: Error, could not get list of group versions for APIService
I0506 01:00:12.968719 1 controller.go:126] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
I0506 01:00:43.794546 1 controller.go:616] quota admission added evaluator for: endpointslices.discovery.k8s.io
I0506 01:00:44.023629 1 controller.go:616] quota admission added evaluator for: endpoints
W0506 01:01:12.965985 1 handler_proxy.go:105] no RequestInfo found in the context
E0506 01:01:12.966062 1 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable , Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
I0506 01:01:12.966069 1 controller.go:129] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
W0506 01:01:12.969496 1 handler_proxy.go:105] no RequestInfo found in the context
E0506 01:01:12.969527 1 controller.go:113] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: Error, could not get list of group versions for APIService
...
...
# The rest of the log looks the same
...
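When the log is this repetitive, it can help to condense it into one line per failing API. The sketch below is an assumption-laden illustration: the function name summarize_openapi_failures is invented here, and it only recognises the "loading OpenAPI spec for ... failed" wording seen in the log above.

```shell
# Hypothetical helper: read API Server log lines on stdin and count how often
# the OpenAPI spec of each aggregated API failed to load.
summarize_openapi_failures() {
  # Keep only the quoted APIService name from matching lines, then count.
  sed -n 's/.*loading OpenAPI spec for "\([^"]*\)" failed.*/\1/p' \
    | sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Usage (needs access to the cluster):
#   kubectl logs -n kube-system kube-apiserver-k8s-b-master | summarize_openapi_failures
```

Each output line has the form "name: count", so a single misbehaving APIService stands out immediately.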
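Judging from the logs, the problem appears to be related to metrics.k8s.io/v1beta1, the API used to serve resource metrics for the Kubernetes cluster. Possibly the metrics server (metrics-server) behind it has failed and cannot answer the API Server's requests, so the API Server cannot handle requests for that API group.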
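Aggregated APIs such as metrics.k8s.io are registered as APIService objects, and kubectl get apiservices shows whether each one is available. As a rough sketch, the filter below keeps only the unavailable ones; the function name list_unavailable_apiservices is hypothetical, and it assumes the usual column order NAME, SERVICE, AVAILABLE, AGE.

```shell
# Hypothetical helper: read `kubectl get apiservices --no-headers` output on
# stdin and print the APIServices whose AVAILABLE column is not "True",
# together with the Service that is supposed to back them.
list_unavailable_apiservices() {
  awk '$3 != "True" { print $1 " -> " $2 }'
}

# Usage (needs access to the cluster):
#   kubectl get apiservices --no-headers | list_unavailable_apiservices
```

An empty result would mean every registered API group is reachable, in which case the discovery failure has to come from somewhere else.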
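- But k8s does not ship with metrics-server by default. Let's check anyway:
[root@k8s-b-master ~]# kubectl get pods -n kube-system -l k8s-app=metrics-server
No resources found in kube-system namespace.
No Pod labelled k8s-app=metrics-server exists in the kube-system namespace. That is expected, since K8S does not install the Metrics Server component by default. So why does the cluster depend on it now?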
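One way to answer that question is to ask the API Server which Service is registered behind the API group, rather than guessing at Pod labels. This is a minimal sketch, assuming the APIService is named v1beta1.metrics.k8s.io as in the log messages; the function name apiservice_backend is made up for illustration.

```shell
# Hypothetical helper: print "<namespace>/<service>" for the Service that an
# aggregated APIService forwards to.
apiservice_backend() {
  # $1 = APIService name, e.g. v1beta1.metrics.k8s.io
  kubectl get apiservice "$1" \
    -o jsonpath='{.spec.service.namespace}/{.spec.service.name}'
}

# Usage (needs access to the cluster):
#   apiservice_backend v1beta1.metrics.k8s.io
```

Whatever namespace and Service name this prints is where to look for the Pods that are supposed to be serving the API.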
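At this point I suddenly remembered that I had deployed kube-prometheus a while back. At the time kube-state-metrics failed to pull its image and never ran properly. Since this was a throwaway test environment I left it alone, and over time I forgot about it entirely.
So I took a look, and sure enough: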
[root@k8s-b-master ~]# kubectl get pod -n monitoring
NAME                                   READY   STATUS             RESTARTS       AGE
alertmanager-main-0                    2/2     Running            14 (30m ago)   5d19h
alertmanager-main-1                    2/2     Running            14 (29m ago)   5d19h
alertmanager-main-2                    2/2     Running            14 (29m ago)   5d19h
blackbox-exporter-69f4d86566-wn6q7     3/3     Running            21 (30m ago)   5d19h
grafana-56c4977497-2rjmt               1/1     Running            7 (29m ago)    5d19h
kube-state-metrics-56f8746666-lsps6    2/3     ImagePullBackOff   14 (29m ago)   5d19h   # image pull failed, so it never ran properly
node-exporter-d8c5k                    2/2     Running            14 (30m ago)   5d19h
node-exporter-gvfx2                    2/2     Running            14 (30m ago)   5d19h
node-exporter-gxccx                    2/2     Running            14 (29m ago)   5d19h
node-exporter-h292z                    2/2     Running            14 (30m ago)   5d19h
node-exporter-mztj6                    2/2     Running            14 (29m ago)   5d19h
node-exporter-rvfz6                    2/2     Running            14 (29m ago)   5d19h
node-exporter-twg9q                    2/2     Running            13 (29m ago)   5d19h
prometheus-adapter-77f56b865b-76nzk    0/1     ImagePullBackOff   0              5d19h
prometheus-adapter-77f56b865b-wbcwl    0/1     ImagePullBackOff   0              5d19h
prometheus-k8s-0                       2/2     Running            14 (30m ago)   5d19h
prometheus-k8s-1                       2/2     Running            14 (29m ago)   5d19h
prometheus-operator-788dd7cb76-85zwj   2/2     Running            14 (29m ago)   5d19h
I no longer need this environment anyway, so I tore the whole thing down:
[root@k8s-b-master ~]# kubectl delete -f kube-prometheus-main/manifests/
[root@k8s-b-master ~]# kubectl delete -f kube-prometheus-main/manifests/setup/
Checking the namespaces again, the test-b namespace could now be deleted normally as well. Problem solved.
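Removing the whole monitoring stack is what was done here. If the stack had to stay, a narrower option would probably be to delete only the broken APIService registration and then confirm that the namespace goes away. The sketch below shows that idea under stated assumptions: both function names are hypothetical, the APIService name is taken from the log messages, and the 2-second polling interval is arbitrary.

```shell
# Hypothetical helper: remove only the unavailable aggregated API registration.
remove_stale_apiservice() {
  # $1 = APIService name, e.g. v1beta1.metrics.k8s.io
  kubectl delete apiservice "$1"
}

# Hypothetical helper: poll until the namespace can no longer be fetched.
# Note: any kubectl failure (including a connection error) is treated as "gone".
wait_for_namespace_gone() {
  ns="$1"
  tries="${2:-30}"   # number of polls before giving up (assumed default)
  while [ "$tries" -gt 0 ]; do
    if ! kubectl get namespace "$ns" >/dev/null 2>&1; then
      echo "namespace $ns is gone"
      return 0
    fi
    tries=$((tries - 1))
    sleep 2
  done
  echo "namespace $ns still exists" >&2
  return 1
}

# Usage (needs access to the cluster):
#   remove_stale_apiservice v1beta1.metrics.k8s.io
#   wait_for_namespace_gone test-b
```

Keep in mind that anything relying on the metrics.k8s.io API stops working until a healthy backend registers the APIService again.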
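Final thoughts
After going through the official documentation and reflecting on my own experience, here is how I understand what happened. kube-state-metrics watches the state of the objects in the K8S cluster and exposes it as metrics for Prometheus to scrape; it does not feed Metrics Server. The metrics.k8s.io API that appeared in the errors is an aggregated API. It is normally served by Metrics Server, and in kube-prometheus it is registered by prometheus-adapter instead, whose two replicas were also stuck in ImagePullBackOff in the Pod listing above, so that is most likely the component directly responsible. With no healthy backend behind that APIService, the API Server answered requests for metrics.k8s.io/v1beta1 with 503. Consumers of that API such as HPA and kubectl top are affected, and so is namespace deletion: before a namespace can be removed, the namespace controller has to discover every API group in order to delete the namespace's contents, and when discovery fails for one group it cannot confirm that everything is gone, so the namespace stays in Terminating. In other words, a broken monitoring component that I had forgotten about made an aggregated API unavailable, and that in turn blocked namespace deletion.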