Troubleshooting
1. Logging in to KubeSphere fails with: request to http://ks-apiserver/oauth/token failed, reason: getaddrinfo EAI_AGAIN ks-apiserver
kubectl -n kube-system edit cm coredns -o yaml  # edit the CoreDNS ConfigMap and comment out the forward block, as follows
# forward . /etc/resolv.conf {
# max_concurrent 1000
# }
kubectl delete pod -n kube-system $(kubectl get pod -n kube-system | grep coredns | awk '{print $1}')  # delete the CoreDNS pods so they are recreated
vim /etc/resolv.conf
nameserver 8.8.8.8
nameserver 1.1.1.1
2. main.go:39: failed to initialize: dial tcp: lookup nightingale-database on 169.254.25.10:53: no such host
This is also fixed by editing the CoreDNS ConfigMap:
kubectl -n kube-system edit cm coredns -o yaml
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
          ttl 30
        }
        prometheus :9153
        loop
        cache 30
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"Corefile":".:53 {\n errors\n health {\n lameduck 5s\n }\n ready\n kubernetes cluster.local in-addr.arpa ip6.arpa {\n pods insecure\n fallthrough in-addr.arpa ip6.arpa\n ttl 30\n }\n prometheus :9153\n forward . /etc/resolv.conf {\n prefer_udp\n max_concurrent 1000\n }\n cache 30\n loop\n reload\n loadbalance\n}\n"},"kind":"ConfigMap","metadata":{"annotations":{},"labels":{"addonmanager.kubernetes.io/mode":"EnsureExists"},"name":"coredns","namespace":"kube-system"}}
  creationTimestamp: "2025-04-23T10:12:47Z"
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
  name: coredns
  namespace: kube-system
  resourceVersion: "313641"
  uid: d905e013-1661-49b2-b2d9-1abfecec6294
Most name-resolution problems are related to CoreDNS.
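Both fixes above change whether the forward block is active in the Corefile, so it can help to confirm the state of the live ConfigMap after editing. The following is a minimal sketch, not part of the original notes: corefile_has_forward is a hypothetical helper name, and it only checks for an uncommented forward line.

```shell
# corefile_has_forward (hypothetical helper): reads a Corefile on stdin and
# exits 0 if an uncommented `forward` directive is present, 1 otherwise.
corefile_has_forward() {
  grep -Eq '^[[:space:]]*forward[[:space:]]'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl -n kube-system get cm coredns -o jsonpath='{.data.Corefile}' \
#     | corefile_has_forward && echo "forward is active" || echo "forward is disabled"
```

A line such as "# forward . /etc/resolv.conf {" is treated as disabled because the pattern requires forward to be the first non-blank token on the line.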
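3. All local NodePort services are unreachable
Check the Services.
Check kube-proxy:
Failed to get properties: Connection timed out
Check the server's resources. This step is important: if pods cannot be scheduled, the failure will persist.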
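The three checks above could be run roughly as follows. This is a sketch under assumptions: pending_pods is a hypothetical helper, the column layout is that of `kubectl get pods -A --no-headers`, and the commands in the usage comments are common choices rather than the ones originally used.

```shell
# pending_pods (hypothetical helper): reads `kubectl get pods -A --no-headers`
# output on stdin and prints namespace/name for every pod in Pending state.
# Columns assumed: NAMESPACE NAME READY STATUS RESTARTS AGE.
pending_pods() {
  awk '$4 == "Pending" { print $1 "/" $2 }'
}

# Usage against a live cluster:
#   kubectl get svc -A                                  # confirm the NodePort mappings
#   kubectl -n kube-system get pods -o wide | grep kube-proxy
#   kubectl get pods -A --no-headers | pending_pods     # pods the scheduler could not place
#   kubectl describe node <node-name>                   # see the "Allocated resources" section
#   free -h; df -h                                      # memory and disk on the host
```

Pods that stay Pending usually point back to the resource check: the scheduler has no node with enough free CPU, memory, or disk to place them.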
4. CoreDNS readiness probe fails
CoreDNS reports: Readiness probe failed: Get "http://10.233.106.12:8181/ready": dial tcp 10.233.106.12:8181: connect: connection refused
Check the readiness probe configuration:
kubectl get deployment -n kube-system coredns -o yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8181
  initialDelaySeconds: 3
  periodSeconds: 5
CoreDNS configuration:
Corefile: |
  .:53 {
      ready :8181  # make sure this line exists (a bare "ready" also listens on :8181 by default)
      # other plugins (errors, health, kubernetes, etc.)
  }
Check the Events section for OOMKilled or resource-contention messages; if any are present, increase the resources.
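Besides reading the Events output by eye, the last termination reason of each CoreDNS container can be queried directly. The sketch below assumes the CoreDNS pods carry the label k8s-app=kube-dns (common, but not confirmed for this cluster); oom_killed is a hypothetical helper name.

```shell
# oom_killed (hypothetical helper): reads "pod<TAB>reason" lines on stdin and
# prints the pods whose last container termination reason is OOMKilled.
oom_killed() {
  awk -F '\t' '$2 == "OOMKilled" { print $1 }'
}

# Usage (assumes CoreDNS pods are labelled k8s-app=kube-dns):
#   kubectl -n kube-system get pod -l k8s-app=kube-dns \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}' \
#     | oom_killed
#   kubectl -n kube-system describe pod <coredns-pod>   # full Events section
```

If any pod is listed, raising the memory limit in the coredns Deployment is the usual next step.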
5. Sudden etcd failure
[root@master01 ~]# ETCDCTL_API=3 etcdctl endpoint status
{"level":"warn","ts":"2025-04-29T22:28:39.75038+0800","logger":"etcd-client","caller":"v3@v3.5.13/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000474000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"error reading server preface: EOF\""}
Failed to get the status of endpoint 127.0.0.1:2379 (context deadline exceeded)
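The notes do not record the cause. One common reason for "error reading server preface: EOF" is that etcdctl connects in plaintext to an endpoint that only accepts TLS, so it may be worth retrying with the client certificates. The sketch below only builds the command. etcd_status_cmd is a hypothetical helper, and the certificate paths are passed as arguments because their location depends on how the cluster was installed.

```shell
# etcd_status_cmd (hypothetical helper): prints, but does not run, an etcdctl
# status command with TLS flags.
# Args: CA cert path, client cert path, client key path, [endpoint].
# The paths are assumptions to be filled in for the actual cluster.
etcd_status_cmd() {
  endpoint="${4:-https://127.0.0.1:2379}"
  printf 'ETCDCTL_API=3 etcdctl --endpoints=%s --cacert=%s --cert=%s --key=%s endpoint status -w table\n' \
    "$endpoint" "$1" "$2" "$3"
}

# Usage: print the command, review it, then run it.
#   etcd_status_cmd /path/to/ca.pem /path/to/client.pem /path/to/client-key.pem
#   systemctl status etcd        # if etcd runs as a systemd service
#   journalctl -u etcd -n 100    # recent etcd logs
```

If the TLS-enabled call also times out, the etcd service state and its logs are the next place to look.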