【AIOps?】基于DeepSeek逐步分析k8s集群故障记录(集群因特殊网络环境,禁用公网网卡eth1后全面崩溃) - 教程
192.168.1.100 k8s服务器停止eth1后集群故障分析处理
丢失了默认路由
丢失了DNS服务器
查看节点的详细信息,找到 InternalIP
[root@k8s-192-168-1-100 ~]# kubectl describe node k8s-192-168-1-100
Name: k8s-192-168-1-100
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=k8s-192-168-1-100
kubernetes.io/os=linux
kubernetes.io/role=master
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 25 Sep 2025 16:18:06 +0800
Taints:
Unschedulable: false
Lease:
HolderIdentity: k8s-192-168-1-100
AcquireTime:
RenewTime: Fri, 31 Oct 2025 09:04:21 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Thu, 30 Oct 2025 17:40:05 +0800 Thu, 30 Oct 2025 17:40:05 +0800 CalicoIsUp Calico is running on this node
MemoryPressure False Fri, 31 Oct 2025 09:04:11 +0800 Thu, 23 Oct 2025 10:07:49 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 31 Oct 2025 09:04:11 +0800 Thu, 23 Oct 2025 10:07:49 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 31 Oct 2025 09:04:11 +0800 Thu, 23 Oct 2025 10:07:49 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 31 Oct 2025 09:04:11 +0800 Thu, 23 Oct 2025 10:07:49 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.1.100
Hostname: k8s-192-168-1-100
Capacity:
cpu: 16
ephemeral-storage: 437640768Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 31571840Ki
pods: 110
Allocatable:
cpu: 16
ephemeral-storage: 403329731122
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 31264640Ki
pods: 110
System Info:
Machine ID: 0fd67226186546c7933699615ffe86ba
System UUID: b71cb723-164f-e445-bdd3-14cff4dfc0a4
Boot ID: 98aabaf9-c9bc-4845-82cd-e1b459ee596a
Kernel Version: 5.10.0-216.0.0.115.oe2203sp4.x86_64
OS Image: openEuler 22.03 (LTS-SP4)
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://2.1.1
Kubelet Version: v1.33.1
Kube-Proxy Version:
PodCIDR: 172.20.0.0/24
PodCIDRs: 172.20.0.0/24
Non-terminated Pods: (12 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system calico-kube-controllers-647ddc7bfd-www6d 0 (0%) 0 (0%) 0 (0%) 0 (0%) 35d
kube-system calico-node-z8lks 250m (1%) 0 (0%) 0 (0%) 0 (0%) 35d
kube-system coredns-5c4d969fb-lsnb9 100m (0%) 0 (0%) 70Mi (0%) 500Mi (1%) 15h
kube-system metrics-server-74f6d6fdd5-pt9dw 100m (0%) 0 (0%) 200Mi (0%) 0 (0%) 15h
kube-system node-local-dns-zdb22 25m (0%) 0 (0%) 5Mi (0%) 0 (0%) 35d
kube-system traefik-ingress-controller-jlc4g 100m (0%) 500m (3%) 50Mi (0%) 200Mi (0%) 13h
kubernetes-dashboard kubernetes-dashboard-api-8458b5cfdc-tdrsf 100m (0%) 250m (1%) 200Mi (0%) 400Mi (1%) 15h
kubernetes-dashboard kubernetes-dashboard-auth-5b7f84fcd6-hzcp8 100m (0%) 250m (1%) 200Mi (0%) 400Mi (1%) 34d
kubernetes-dashboard kubernetes-dashboard-kong-648658d45f-2qps7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7d22h
kubernetes-dashboard kubernetes-dashboard-metrics-scraper-84cdc96677-m5xww 100m (0%) 250m (1%) 200Mi (0%) 400Mi (1%) 34d
kubernetes-dashboard kubernetes-dashboard-web-755f6d8c86-9ks4j 100m (0%) 250m (1%) 200Mi (0%) 400Mi (1%) 34d
lygame-service lygamelobbyserverv2-655d599f9f-4s8r8 250m (1%) 500m (3%) 512Mi (1%) 1Gi (3%) 12h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1225m (7%) 2 (12%)
memory 1637Mi (5%) 3324Mi (10%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
Internal IP正确
路由表
[root@k8s-192-168-1-100 ~]# ip route show
default via 192.168.0.1 dev eth1 proto dhcp metric 101
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.18.0.0/16 dev br-576d7ac025d3 proto kernel scope link src 172.18.0.1 linkdown
172.19.0.0/16 dev br-927deafb5155 proto kernel scope link src 172.19.0.1
blackhole 172.20.18.64/26 proto bird
172.20.18.79 dev calicd0fb857d5c scope link
172.20.18.83 dev cali64c57a3c4b8 scope link
172.20.18.88 dev cali64051cfd375 scope link
172.20.18.90 dev caliddf63ee0f2c scope link
172.20.18.99 dev calib8efd0bcce8 scope link
172.20.18.101 dev cali68be5e0bde2 scope link
172.20.18.104 dev cali62f5f0ab912 scope link
172.20.18.109 dev calif85593b14aa scope link
172.20.18.116 dev cali8d712d27eff scope link
192.168.0.0/24 dev eth1 proto kernel scope link src 192.168.0.245 metric 101
192.168.4.0/24 dev eth0 proto kernel scope link src 192.168.1.100 metric 100
找到了!这就是问题的根源!
ip route show 的输出给出了决定性的证据,它完美地解释了所有现象。
根本原因分析
让我们来解读这张路由表:
default via 192.168.0.1 dev eth1
这是最致命的一条。它的意思是:操作系统级别的默认路由指向了eth1网卡。任何目的地址不在此路由表中其他条目里的数据包,都会被发往192.168.0.1,并且从eth1接口出去。192.168.4.0/24 dev eth0 ...
这条路由确保了eth0网段内部(比如你的宿主机访问其他内网机器)的通信是正常的。172.20.x.x dev calixxx ...
这些是 Calico 为节点上的 Pod 创建的路由,确保了本节点 Pod 之间的通信是正常的。
故障是如何发生的?
- 节点身份 (
kubelet):kubelet正确地使用了eth0的 IP (192.168.1.100) 向 API Server 注册。所以kubectl describe node看到的是eth0的 IP。这部分是正常的。 - 网络路径 (Pod 流量): 当 CoreDNS Pod 尝试连接 API Server 的虚拟 IP (
10.68.0.1) 时,这个 IP 地址不在任何具体的路由条目中(它不是 Pod IP,不是docker0网段,也不是eth0/eth1的网段)。 - 匹配默认路由: 根据最长匹配原则,数据包找不到更具体的路由,最终匹配了那条
default via ... dev eth1的路由。 - 流量走向: 因此,CoreDNS Pod 发出的连接请求,被操作系统强制从
eth1网卡发出。然而,eth1在当时是DOWN的,所以连接请求发出后石沉大海,导致no route to host错误。当你把eth1UP起来后,这条路径通了,故障就消失了。
结论:问题出在操作系统的默认路由上,而不是 Kubernetes 内部。
补充,192.168.4.0/24 网络是依靠交换机组建的一个局域网。没有路由器,即:根本不存在默认路由
感谢这个关键补充!这个信息让整个问题的逻辑链条变得无比清晰。
你说的“没有路由器,即:根本不存在默认路由”完美地解释了一切,并且它指向了唯一正确的解决方案。
逻辑闭环:为什么问题会发生?
eth0(192.168.1.100): 位于一个纯二层交换网络192.168.4.0/24。这个网络是“死胡同”,它只能连接到同一网段的设备,没有出口到其他任何网络(比如10.68.0.0/24或公网)。eth1(192.168.0.245): 连接到有路由器的网络,拥有真正的默认路由 (default via 192.168.0.1),是这台服务器的唯一网络出口。- Kubernetes 的身份与路径:
kubelet聪明地选择了eth0的 IP 作为节点身份,因为这个 IP 是内网稳定的标识。- 但是,当 Pod 需要访问一个不在
192.168.4.0/24网段的地址时(比如 API Server 的10.68.0.1),操作系统别无选择,必须通过eth1的默认路由出去。eth0根本没有能力把数据包发到别的地方去。
所以,你的集群从一开始就存在一个设计上的矛盾:用一个没有出口的网卡地址作为身份,却依赖另一个有出口的网卡来完成所有跨网段的通信。
这就像一个家庭有两个门牌号,一个(eth0)是内部的户籍地址,另一个(eth1)是真正能通往外界的大门。平时大门开着,一切正常。一旦大门(eth1)关了,人就出不去了。
唯一正确的解决方案:策略路由
既然 eth0 没有默认路由,我们不能简单地添加一个。我们需要更高级的工具:策略路由。
策略路由允许我们创建规则,告诉操作系统:“来自特定源地址的数据包,请使用特定的路由表,即使它的目的地是默认的”。
我们的目标是:
所有从 Kubernetes Pod IP 地址发出的流量,都强制从 eth0 走,并被引导到正确的下一跳(可能是其他节点或网络设备)。
但是等等,你说了 eth0 没有路由器。这意味着数据包从 eth0 出去后,需要其他 K8s 节点或网络中的其他设备来帮忙转发。
这通常指向一个标准的 K8s 网络模式:Pod 网络流量通过节点间的隧道(如 VXLAN)或直接路由来交换。
具体操作步骤
我们需要做两件事:
- 修复 CNI 配置:确保 Calico 知道应该使用
eth0来建立节点间的网络隧道。 - 修复主机路由:即使 CNI 配置正确,主机本身的默认路由也可能干扰。我们需要用策略路由来彻底解决这个问题。
第 1 步:修复 Calico 配置(治本)
这是最重要的一步。我们需要明确告诉 Calico,不要自动检测,而是使用 eth0 的 IP。
kubectl edit configmap calico-config -n kube-system
找到 cni_network_config 部分,在 ipam 字段里,添加或修改 ip_autodetection_method。
# ... 其他配置 ...
"cni_network_config": {
"name": "k8s-pod-network",
"cniVersion": "0.3.1",
"plugins": [
{
"type": "calico",
"ipam": {
"type": "calico-ipam",
"assign_ipv4": "true",
"ipv4_pools": ["172.20.0.0/24"], // 确认这个是你的 Pod CIDR
"ip_autodetection_method": "interface=eth0" // <-- 添加或修改这一行!
},
"policy": {
"type": "k8s"
},
"kubernetes": {
"kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
}
},
{
"type": "portmap",
"snat": true,
"capabilities": {"portMappings": true}
}
]
}
# ... 其他配置 ...
修改后,你需要重启所有 Calico 的 Pod 来使配置生效:
kubectl delete pods -n kube-system -l k8s-app=calico-node
等待它们重新启动并变为 Running 状态。
第 2 步:配置策略路由(兜底-暂时不操作)
为了确保万无一失,我们可以在操作系统层面设置策略路由,防止任何意外的流量泄露到 eth1。
创建新的路由表:
# 编辑 /etc/iproute2/rt_tables,添加一个新表 sudo vim /etc/iproute2/rt_tables # 在文件末尾添加一行 100 k8s-table为 k8s-table 添加默认路由:
这个路由表需要一个默认路由。由于eth0没有网关,我们不能写via。在 Calico 的 BGP 模式下,节点之间会互相学习路由。这里我们先假设 Calico 会处理节点间的路由。如果不行,我们需要指向一个能转发流量的核心节点。# 暂时我们先创建一个空的规则,后续根据情况调整 # 在某些 Calico 配置下,这一步可能不是必须的,但加上更保险 # 如果你的集群有多个节点,可以尝试指向另一个 master 节点的 eth0 IP # sudo ip route add default via <另一个 master 节点的 eth0 IP> dev eth0 table k8s-table创建策略规则:
这是核心。我们告诉系统:所有源地址是 Kubernetes Pod CIDR (172.20.0.0/24) 的包,都去查k8s-table这个路由表。# 添加规则 sudo ip rule add from 172.20.0.0/24 table k8s-table使配置永久化:
将上述ip route和ip rule命令添加到网络启动脚本中,例如/etc/rc.local,或者根据你的发行版使用NetworkManager的 dispatcher 脚本。
第3步:停止eth1
简化版建议
对于你的情况,第 1 步(修复 Calico 配置)通常就足够了。因为 Calico 的 BGP 模式会自动在所有节点之间建立路由,告诉每个节点如何到达其他节点上的 Pod 网段。当你强制 Calico 使用 eth0 后,它会自动通过 eth0 建立这些路由,从而绕开了操作系统的默认路由问题。
所以,请先执行第 1 步,修改 Calico 的 ConfigMap 并重启 Pod。
修改后,再次 down 掉 eth1,然后观察 kubectl get pods -n kube-system 和 CoreDNS 的日志。如果问题解决,那就万事大吉。如果问题依旧,我们再深入执行第 2 步的策略路由。
这次的排查,我们从应用层、K8s层,一直深入到了操作系统的核心路由机制,这是一个非常典型的复杂网络故障诊断过程。你的补充信息是解决问题的关键!
重启后大规模故障(没有默认路由!)
我在维护内网环境部署的k8s集群。之前集群有eth0(192.168.1.100 无公网,无路由器,即:无网关地址),eth1(192.168.0.3 有路由,有公网)。集群地址是192.168.1.100,安装完毕后。为了网络安全,我禁用了eth1,重启服务器后。集群故障。接下来我会提供日志,请基于日志指导我一步步恢复集群。注意,每次交互只处理一个问题,我给你提供日志,你提供一个指令。如此循环。这次的日志是:
[root@k8s-192-168-1-100 ~]# kubectl describe pod -n kube-system calico-node-h4x7g
Name: calico-node-h4x7g
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: calico-node
Node: k8s-192-168-1-100/192.168.1.100
Start Time: Fri, 31 Oct 2025 10:46:46 +0800
Labels: controller-revision-hash=68c67d9f8f
k8s-app=calico-node
pod-template-generation=1
Annotations:
Status: Running
IP: 192.168.1.100
IPs:
IP: 192.168.1.100
Controlled By: DaemonSet/calico-node
Init Containers:
install-cni:
Container ID: containerd://1de7c4c0fcc7dc746a7611d3e1bc6f37d99a98ce43132f2b7dce8b0d580d009e
Image: easzlab.io.local:5000/easzlab/cni:v3.28.4
Image ID: easzlab.io.local:5000/easzlab/cni@sha256:55818ec0c4bdfc1467805659b04eeddefc7c8539e192944c4141d4363a8fab4e
Port:
Host Port:
Command:
/opt/cni/bin/install
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 31 Oct 2025 10:46:46 +0800
Finished: Fri, 31 Oct 2025 10:46:46 +0800
Ready: True
Restart Count: 0
Environment Variables from:
kubernetes-services-endpoint ConfigMap Optional: true
Environment:
CNI_CONF_NAME: 10-calico.conflist
CNI_NETWORK_CONFIG: Optional: false
ETCD_ENDPOINTS: Optional: false
CNI_MTU: Optional: false
SLEEP: false
Mounts:
/calico-secrets from etcd-certs (rw)
/host/etc/cni/net.d from cni-net-dir (rw)
/host/opt/cni/bin from cni-bin-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7mr6d (ro)
mount-bpffs:
Container ID: containerd://1751c8a4751566eef436250ebdc9a465f813e2390cf21d6de98ce6b4efff0977
Image: easzlab.io.local:5000/easzlab/node:v3.28.4
Image ID: easzlab.io.local:5000/easzlab/node@sha256:cec640f3131eb91fece8b7dc14f5241b5192fe7faa107f91e2497c09332b96c8
Port:
Host Port:
Command:
calico-node
-init
-best-effort
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 31 Oct 2025 10:46:48 +0800
Finished: Fri, 31 Oct 2025 10:46:48 +0800
Ready: True
Restart Count: 0
Environment:
Mounts:
/nodeproc from nodeproc (ro)
/sys/fs from sys-fs (rw)
/var/run/calico from var-run-calico (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7mr6d (ro)
Containers:
calico-node:
Container ID: containerd://3a477b600abe7c4d68d99791b0b583f01d38963bccfe3f35d6fe53f0b04cb18f
Image: easzlab.io.local:5000/easzlab/node:v3.28.4
Image ID: easzlab.io.local:5000/easzlab/node@sha256:cec640f3131eb91fece8b7dc14f5241b5192fe7faa107f91e2497c09332b96c8
Port:
Host Port:
State: Running
Started: Fri, 31 Oct 2025 10:46:49 +0800
Ready: True
Restart Count: 0
Requests:
cpu: 250m
Liveness: exec [/bin/calico-node -felix-live -bird-live] delay=10s timeout=10s period=10s #success=1 #failure=6
Readiness: exec [/bin/calico-node -felix-ready -bird-ready] delay=0s timeout=10s period=10s #success=1 #failure=3
Environment Variables from:
kubernetes-services-endpoint ConfigMap Optional: true
Environment:
ETCD_ENDPOINTS: Optional: false
ETCD_CA_CERT_FILE: Optional: false
ETCD_KEY_FILE: Optional: false
ETCD_CERT_FILE: Optional: false
CALICO_K8S_NODE_REF: (v1:spec.nodeName)
CALICO_NETWORKING_BACKEND: Optional: false
CLUSTER_TYPE: k8s,bgp
IP: autodetect
IP_AUTODETECTION_METHOD: can-reach=192.168.1.100
CALICO_IPV4POOL_IPIP: Always
FELIX_IPINIPMTU: Optional: false
FELIX_VXLANMTU: Optional: false
FELIX_WIREGUARDMTU: Optional: false
CALICO_IPV4POOL_CIDR: 172.20.0.0/16
CALICO_DISABLE_FILE_LOGGING: true
FELIX_DEFAULTENDPOINTTOHOSTACTION: ACCEPT
FELIX_IPV6SUPPORT: false
FELIX_HEALTHENABLED: true
FELIX_KUBENODEPORTRANGES: 30000:32767
FELIX_PROMETHEUSMETRICSENABLED: false
Mounts:
/calico-secrets from etcd-certs (rw)
/host/etc/cni/net.d from cni-net-dir (rw)
/lib/modules from lib-modules (ro)
/run/xtables.lock from xtables-lock (rw)
/sys/fs/bpf from bpffs (rw)
/var/lib/calico from var-lib-calico (rw)
/var/log/calico/cni from cni-log-dir (ro)
/var/run/calico from var-run-calico (rw)
/var/run/nodeagent from policysync (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7mr6d (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType:
var-run-calico:
Type: HostPath (bare host directory volume)
Path: /var/run/calico
HostPathType: DirectoryOrCreate
var-lib-calico:
Type: HostPath (bare host directory volume)
Path: /var/lib/calico
HostPathType: DirectoryOrCreate
xtables-lock:
Type: HostPath (bare host directory volume)
Path: /run/xtables.lock
HostPathType: FileOrCreate
sys-fs:
Type: HostPath (bare host directory volume)
Path: /sys/fs/
HostPathType: DirectoryOrCreate
bpffs:
Type: HostPath (bare host directory volume)
Path: /sys/fs/bpf
HostPathType: Directory
nodeproc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType:
cni-bin-dir:
Type: HostPath (bare host directory volume)
Path: /opt/cni/bin
HostPathType: DirectoryOrCreate
cni-net-dir:
Type: HostPath (bare host directory volume)
Path: /etc/cni/net.d
HostPathType:
cni-log-dir:
Type: HostPath (bare host directory volume)
Path: /var/log/calico/cni
HostPathType:
etcd-certs:
Type: Secret (a volume populated by a Secret)
SecretName: calico-etcd-secrets
Optional: false
policysync:
Type: HostPath (bare host directory volume)
Path: /var/run/nodeagent
HostPathType: DirectoryOrCreate
kube-api-access-7mr6d:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
Optional: false
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: :NoSchedule op=Exists
:NoExecute op=Exists
CriticalAddonsOnly op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 54s default-scheduler Successfully assigned kube-system/calico-node-h4x7g to k8s-192-168-1-100
Normal Pulled 54s kubelet Container image "easzlab.io.local:5000/easzlab/cni:v3.28.4" already present on machine
Normal Created 54s kubelet Created container: install-cni
Normal Started 54s kubelet Started container install-cni
Normal Pulled 53s kubelet Container image "easzlab.io.local:5000/easzlab/node:v3.28.4" already present on machine
Normal Created 53s kubelet Created container: mount-bpffs
Normal Started 52s kubelet Started container mount-bpffs
Normal Pulled 52s kubelet Container image "easzlab.io.local:5000/easzlab/node:v3.28.4" already present on machine
Normal Created 52s kubelet Created container: calico-node
Normal Started 51s kubelet Started container calico-node
Warning Unhealthy 49s (x2 over 50s) kubelet Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
deppseek
根据日志分析,Calico Pod 已经成功启动,但就绪探针失败,原因是 BIRD(BGP 守护进程)无法连接到控制套接字。这通常与节点网络配置变化有关。
请检查当前节点的网络路由表,运行以下命令并提供输出:
ip route show
输入
[root@k8s-192-168-1-100 ~]# ip route show
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.18.0.0/16 dev br-576d7ac025d3 proto kernel scope link src 172.18.0.1 linkdown
172.19.0.0/16 dev br-927deafb5155 proto kernel scope link src 172.19.0.1
blackhole 172.20.18.64/26 proto bird
172.20.18.118 dev cali572b382024f scope link
172.20.18.119 dev calib8efd0bcce8 scope link
172.20.18.120 dev cali62f5f0ab912 scope link
172.20.18.121 dev calif85593b14aa scope link
172.20.18.122 dev calicd0fb857d5c scope link
172.20.18.123 dev cali7c068dbd676 scope link
172.20.18.124 dev cali68be5e0bde2 scope link
172.20.18.125 dev cali64c57a3c4b8 scope link
172.20.18.126 dev caliddf63ee0f2c scope link
172.20.18.127 dev cali64051cfd375 scope link
192.168.4.0/24 dev eth0 proto kernel scope link src 192.168.1.100 metric 100
deepseek
路由表显示缺少默认路由。由于禁用了 eth1(原本有网关的接口),现在系统没有默认路由,这会导致网络通信问题。
请检查当前网络接口状态,运行以下命令:
ip addr show
deepseek
我看到 eth1 接口处于 DOWN 状态,而 eth0 是 UP 状态。现在需要为 eth0 添加默认路由来恢复网络连接。
请运行以下命令添加默认路由(假设你的网络网关是 192.168.4.1,如果不是请告诉我正确的网关):
ip route add default via 192.168.4.1 dev eth0
如果不知道网关地址,可以先尝试:
ip route add default via 192.168.4.1 dev eth0 || echo "请提供正确的网关地址"
如果命令失败,请告诉我你的网络中正确的网关地址是什么。
输入
注意:我的内网192.168.4.0/24 网络是基于交换机组件的网络,没有路由器。也就没有默认路由。是否一定要接入一个路由器?
deepseek
明白了。在没有路由器的二层网络环境中,确实不需要默认路由。Calico 应该能在这种环境中正常工作。
让我们检查 Calico 的 BIRD 进程状态,运行以下命令:
kubectl logs -n kube-system calico-node-h4x7g -c calico-node | grep -i bird
输入
[root@k8s-192-168-1-100 ~]# kubectl logs -n kube-system calico-node-h4x7g -c calico-node | grep -i bird
bird: Unable to open configuration file /etc/calico/confd/config/bird6.cfg: No such file or directory
bird: Unable to open configuration file /etc/calico/confd/config/bird.cfg: No such file or directory
2025-10-31 02:46:50.215 [INFO][81] confd/resource.go 290: Target config /etc/calico/confd/config/bird_aggr.cfg has been updated
2025-10-31 02:46:50.215 [INFO][81] confd/resource.go 290: Target config /etc/calico/confd/config/bird6_aggr.cfg has been updated
2025-10-31 02:46:50.215 [INFO][81] confd/resource.go 290: Target config /etc/calico/confd/config/bird_ipam.cfg has been updated
2025-10-31 02:46:50.215 [INFO][81] confd/resource.go 290: Target config /etc/calico/confd/config/bird6_ipam.cfg has been updated
2025-10-31 02:46:50.217 [INFO][81] confd/resource.go 290: Target config /etc/calico/confd/config/bird6.cfg has been updated
2025-10-31 02:46:50.217 [INFO][81] confd/resource.go 290: Target config /etc/calico/confd/config/bird.cfg has been updated
bird: device1: Initializing
bird: bird: direct1: Initializingdevice1: Initializing
bird: device1: Starting
bird: direct1: Initializing
bird: device1: Starting
bird: device1: Connected to table master
bird: device1: State changed to feed
bird: direct1: Starting
bird: direct1: Connected to table master
bird: direct1: State changed to feed
bird: Graceful restart started
bird: Graceful restart done
bird: Started
bird: device1: State changed to up
bird: device1: Connected to table masterbird: direct1: State changed to up
bird: device1: State changed to feed
bird: direct1: Starting
bird: direct1: Connected to table master
bird: direct1: State changed to feed
bird: Graceful restart started
bird: Graceful restart done
bird: Started
bird: device1: State changed to up
bird: direct1: State changed to up
deepseek
BIRD 日志显示进程已经正常启动,但之前就绪探针失败可能是由于启动过程中的短暂问题。让我们检查当前 Calico Pod 的就绪状态:
kubectl get pod -n kube-system calico-node-h4x7g -o wide
同时,也检查其他核心组件的状态:
kubectl get pods -n kube-system -o wide
输入
[root@k8s-192-168-1-100 ~]# kubectl logs -n kube-system calico-node-h4x7g -c calico-node | grep -i bird
bird: Unable to open configuration file /etc/calico/confd/config/bird6.cfg: No such file or directory
bird: Unable to open configuration file /etc/calico/confd/config/bird.cfg: No such file or directory
2025-10-31 02:46:50.215 [INFO][81] confd/resource.go 290: Target config /etc/calico/confd/config/bird_aggr.cfg has been updated
2025-10-31 02:46:50.215 [INFO][81] confd/resource.go 290: Target config /etc/calico/confd/config/bird6_aggr.cfg has been updated
2025-10-31 02:46:50.215 [INFO][81] confd/resource.go 290: Target config /etc/calico/confd/config/bird_ipam.cfg has been updated
2025-10-31 02:46:50.215 [INFO][81] confd/resource.go 290: Target config /etc/calico/confd/config/bird6_ipam.cfg has been updated
2025-10-31 02:46:50.217 [INFO][81] confd/resource.go 290: Target config /etc/calico/confd/config/bird6.cfg has been updated
2025-10-31 02:46:50.217 [INFO][81] confd/resource.go 290: Target config /etc/calico/confd/config/bird.cfg has been updated
bird: device1: Initializing
bird: bird: direct1: Initializingdevice1: Initializing
bird: device1: Starting
bird: direct1: Initializing
bird: device1: Starting
bird: device1: Connected to table master
bird: device1: State changed to feed
bird: direct1: Starting
bird: direct1: Connected to table master
bird: direct1: State changed to feed
bird: Graceful restart started
bird: Graceful restart done
bird: Started
bird: device1: State changed to up
bird: device1: Connected to table masterbird: direct1: State changed to up
bird: device1: State changed to feed
bird: direct1: Starting
bird: direct1: Connected to table master
bird: direct1: State changed to feed
bird: Graceful restart started
bird: Graceful restart done
bird: Started
bird: device1: State changed to up
bird: direct1: State changed to up
deepseek
根据Pod状态,你的CoreDNS和其他关键组件都启动失败,这很可能是一个网络问题。既然Calico-node现在已经在运行,我们应该重点关注集群内的网络连通性和DNS解析。
以下是诊断问题的分步计划:
| 步骤 | 检查命令 | 目的/需要关注的内容 |
|---|---|---|
| 1. 检查CoreDNS日志 | kubectl logs -n kube-system coredns-5c4d969fb-lsnb9 | 查看CoreDNS中的错误,通常与网络连接或插件问题有关。 |
| 2. 验证网络配置 | kubectl get ippool -o yaml | 确认Calico的IP池(例如 172.20.0.0/16)没有与主机网络冲突。 |
| 3. 测试Pod网络 | kubectl run test-pod --image=busybox --rm -it --restart=Never -- nslookup kubernetes.default.svc.cluster.local | 从测试Pod验证DNS解析和网络连通性。 |
| 4. 检查Calico配置 | kubectl get configmap -n kube-system calico-config -o yaml | 确保 IP_AUTODETECTION_METHOD 正确设置(例如 can-reach=192.168.1.100)。 |
| 5. 检查已知问题 | 查看Calico版本和 kubectl logs -n kube-system calico-kube-controllers-647ddc7bfd-www6d | 新版Calico可能有bug;同时检查控制器日志中的RBAC或连接错误。 |
️ 后续步骤和潜在解决方案
根据诊断结果,常见的解决方案包括:
- 网络配置错误:如果存在CIDR冲突,可能需要重新配置Calico的IP池或调整自动检测方法。
- Calico版本Bug:如果怀疑是bug(如v3.28中的校验和卸载问题),考虑降级Calico或设置环境变量
FELIX_CHECKSUMOFFLOADENABLED=false。 - RBAC权限问题:如果Calico组件日志中出现"unauthorized"错误,需要修补
calico-nodeClusterRole以添加缺失的API权限。
请先运行第一个检查并分享CoreDNS日志。这将帮助我们确定下一个具体的命令。
输入
[root@k8s-192-168-1-100 ~]# kubectl logs -n kube-system coredns-5c4d969fb-lsnb9
maxprocs: Leaving GOMAXPROCS=16: CPU quota undefined
plugin/forward: no nameservers found
注意,一次只沟通一个问题。你提供一个指令,我提供一个指令的输出。不要一次提供多个指令。
deepseek(问题又绕到dns服务器上!)
根据 CoreDNS 日志显示 “plugin/forward: no nameservers found”,这是 DNS 转发配置问题。请检查 CoreDNS 的配置:
kubectl get configmap coredns -n kube-system -o yaml
输入
[root@k8s-192-168-1-100 ~]# kubectl get configmap coredns -n kube-system -o yaml
apiVersion: v1
data:
Corefile: |
.:53 {
errors
health {
lameduck 5s
}
ready
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
ttl 30
}
prometheus :9153
forward . /etc/resolv.conf {
max_concurrent 1000
}
cache 30
reload
loadbalance
}
kind: ConfigMap
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"v1","data":{"Corefile":".:53 {\n errors\n health {\n lameduck 5s\n }\n ready\n kubernetes cluster.local in-addr.arpa ip6.arpa {\n pods insecure\n fallthrough in-addr.arpa ip6.arpa\n ttl 30\n }\n prometheus :9153\n forward . /etc/resolv.conf {\n max_concurrent 1000\n }\n cache 30\n reload\n loadbalance\n}\n"},"kind":"ConfigMap","metadata":{"annotations":{},"labels":{"addonmanager.kubernetes.io/mode":"EnsureExists"},"name":"coredns","namespace":"kube-system"}}
creationTimestamp: "2025-09-25T08:22:09Z"
labels:
addonmanager.kubernetes.io/mode: EnsureExists
name: coredns
namespace: kube-system
resourceVersion: "711"
uid: cd7414ce-20ce-43b9-815f-dd365a3d4e5b
[root@k8s-192-168-1-100 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
[root@k8s-192-168-1-100 ~]#
deepseek
问题已明确:CoreDNS 配置指向 /etc/resolv.conf,但该文件为空,没有配置任何 DNS 服务器。
请检查当前节点的网络 DNS 配置:
nmcli dev show | grep DNS
输入
内网没有dns服务器,我可以在这里面写入一条nameserver 8.8.8.8 ,尽管它不可达。这样操作会比较简单。你的意见呢?
deepseek
不建议写入不可达的 DNS 服务器(如 8.8.8.8),因为:
- 每次 DNS 查询都会超时,导致性能问题
- 应用可能会因 DNS 查询超时而出现异常行为
- 集群内部服务发现可能会受影响
更好的解决方案是修改 CoreDNS 配置,移除对上游 DNS 的依赖。请运行:
kubectl edit configmap coredns -n kube-system
将 forward . /etc/resolv.conf 改为 forward . 127.0.0.1:53 或直接使用 loop 插件替代 forward 部分。
输入
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
Corefile: |
.:53 {
errors
health {
lameduck 5s
}
ready
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
ttl 30
}
prometheus :9153
forward . 127.0.0.1:53 {
max_concurrent 1000
}
cache 30
reload
loadbalance
}
kind: ConfigMap
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"v1","data":{"Corefile":".:53 {\n errors\n health {\n lameduck 5s\n }\n ready\n kubernetes cluster.local in-addr.arpa ip6.arpa {\n pods insecure\n fallthrough in-addr.arpa ip6.arpa\n ttl 30\n }\n prometheus :9153\n forward . /etc/resolv.conf {\n max_concurrent 1000\n }\n cache 30\n reload\n loadbalance\n}\n"},"kind":"ConfigMap","metadata":{"annotations":{},"labels":{"addonmanager.kubernetes.io/mode":"EnsureExists"},"name":"coredns","namespace":"kube-system"}}
creationTimestamp: "2025-09-25T08:22:09Z"
labels:
addonmanager.kubernetes.io/mode: EnsureExists
name: coredns
namespace: kube-system
resourceVersion: "711"
uid: cd7414ce-20ce-43b9-815f-dd365a3d4e5b
[root@k8s-192-168-1-100 ~]# kubectl edit configmap coredns -n kube-system
configmap/coredns edited
我已完成修改。
deepseek
很好!现在需要重启 CoreDNS Pod 来应用新的配置。请删除当前的 CoreDNS Pod,Deployment 会自动创建新的 Pod:
kubectl delete pod -n kube-system coredns-5c4d969fb-lsnb9
输入
等了一分钟,
[root@k8s-192-168-1-100 ~]# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-647ddc7bfd-www6d 1/1 Running 7 (52m ago) 35d
calico-node-h4x7g 1/1 Running 0 31m
coredns-5c4d969fb-l6smz 0/1 Running 0 54s
metrics-server-74f6d6fdd5-pt9dw 0/1 CrashLoopBackOff 10 (4m53s ago) 17h
node-local-dns-zdb22 0/1 CrashLoopBackOff 19 (21s ago) 35d
traefik-ingress-controller-jlc4g 0/1 CrashLoopBackOff 12 (4m16s ago) 15h
[root@k8s-192-168-1-100 ~]# kubectl describe pod -n kube-system coredns-5c4d969fb-l6smz
Name: coredns-5c4d969fb-l6smz
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Service Account: coredns
Node: k8s-192-168-1-100/192.168.1.100
Start Time: Fri, 31 Oct 2025 11:17:45 +0800
Labels: k8s-app=kube-dns
pod-template-hash=5c4d969fb
Annotations:
Status: Running
SeccompProfile: RuntimeDefault
IP: 172.20.18.67
IPs:
IP: 172.20.18.67
Controlled By: ReplicaSet/coredns-5c4d969fb
Containers:
coredns:
Container ID: containerd://6a744b25f7b307f1efb9ff8aaabf6f4032c0ad7284ed40fd7bdd9e144d761900
Image: easzlab.io.local:5000/easzlab/coredns:1.12.1
Image ID: easzlab.io.local:5000/easzlab/coredns@sha256:4f7a57135719628cf2070c5e3cbde64b013e90d4c560c5ecbf14004181f91998
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Running
Started: Fri, 31 Oct 2025 11:17:45 +0800
Ready: False
Restart Count: 0
Limits:
memory: 500Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dgzz2 (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
kube-api-access-dgzz2:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
Optional: false
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 68s default-scheduler Successfully assigned kube-system/coredns-5c4d969fb-l6smz to k8s-192-168-1-100
Normal Pulled 67s kubelet Container image "easzlab.io.local:5000/easzlab/coredns:1.12.1" already present on machine
Normal Created 67s kubelet Created container: coredns
Normal Started 67s kubelet Started container coredns
Warning Unhealthy 4s (x9 over 66s) kubelet Readiness probe failed: Get "http://172.20.18.67:8181/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 2s kubelet Liveness probe failed: Get "http://172.20.18.67:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
deepseek
检查新 CoreDNS Pod 的日志,看看是否有错误信息:
kubectl logs -n kube-system coredns-5c4d969fb-l6smz
输入
[root@k8s-192-168-1-100 ~]# kubectl logs -n kube-system coredns-5c4d969fb-l6smz
maxprocs: Leaving GOMAXPROCS=16: CPU quota undefined
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.32.3/tools/cache/reflector.go:251: failed to list *v1.EndpointSlice: Get "https://10.68.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.68.0.1:443: connect: no route to host
[ERROR] plugin/kubernetes: Unhandled Error
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.32.3/tools/cache/reflector.go:251: failed to list *v1.Namespace: Get "https://10.68.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.68.0.1:443: connect: no route to host
[ERROR] plugin/kubernetes: Unhandled Error
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.32.3/tools/cache/reflector.go:251: failed to list *v1.Service: Get "https://10.68.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.68.0.1:443: connect: no route to host
[ERROR] plugin/kubernetes: Unhandled Error
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
.:53
[INFO] plugin/reload: Running configuration SHA512 = d19d726ef175a3c3cdbfa12ac743b87ca2f8e605f29c73f07c3dfe3ba519356aa918de67771d7ae92a496ac2db5ee44f01dd53ff68cb0fa7548e20c93925b234
CoreDNS-1.12.1
linux/amd64, go1.24.1, 707c7c1
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.32.3/tools/cache/reflector.go:251: failed to list *v1.EndpointSlice: Get "https://10.68.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.68.0.1:443: connect: no route to host
[ERROR] plugin/kubernetes: Unhandled Error
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.32.3/tools/cache/reflector.go:251: failed to list *v1.Namespace: Get "https://10.68.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.68.0.1:443: connect: no route to host
[ERROR] plugin/kubernetes: Unhandled Error
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.32.3/tools/cache/reflector.go:251: failed to list *v1.Service: Get "https://10.68.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.68.0.1:443: connect: no route to host
[ERROR] plugin/kubernetes: Unhandled Error
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.32.3/tools/cache/reflector.go:251: failed to list *v1.Namespace: Get "https://10.68.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.68.0.1:443: connect: no route to host
[ERROR] plugin/kubernetes: Unhandled Error
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.32.3/tools/cache/reflector.go:251: failed to list *v1.Service: Get "https://10.68.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.68.0.1:443: connect: no route to host
[ERROR] plugin/kubernetes: Unhandled Error
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.32.3/tools/cache/reflector.go:251: failed to list *v1.EndpointSlice: Get "https://10.68.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.68.0.1:443: connect: no route to host
[ERROR] plugin/kubernetes: Unhandled Error
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.32.3/tools/cache/reflector.go:251: failed to list *v1.Namespace: Get "https://10.68.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.68.0.1:443: connect: no route to host
[ERROR] plugin/kubernetes: Unhandled Error
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.32.3/tools/cache/reflector.go:251: failed to list *v1.Service: Get "https://10.68.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.68.0.1:443: connect: no route to host
[ERROR] plugin/kubernetes: Unhandled Error
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.32.3/tools/cache/reflector.go:251: failed to list *v1.EndpointSlice: Get "https://10.68.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.68.0.1:443: connect: no route to host
[ERROR] plugin/kubernetes: Unhandled Error
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.32.3/tools/cache/reflector.go:251: failed to list *v1.Service: Get "https://10.68.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.68.0.1:443: connect: no route to host
[ERROR] plugin/kubernetes: Unhandled Error
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.32.3/tools/cache/reflector.go:251: failed to list *v1.Namespace: Get "https://10.68.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.68.0.1:443: connect: no route to host
[ERROR] plugin/kubernetes: Unhandled Error
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.32.3/tools/cache/reflector.go:251: failed to list *v1.EndpointSlice: Get "https://10.68.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.68.0.1:443: connect: no route to host
[ERROR] plugin/kubernetes: Unhandled Error
deepseek
问题明确了:CoreDNS 无法连接到 Kubernetes API 服务器 (10.68.0.1:443),错误是 “no route to host”。这表明 Pod 网络与 API 服务器之间的通信有问题。
检查 Kubernetes 服务的 ClusterIP 和端点:
kubectl get svc kubernetes -o wide
输入
[root@k8s-192-168-1-100 ~]# kubectl get svc kubernetes -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
kubernetes ClusterIP 10.68.0.1 443/TCP 35d
[root@k8s-192-168-1-100 ~]# ip route
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.18.0.0/16 dev br-576d7ac025d3 proto kernel scope link src 172.18.0.1 linkdown
172.19.0.0/16 dev br-927deafb5155 proto kernel scope link src 172.19.0.1
blackhole 172.20.18.64/26 proto bird
172.20.18.67 dev caliba4f2dd07ff scope link
172.20.18.118 dev cali572b382024f scope link
172.20.18.120 dev cali62f5f0ab912 scope link
172.20.18.121 dev calif85593b14aa scope link
172.20.18.122 dev calicd0fb857d5c scope link
172.20.18.123 dev cali7c068dbd676 scope link
172.20.18.124 dev cali68be5e0bde2 scope link
172.20.18.125 dev cali64c57a3c4b8 scope link
172.20.18.126 dev caliddf63ee0f2c scope link
172.20.18.127 dev cali64051cfd375 scope link
192.168.4.0/24 dev eth0 proto kernel scope link src 192.168.1.100 metric 100
请注意,我附加了路由信息。这台服务器由于关闭了eth1网卡,路由表的默认路由消失了。
deepseek
现在问题很清楚了:由于禁用了 eth1,默认路由丢失,导致 Pod 无法访问 Kubernetes API 服务器 (10.68.0.1)。
在单节点集群中,我们需要确保节点能够路由到自己的 ClusterIP。检查当前的网络策略和路由:
iptables -t nat -L KUBE-SERVICES | grep 10.68.0.1
输入
[root@k8s-192-168-1-100 ~]# iptables -t nat -L KUBE-SERVICES | grep 10.68.0.1
[root@k8s-192-168-1-100 ~]#
奇怪呢,没有默认路由对这个还有影响?
deepseek
确实有影响。没有默认路由会影响 iptables 规则和网络栈的某些行为。让我们检查 kube-proxy 是否正常运行,因为它负责设置 ClusterIP 的 iptables 规则:
kubectl get pods -n kube-system | grep proxy
输入
[root@k8s-192-168-1-100 ~]# kubectl get pods -n kube-system | grep proxy
[root@k8s-192-168-1-100 ~]# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-647ddc7bfd-www6d 1/1 Running 7 (63m ago) 35d
calico-node-h4x7g 1/1 Running 0 42m
coredns-5c4d969fb-l6smz 0/1 Running 6 (31s ago) 11m
metrics-server-74f6d6fdd5-pt9dw 0/1 CrashLoopBackOff 13 (6s ago) 17h
node-local-dns-zdb22 0/1 CrashLoopBackOff 21 (45s ago) 35d
traefik-ingress-controller-jlc4g 0/1 CrashLoopBackOff 16 (2m13s ago) 16h
…
deepseek
当前的检查思路是正确的。CoreDNS 无法就绪的根本原因是它无法连接到 Kubernetes API 服务器(10.68.0.1:443),而 kube-proxy 的缺失会导致 ClusterIP 服务的网络规则(如 iptables)无法正确设置,从而影响 Pod 到 API 服务器的通信。因此,检查 kube-proxy 的状态与解决 CoreDNS 故障直接相关。
请运行以下命令检查 kube-proxy DaemonSet 的状态:
kubectl get daemonset -n kube-system
输入
[root@k8s-192-168-1-100 ~]# kubectl get daemonset -n kube-system
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
calico-node 1 1 1 1 1 kubernetes.io/os=linux 35d
node-local-dns 1 1 0 1 0 35d
traefik-ingress-controller 1 1 0 1 0 2d1h
deepseek
您的提醒很对,我们需要聚焦在解决 CoreDNS 未就绪的问题上。检查思路确实正确,因为 kube-proxy 缺失直接影响 ClusterIP 服务(包括 Kubernetes API)的网络路由。
现在请检查是否有 kube-proxy 相关的配置残留:
kubectl get configmap -n kube-system | grep proxy
输入
[root@k8s-192-168-1-100 ~]# kubectl get configmap -n kube-system | grep proxy
[root@k8s-192-168-1-100 ~]# kubectl get configmap -n kube-system
NAME DATA AGE
calico-config 8 35d
coredns 1 35d
extension-apiserver-authentication 6 35d
kube-apiserver-legacy-service-account-token-tracking 1 35d
kube-root-ca.crt 1 35d
node-local-dns 1 35d
deepseek
确认,当前检查思路完全正确且直接关联 CoreDNS 故障。没有 kube-proxy 配置,这表明该组件可能从未安装或已被删除,这正是 CoreDNS 无法访问 API 服务器的根本原因。
请检查集群的 kube-proxy 安装状态:
kubectl get pods -A | grep -i proxy
输入
[root@k8s-192-168-1-100 ~]# kubectl get pods -A | grep -i proxy
[root@k8s-192-168-1-100 ~]# kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default nginx-deployment-5499b8bbf6-8spnp 0/1 ImagePullBackOff 0 75m
kube-system calico-kube-controllers-647ddc7bfd-www6d 1/1 Running 7 (73m ago) 35d
kube-system calico-node-h4x7g 1/1 Running 0 52m
kube-system coredns-5c4d969fb-l6smz 0/1 CrashLoopBackOff 7 (4m3s ago) 21m
kube-system metrics-server-74f6d6fdd5-pt9dw 0/1 CrashLoopBackOff 14 (5m1s ago) 17h
kube-system node-local-dns-zdb22 0/1 CrashLoopBackOff 23 (30s ago) 35d
kube-system traefik-ingress-controller-jlc4g 0/1 Running 20 (5s ago) 16h
kubernetes-dashboard kubernetes-dashboard-api-8458b5cfdc-tdrsf 0/1 Error 15 (5m13s ago) 17h
kubernetes-dashboard kubernetes-dashboard-auth-5b7f84fcd6-hzcp8 1/1 Running 4 (73m ago) 35d
kubernetes-dashboard kubernetes-dashboard-kong-648658d45f-2qps7 0/1 CrashLoopBackOff 25 (2m3s ago) 8d
kubernetes-dashboard kubernetes-dashboard-metrics-scraper-84cdc96677-m5xww 1/1 Running 22 (14s ago) 35d
kubernetes-dashboard kubernetes-dashboard-web-755f6d8c86-9ks4j 1/1 Running 4 (73m ago) 35d
lygame-service lygamelobbyserverv2-56769cc57d-2gnpp 0/1 CrashLoopBackOff 13 (2m31s ago) 141m
deepseek
确认您的集群中缺少 kube-proxy 组件,这直接导致 CoreDNS 及其他组件无法正常工作。kube-proxy 负责维护节点上的网络规则,是实现 Kubernetes Service(包括您集群中 IP 为 10.68.0.1 的 API Server 服务)通信的关键组件。
如何检查 kube-proxy DaemonSet
请运行以下命令,检查 kube-proxy 的 DaemonSet 是否存在以及其状态:
kubectl get daemonset -n kube-system -l k8s-app=kube-proxy
请您执行上面的命令并反馈结果。根据结果,我们将确定是 DaemonSet 未安装,还是其 Pod 因配置等问题未能成功调度和运行,并据此找到解决方案。
输入
[root@k8s-192-168-1-100 ~]# kubectl get daemonset -n kube-system -l k8s-app=kube-proxy
No resources found in kube-system namespace.
注意,我的另外一台离线服务器k8s 单节点上面,也是相同输出。并没有故障。
deepseek
您的观察很重要,这说明在没有 kube-proxy 的情况下集群确实可以正常工作。我们需要重新审视问题。
既然您的另一个正常集群也没有 kube-proxy,那么问题可能不在 kube-proxy 上。让我们回到网络连通性的核心问题。
请检查当前节点能否访问 Kubernetes API 服务器:
curl -k https://10.68.0.1:443
输入
[root@k8s-192-168-1-100 ~]# curl -k https://10.68.0.1:443
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "Unauthorized",
"reason": "Unauthorized",
"code": 401
}[root@k8s-192-168-1-100 ~]#
deepseek
很好!这说明节点本身可以访问 Kubernetes API 服务器(返回 401 表示网络连通性正常)。问题在于 Pod 网络无法访问 API 服务器。
让我们检查 Pod 网络到节点网络的路由情况:
ip route get 10.68.0.1 from 172.20.18.67
输入
}[root@k8s-192-168-1-100 ~]# ip route get 10.68.0.1 from 172.20.18.67
RTNETLINK answers: Network is unreachable
[root@k8s-192-168-1-100 ~]# ip route
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.18.0.0/16 dev br-576d7ac025d3 proto kernel scope link src 172.18.0.1 linkdown
172.19.0.0/16 dev br-927deafb5155 proto kernel scope link src 172.19.0.1
blackhole 172.20.18.64/26 proto bird
172.20.18.67 dev caliba4f2dd07ff scope link
172.20.18.118 dev cali572b382024f scope link
172.20.18.120 dev cali62f5f0ab912 scope link
172.20.18.121 dev calif85593b14aa scope link
172.20.18.122 dev calicd0fb857d5c scope link
172.20.18.123 dev cali7c068dbd676 scope link
172.20.18.124 dev cali68be5e0bde2 scope link
172.20.18.125 dev cali64c57a3c4b8 scope link
172.20.18.126 dev caliddf63ee0f2c scope link
172.20.18.127 dev cali64051cfd375 scope link
192.168.4.0/24 dev eth0 proto kernel scope link src 192.168.1.100 metric 100
deepseek
问题确认了:从 Pod 网络 (172.20.18.67) 到 Kubernetes API 服务器 (10.68.0.1) 的网络不可达。这是 CoreDNS 无法启动的根本原因。
我们需要检查 Calico 的网络配置,特别是服务 CIDR 的路由设置。请检查 Calico 的 IP 池配置:
kubectl get ippool -o yaml
输入
[root@k8s-192-168-1-100 ~]# kubectl get ippool -o yaml
error: the server doesn't have a resource type "ippool"
请注意,另外一条单机部署的k8s集群,同样的版本,工作正常。并没有ippool,它的信息是:
[root@k8s-10-0-2-15 ~]# kubectl get daemonset -A
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-system calico-node 1 1 1 1 1 kubernetes.io/os=linux 37d
kube-system node-local-dns 1 1 1 1 1 37d
[root@k8s-10-0-2-15 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
nameserver 10.0.2.3
[root@k8s-10-0-2-15 ~]# ip route
default via 10.0.2.2 dev enp0s3 proto dhcp metric 100
10.0.2.0/24 dev enp0s3 proto kernel scope link src 10.0.2.15 metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
blackhole 172.20.100.192/26 proto bird
172.20.100.241 dev cali0764e1639a9 scope link
172.20.100.242 dev calic408eaf7f88 scope link
172.20.100.243 dev cali66b4acfa479 scope link
192.168.4.0/24 dev enp0s8 proto kernel scope link src 192.168.4.146 metric 101
[root@k8s-10-0-2-15 ~]# kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default nginx-deployment-5499b8bbf6-kwpcg 1/1 Running 16 (15m ago) 37d
kube-system calico-kube-controllers-647ddc7bfd-tznc2 1/1 Running 19 (15m ago) 37d
kube-system calico-node-g85j7 1/1 Running 17 (15m ago) 37d
kube-system coredns-5c4d969fb-xzc25 1/1 Running 16 (15m ago) 37d
kube-system metrics-server-74f6d6fdd5-7789w 1/1 Running 27 (15m ago) 37d
kube-system node-local-dns-l66x5 1/1 Running 16 (15m ago) 37d
[root@k8s-10-0-2-15 ~]# ip route get 10.68.0.1 from 172.20.18.67
RTNETLINK answers: Network is unreachable
[root@k8s-10-0-2-15 ~]# ip route
default via 10.0.2.2 dev enp0s3 proto dhcp metric 100
10.0.2.0/24 dev enp0s3 proto kernel scope link src 10.0.2.15 metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
blackhole 172.20.100.192/26 proto bird
172.20.100.241 dev cali0764e1639a9 scope link
172.20.100.242 dev calic408eaf7f88 scope link
172.20.100.243 dev cali66b4acfa479 scope link
192.168.4.0/24 dev enp0s8 proto kernel scope link src 192.168.4.146 metric 101
目前的问题,全面指向两个关键点:
1. 关闭了eth1后集群丢失了默认路由
2. 关闭了eth1手,/etc/resolv.conf丢失了dns服务器地址
我觉得,现在先设置一个默认路由,就可以解决问题了。
deepseek
您分析得很对!两个关键问题确实是:
- 丢失默认路由导致网络通信异常
- /etc/resolv.conf 为空导致 DNS 解析失败
让我们先解决默认路由问题。由于您的网络没有真正的网关,我们可以添加一个指向本地接口的默认路由作为临时解决方案:
ip route add default via 192.168.1.100 dev eth0
请执行这个命令,然后我们观察集群组件状态是否改善。
输入
我修改了路由
手动添加了dns记录
删除了故障pod,集群恢复了正常。
[root@k8s-192-168-1-100 ~]# kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default nginx-deployment-5499b8bbf6-8spnp 0/1 ImagePullBackOff 0 3h42m
kube-system calico-kube-controllers-647ddc7bfd-www6d 1/1 Running 7 (3h40m ago) 35d
kube-system calico-node-h4x7g 1/1 Running 0 3h19m
kube-system coredns-5c4d969fb-l6smz 1/1 Running 40 (9m32s ago) 168m
kube-system metrics-server-74f6d6fdd5-6sqbl 1/1 Running 0 2m18s
kube-system node-local-dns-p24lp 1/1 Running 0 2s
kube-system traefik-ingress-controller-jlc4g 1/1 Running 65 (6m47s ago) 18h
kubernetes-dashboard kubernetes-dashboard-api-8458b5cfdc-tdrsf 1/1 Running 43 (8m11s ago) 20h
kubernetes-dashboard kubernetes-dashboard-auth-5b7f84fcd6-hzcp8 1/1 Running 4 (3h40m ago) 35d
kubernetes-dashboard kubernetes-dashboard-kong-648658d45f-2qps7 1/1 Running 67 (7m32s ago) 8d
kubernetes-dashboard kubernetes-dashboard-metrics-scraper-84cdc96677-m5xww 1/1 Running 53 (7m23s ago) 35d
kubernetes-dashboard kubernetes-dashboard-web-755f6d8c86-9ks4j 1/1 Running 4 (3h40m ago) 35d
lygame-service lygamelobbyserverv2-56769cc57d-2gnpp 1/1 Running 39 (7m23s ago) 4h48m
[root@k8s-192-168-1-100 ~]# ip route
default via 192.168.1.100 dev eth0
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.18.0.0/16 dev br-576d7ac025d3 proto kernel scope link src 172.18.0.1 linkdown
172.19.0.0/16 dev br-927deafb5155 proto kernel scope link src 172.19.0.1
blackhole 172.20.18.64/26 proto bird
172.20.18.67 dev caliba4f2dd07ff scope link
172.20.18.68 dev cali2cead5907f2 scope link
172.20.18.118 dev cali572b382024f scope link
172.20.18.120 dev cali62f5f0ab912 scope link
172.20.18.121 dev calif85593b14aa scope link
172.20.18.122 dev calicd0fb857d5c scope link
172.20.18.123 dev cali7c068dbd676 scope link
172.20.18.125 dev cali64c57a3c4b8 scope link
172.20.18.126 dev caliddf63ee0f2c scope link
172.20.18.127 dev cali64051cfd375 scope link
192.168.4.0/24 dev eth0 proto kernel scope link src 192.168.1.100 metric 100
[root@k8s-192-168-1-100 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
nameserver 223.5.5.5
nameserver 223.5.5.6
deepseek
太好了!您通过以下操作成功恢复了集群:
- 添加默认路由:
default via 192.168.1.100 dev eth0 - 配置 DNS:在
/etc/resolv.conf中添加了223.5.5.5和223.5.5.6 - 重启故障 Pod:让组件重新启动
现在集群所有核心组件都处于 Running 状态,问题已解决。您的分析完全正确 - 问题的根本原因确实是禁用 eth1 后丢失了默认路由和 DNS 配置。
为了确保重启后配置持久化,建议将这些网络设置添加到系统配置中。需要我指导您如何永久保存这些网络设置吗?
很好!以下是永久保存网络配置的方法:
1. 永久配置默认路由
编辑网络接口配置文件:
vi /etc/sysconfig/network-scripts/ifcfg-eth0
在文件中添加或修改以下行:
GATEWAY=192.168.1.100
DEFROUTE=yes
如果文件不存在或需要创建,确保包含以下基本配置:
DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.1.100
NETMASK=255.255.255.0
GATEWAY=192.168.1.100
DEFROUTE=yes
2. 永久配置 DNS
编辑 resolv.conf 并设置不可变属性防止被覆盖:
# 编辑 DNS 配置
vi /etc/resolv.conf
# 确保内容如下:
nameserver 223.5.5.5
nameserver 223.5.5.6
# 设置文件为不可变,防止 NetworkManager 覆盖
chattr +i /etc/resolv.conf
3. 应用配置
重启网络服务或直接重启系统:
# 重启网络服务
systemctl restart network
# 或者重启整个系统
reboot
重启后请验证配置是否生效:
ip route show
cat /etc/resolv.conf
kubectl get pods -A
这样配置后,即使服务器重启,网络设置也会自动恢复。
直接重启服务器
[root@k8s-192-168-1-100 demo]# kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-647ddc7bfd-www6d 1/1 Running 8 (9m37s ago) 35d
kube-system calico-node-h4x7g 1/1 Running 1 (9m37s ago) 3h36m
kube-system coredns-5c4d969fb-l6smz 1/1 Running 41 (9m37s ago) 3h5m
kube-system metrics-server-74f6d6fdd5-6sqbl 1/1 Running 1 (9m37s ago) 19m
kube-system node-local-dns-p24lp 1/1 Running 1 (9m37s ago) 16m
kube-system traefik-ingress-controller-jlc4g 1/1 Running 66 (9m37s ago) 18h
kubernetes-dashboard kubernetes-dashboard-api-8458b5cfdc-tdrsf 1/1 Running 44 (9m37s ago) 20h
kubernetes-dashboard kubernetes-dashboard-auth-5b7f84fcd6-hzcp8 1/1 Running 5 (9m37s ago) 35d
kubernetes-dashboard kubernetes-dashboard-kong-648658d45f-2qps7 1/1 Running 68 (9m37s ago) 8d
kubernetes-dashboard kubernetes-dashboard-metrics-scraper-84cdc96677-m5xww 1/1 Running 54 (9m37s ago) 35d
kubernetes-dashboard kubernetes-dashboard-web-755f6d8c86-9ks4j 1/1 Running 5 (9m37s ago) 35d
lygame-service lygamelobbyserverv2-56b55fbd97-mtx5c 1/1 Running 0 2m10s
[root@k8s-192-168-1-100 demo]# cat /etc/os-release
NAME="openEuler"
VERSION="22.03 (LTS-SP4)"
ID="openEuler"
VERSION_ID="22.03"
PRETTY_NAME="openEuler 22.03 (LTS-SP4)"
ANSI_COLOR="0;31"
[root@k8s-192-168-1-100 demo]# ip route
default via 192.168.1.100 dev eth0 proto static metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.18.0.0/16 dev br-576d7ac025d3 proto kernel scope link src 172.18.0.1 linkdown
172.19.0.0/16 dev br-927deafb5155 proto kernel scope link src 172.19.0.1
blackhole 172.20.18.64/26 proto bird
172.20.18.65 dev calicd0fb857d5c scope link
172.20.18.66 dev calif85593b14aa scope link
172.20.18.69 dev cali64051cfd375 scope link
172.20.18.71 dev caliba4f2dd07ff scope link
172.20.18.72 dev cali2cead5907f2 scope link
172.20.18.73 dev cali64c57a3c4b8 scope link
172.20.18.75 dev caliddf63ee0f2c scope link
172.20.18.82 dev cali62f5f0ab912 scope link
172.20.18.91 dev califddf889b192 scope link
192.168.4.0/24 dev eth0 proto kernel scope link src 192.168.1.100 metric 100
结束。
AIOps的实现思路:
- 提供故障日志,并约定每次AI只输出一条指令
- 根据AI的指令,提供指令输出
- 如此循环,直到问题解决。
成熟后,AI可自动监控集群,并主动修复集群故障(软件因素导致的问题);硬件因素导致的问题,采用发送警报方案。

浙公网安备 33010602011771号