Troubleshooting a node-local-dns startup failure caused by a port conflict

Preface

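The current K8s cluster runs on Alibaba Cloud ACK. Because security patches had not been applied for a long time, the nodes were upgraded with the minimal set of security patches using dnf upgrade-minimal --security --allowerasing. After the patches were installed and the nodes rebooted, a batch of Pods failed to start and stayed stuck in the init phase. All of them failed while probing a dependency through its Service domain name, even though the dependency Pods themselves were healthy, and every affected Pod was on the same node. Following that lead, node-local-dns on that node turned out to be in CrashLoopBackOff: the node's own dnsmasq service was holding port 53, so node-local-dns could not start and no Pod on the node could resolve Service domain names. Stopping dnsmasq and recreating node-local-dns restored the DNS path.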

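Troubleshooting process

Symptoms

After the nodes were rebooted, several Pods that rely on an init container for readiness checks were stuck in the init phase. Their logs all showed the same failure when resolving the Service address:

nc could not resolve the Service domain name to an IP.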

# kubectl get pods -A --no-headers=true | awk '$4 != "Running"'
ingress-apisix            apisix-556b7b9c9c-cgzvp                       0/1    Init:0/1           0                9m58s
ingress-apisix            apisix-ingress-controller-5849fdcd67-s9wnb    0/1    Init:0/1           0                9m58s
kube-system               node-local-dns-x5jcn                          0/1    CrashLoopBackOff   6 (3m58s ago)    9m58s
otel-demo                 fraud-detection-5cfddbbb7b-69r96              0/1    Init:1/2           0                9m58s

# kubectl logs -f -n ingress-apisix apisix-556b7b9c9c-cgzvp -c wait-etcd
nc: bad address 'apisix-etcd.ingress-apisix.svc.cluster.local'
waiting for etcd Sun Jun 14 10:16:08 UTC 2026
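
The preface notes that every affected Pod was on the same node. One way to confirm that is to group unhealthy Pods by node. This is a sketch, not the command used in the original investigation: the function name unhealthy_pods_per_node is hypothetical, and it assumes the column layout of kubectl get pods -A -o wide --no-headers, where STATUS is the fourth field and NODE is the third field from the end.

```shell
# Count Pods that are neither Running nor Completed, grouped by node.
# Reads the output of `kubectl get pods -A -o wide --no-headers` on stdin.
# NODE is read as the third field from the end (assumed layout: ... NODE,
# NOMINATED NODE, READINESS GATES) because RESTARTS may contain spaces,
# e.g. "6 (3m58s ago)", which shifts fields counted from the left.
unhealthy_pods_per_node() {
  awk '$4 != "Running" && $4 != "Completed" { count[$(NF - 2)]++ }
       END { for (node in count) print count[node], node }' | sort -rn
}

# Usage on a machine with cluster access:
# kubectl get pods -A -o wide --no-headers | unhealthy_pods_per_node
```

If a single node accounts for all of the unhealthy Pods, the problem is more likely node-local (as it was here) than cluster-wide.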

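At the same time, node-local-dns in kube-system was also in CrashLoopBackOff. The key error in its log was:

The bind to 169.254.20.10:53 failed because port 53 was already in use, so the Pod could not start.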

# kubectl logs --tail 30 -n kube-system node-local-dns-x5jcn -c node-cache
2026/06/14 10:22:07 [INFO] Starting node-cache image: v1.22.28-1-g5f96b759
2026/06/14 10:22:07 [INFO] Using Corefile /etc/coredns/Corefile
2026/06/14 10:22:07 [INFO] Using Pidfile 
2026/06/14 10:22:07 [ERROR] Failed to read node-cache coreFile /etc/coredns/Corefile.base - open /etc/coredns/Corefile.base: no such file or directory
2026/06/14 10:22:07 [INFO] Skipping kube-dns configmap sync as no directory was specified
Listen: listen tcp 169.254.20.10:53: bind: address already in use

Troubleshooting steps

1. Find out what is holding the port

# ss -ltunp | awk 'NR == 1 || /:53 /'
Netid  State   Recv-Q  Send-Q   Local Address:Port  Peer Address:Port  Process
udp    UNCONN  0       0        0.0.0.0:53          0.0.0.0:*          users:(("dnsmasq",pid=908,fd=4))
udp    UNCONN  0       0        127.0.0.53%lo:53    0.0.0.0:*          users:(("systemd-resolve",pid=747,fd=16))
udp    UNCONN  0       0        [::]:53             [::]:*             users:(("dnsmasq",pid=908,fd=6))
tcp    LISTEN  0       32       0.0.0.0:53          0.0.0.0:*          users:(("dnsmasq",pid=908,fd=5))
tcp    LISTEN  0       32       [::]:53             [::]:*             users:(("dnsmasq",pid=908,fd=7))
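
The listing above explains the bind error: dnsmasq listens on the wildcard address 0.0.0.0:53, which covers every local IPv4 address including 169.254.20.10, while systemd-resolve only holds 127.0.0.53. The following is a minimal sketch that filters ss output for listeners that would block a given IPv4 address and port. The function name conflicting_listeners is hypothetical. It assumes the ss column layout shown above, and it leaves [::] entries out on the assumption that they are IPv6-only sockets.

```shell
# Print listeners that would block a bind to <ipv4>:<port>.
# Reads `ss -ltunp` output on stdin. A listener is reported when it uses the
# same port and is bound either to the exact IP or to a wildcard address
# (0.0.0.0, or "*" which ss prints for dual-stack sockets).
conflicting_listeners() {
  ip="$1"
  port="$2"
  awk -v ip="$ip" -v port="$port" '
    NR == 1 && $1 == "Netid" { next }         # skip the header row if present
    {
      local_addr = $5                          # e.g. 0.0.0.0:53 or 127.0.0.53%lo:53
      n = split(local_addr, parts, ":")
      if (parts[n] != port) next               # different port, no conflict
      addr = substr(local_addr, 1, length(local_addr) - length(parts[n]) - 1)
      sub(/%.*/, "", addr)                     # drop an interface suffix such as %lo
      if (addr == "0.0.0.0" || addr == "*" || addr == ip)
        print $1, local_addr, $NF              # protocol, local address, owning process
    }'
}

# Usage on the affected node:
# ss -ltunp | conflicting_listeners 169.254.20.10 53
```

Run against the listing above, only the two dnsmasq sockets on 0.0.0.0:53 are reported; the systemd-resolve socket on 127.0.0.53 is bound to a different address and is not the cause.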

2. Check the service status

# --no-pager: disable the pager so all output is printed to the terminal at once.
# systemctl status dnsmasq.service --no-pager
● dnsmasq.service - DNS caching server.
   Loaded: loaded (/usr/lib/systemd/system/dnsmasq.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2026-06-14 18:10:23 CST; 20min ago
 Main PID: 908 (dnsmasq)
    Tasks: 1 (limit: 301507)
   Memory: 1.1M
   CGroup: /system.slice/dnsmasq.service
           └─908 /usr/sbin/dnsmasq -k

Jun 14 18:10:23 demo systemd[1]: Started DNS caching server..
Jun 14 18:10:23 demo dnsmasq[908]: started, version 2.79 cachesize 150
Jun 14 18:10:23 demo dnsmasq[908]: compile time options: IPv6 GNU-getopt DBus no-i18n IDN2 DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth DNSSEC loop-detect inotify
Jun 14 18:10:23 demo dnsmasq[908]: reading /etc/resolv.conf
Jun 14 18:10:23 demo dnsmasq[908]: using nameserver 100.100.2.136#53
Jun 14 18:10:23 demo dnsmasq[908]: using nameserver 100.100.2.138#53
Jun 14 18:10:23 demo dnsmasq[908]: read /etc/hosts - 4 addresses
Jun 14 18:10:23 demo dnsmasq[908]: reading /etc/resolv.conf

# ps -ef | grep -E 'dnsmasq' | grep -v grep
dnsmasq      908       1  0 18:10 ?        00:00:01 /usr/sbin/dnsmasq -k

# cat /etc/dnsmasq.conf 2>/dev/null | grep -vE '^#|^$'
user=dnsmasq
group=dnsmasq
conf-dir=/etc/dnsmasq.d,.rpmnew,.rpmsave,.rpmorig

3. Check for dependent packages and services

# Package-level dependencies: which installed packages require dnsmasq.
# rpm -q --whatrequires dnsmasq
podman-plugins-4.9.4-30.0.1.al8.x86_64

# rpm -e --test dnsmasq
error: Failed dependencies:
        dnsmasq is needed by (installed) podman-plugins-4:4.9.4-30.0.1.al8.x86_64

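# list-dependencies: show a unit's dependency tree (the other units that are started along with it).
# --reverse: invert the direction (show which units pull this one in).
# systemctl list-dependencies --reverse dnsmasq.service
# No workload service depends on dnsmasq; only the boot targets pull it in: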
dnsmasq.service
● └─multi-user.target
●   └─graphical.target

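4. Check whether podman is in use

Listing the network namespaces and the podman containers shows that podman is not actually in use on this node: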

# ip netns list
cni-b02c3153-bf31-4853-98e4-55b35601ddab (id: 34)
cni-ebd975c4-9cfb-cf8d-6770-61bc927c95b0 (id: 32)
cni-f922a848-3354-d728-b591-b520f16563ca (id: 31)
cni-43ad1682-7b4e-8e2d-040e-37ad15faeb26 (id: 29)
cni-9c90f210-b117-fedd-3a31-e8c19b2a62bc (id: 24)
cni-1e4af947-9ae4-2121-e7c7-e33c693afc29 (id: 17)
cni-e3083ba9-eaab-184e-66c1-b5844db5150c (id: 14)
cni-4bad26ea-98ff-2b9f-fcaf-6ff1715a73f5 (id: 35)
cni-cf6c21e4-ac38-568c-95fc-e6516f533ee6 (id: 33)
cni-65a9eac4-35e4-f92f-6274-b013fcc53950 (id: 30)
cni-b6cf735c-c673-7844-7657-9a81ca61791b (id: 28)
cni-843d7b4d-645c-eb0a-11bc-24a4ad63d7c9 (id: 27)
cni-4c325f61-ab33-e1d8-bb5b-914b768c1117 (id: 26)
cni-4e34f0d5-2301-26a9-af12-0e2c5aab63f5 (id: 25)
cni-0a199a01-ae53-78e9-4223-36a7b155e393 (id: 23)
cni-cf57e136-4b36-53ae-0ceb-eb9ddab747ba (id: 22)
cni-5cc5dcae-5eed-d160-6131-b0acc80d94df (id: 21)
cni-659cecde-79d6-c648-e2ee-a3c1811920de (id: 20)
cni-470e6b3e-72bf-4bce-299a-d1cb21ab1ac6 (id: 19)
cni-64eeaa55-434c-e068-8f7d-6ac2452d4d26 (id: 18)
cni-85e81b3a-5c4f-53d1-859f-d8dc4744d996 (id: 16)
cni-dd81080f-043a-5d47-baf5-535acc722ea9 (id: 15)
cni-48bfaac7-95f9-5f97-fbf1-8236e139c9c7 (id: 13)
cni-882f277c-42e5-8857-fc83-4f96cefe6114 (id: 12)
cni-b06b1277-c5ef-6ffd-0e24-5c682d9f5e5f (id: 11)
cni-28cb03ef-ae29-5e85-6614-27f1499fee5d (id: 10)
cni-5ff74618-8cba-9f7b-f798-970ce894676f (id: 9)
cni-110bad94-c71d-ed67-6a9a-ec47c8d02c37 (id: 8)
cni-91346644-aa50-9575-08b0-e87e39f5722c (id: 7)
cni-9f8abd28-bc9c-ea8e-f6e5-9d50816d335c (id: 6)
cni-9f8da4fb-4c2f-5189-fd65-0339d6df1bec (id: 5)
cni-7de7f396-1b43-e950-279a-709d392ae04c (id: 4)
cni-49e826ef-6bb6-a5bc-0a7a-bf0cf496d081 (id: 3)
cni-1b38d6dc-f8ed-f711-171d-724cb67e24ac (id: 2)
cni-c31945d2-ed1d-c8b4-5a5c-1093d173e2a0 (id: 1)
cni-567ec094-363f-8d32-ed70-be11776de6f1 (id: 0)

# podman ps -a
CONTAINER ID  IMAGE       COMMAND     CREATED     STATUS      PORTS       NAMES

5. Stop the dnsmasq service

# systemctl stop dnsmasq
# systemctl disable dnsmasq
Removed /etc/systemd/system/multi-user.target.wants/dnsmasq.service.

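# Mask dnsmasq (symlink the unit to /dev/null) so that nothing can start it. The RPM package stays installed, so package dependencies are not broken.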
# systemctl mask dnsmasq
Created symlink /etc/systemd/system/dnsmasq.service → /dev/null.

# To restore the service later if it is needed:
# systemctl unmask dnsmasq
# systemctl enable --now dnsmasq
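
The preface mentions recreating node-local-dns once dnsmasq is stopped, but the commands for that step are not shown. A minimal sketch follows. The function name recreate_node_local_dns is hypothetical, and the label k8s-app=node-local-dns is an assumption that should be checked against the DaemonSet selector in the cluster.

```shell
# Delete the node-local-dns Pod(s) on one node so that the DaemonSet
# controller recreates them, now that port 53 is free.
# Assumption: the Pods carry the label k8s-app=node-local-dns.
recreate_node_local_dns() {
  node="$1"
  kubectl -n kube-system delete pod \
    -l k8s-app=node-local-dns \
    --field-selector "spec.nodeName=${node}"
}

# Usage, with the Kubernetes node name of the affected host:
# recreate_node_local_dns <node-name>
# kubectl -n kube-system get pods -l k8s-app=node-local-dns -o wide
```

Deleting the crashing Pod by name (node-local-dns-x5jcn in the listing earlier) has the same effect. Once the new Pod is Running, the Pods stuck in Init should get past their checks on the next retry.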
posted @ 2026-06-16 11:47  怎么还在写代码