00 - Troubleshooting Techniques, Part 1
I. Troubleshooting with describe
1. Find the Pod name
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
xiuxian-xixi 0/1 ImagePullBackOff 0 4s 10.100.140.71 worker233 <none> <none>
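2. Get detailed information about the Pod
The Events section at the end of the output usually points to the cause of the failure. Here the image tag v5 does not exist in the registry ("manifest unknown"), so the pull fails and the kubelet backs off.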
[root@master231 pods]# kubectl describe pod xiuxian-xixi
Name: xiuxian-xixi
Namespace: default
Priority: 0
Node: worker233/10.0.0.233
Start Time: Mon, 07 Apr 2025 16:13:12 +0800
Labels: <none>
Annotations: cni.projectcalico.org/containerID: 155fdcb306ff03f86652723716ccca7546f883c0a346428fa7220ddf293704cb
cni.projectcalico.org/podIP: 10.100.140.71/32
cni.projectcalico.org/podIPs: 10.100.140.71/32
Status: Pending
IP: 10.100.140.71
IPs:
IP: 10.100.140.71
Containers:
xiuxian:
Container ID:
Image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v5
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: ImagePullBackOff
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bf6j2 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-bf6j2:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal BackOff 24s (x2 over 25s) kubelet Back-off pulling image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v5"
Warning Failed 24s (x2 over 25s) kubelet Error: ImagePullBackOff
Normal Pulling 10s (x2 over 26s) kubelet Pulling image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v5"
Warning Failed 9s (x2 over 25s) kubelet Failed to pull image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v5": rpc error: code = Unknown desc = Error response from daemon: manifest for registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v5 not found: manifest unknown: manifest unknown
Warning Failed 9s (x2 over 25s) kubelet Error: ErrImagePull
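When the Events table is long, it can help to keep only the Warning rows. The sketch below assumes the output of kubectl describe pod is piped in or saved; warning_events is a hypothetical helper name, and the sample rows are abridged from the output above so that the script runs without a cluster. On a live cluster, kubectl get events --field-selector type=Warning gives a similar view.

```shell
#!/bin/sh
# warning_events (hypothetical helper): read `kubectl describe pod` output on
# stdin and print only the event rows whose Type column is "Warning".
warning_events() {
  awk '
    /^Events:/ { in_events = 1; next }   # rows after this header are events
    in_events && $1 == "Warning" { print }
  '
}

# Sample rows abridged from the describe output above (image name shortened).
sample='Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal BackOff 24s (x2 over 25s) kubelet Back-off pulling image "apps:v5"
Warning Failed 24s (x2 over 25s) kubelet Error: ImagePullBackOff
Warning Failed 9s (x2 over 25s) kubelet Error: ErrImagePull'

printf '%s\n' "$sample" | warning_events
```

Against a real Pod the same helper would be used as: kubectl describe pod xiuxian-xixi | warning_events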
II. Troubleshooting with logs
1. Follow the logs produced in the last minute
[root@master231 pods]# kubectl logs -f xiuxian-xixi --since=1m
10.0.0.231 - - [07/Apr/2025:08:17:30 +0000] "GET / HTTP/1.1" 200 357 "-" "curl/7.81.0" "-"
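2. View the logs of the previous container instance, from before the last restart (this only works while that terminated container still exists)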
[root@master231 pods]# kubectl logs -f xiuxian-xixi -p
III. Troubleshooting with exec
1. Run a single command in the container without opening a shell
[root@master231 pods]# kubectl exec xiuxian-xixi -- ifconfig
eth0 Link encap:Ethernet HWaddr 02:B1:30:3E:42:FF
inet addr:10.100.140.72 Bcast:0.0.0.0 Mask:255.255.255.255
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:37 errors:0 dropped:0 overruns:0 frame:0
TX packets:18 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:3657 (3.5 KiB) TX bytes:2614 (2.5 KiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
2. Open an interactive shell in a running container
[root@master231 pods]# kubectl exec -it xiuxian-xixi -- sh
/ # ifconfig
eth0 Link encap:Ethernet HWaddr 02:B1:30:3E:42:FF
inet addr:10.100.140.72 Bcast:0.0.0.0 Mask:255.255.255.255
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:37 errors:0 dropped:0 overruns:0 frame:0
TX packets:18 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:3657 (3.5 KiB) TX bytes:2614 (2.5 KiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
/ #
IV. Troubleshooting in practice
1. Reproduce the failure
Run a Pod with two containers that both start an nginx service.
# Write the manifest
[root@master231 pods]# cat 02-pods-multiple-xiuxian.yaml
apiVersion: v1
kind: Pod
metadata:
  name: xiuxian-haha
spec:
  nodeName: worker233
  containers:
  - image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1
    name: c1
  - image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v2
    name: c2
# Create the Pod
[root@master231 pods]# kubectl apply -f 02-pods-multiple-xiuxian.yaml
pod/xiuxian-haha created
# Check the status
# The Pod looks healthy at first; a few seconds later it reports Error
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
xiuxian-haha 2/2 Running 1 (2s ago) 6s 10.100.140.73 worker233 <none> <none>
[root@master231 pods]#
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
xiuxian-haha 1/2 Error 1 (9s ago) 13s 10.100.140.73 worker233 <none> <none>
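With many Pods in a namespace, the ones that are not fully ready can be picked out by comparing the two numbers in the READY column. This is a minimal sketch: not_ready_pods is a hypothetical helper name, the first sample row is taken from the output above, and the second row is made up for contrast.

```shell
#!/bin/sh
# not_ready_pods (hypothetical helper): read `kubectl get pods` output on stdin
# and print NAME, READY and STATUS for every Pod whose READY column is n/m with n < m.
not_ready_pods() {
  awk 'NR > 1 {
    split($2, r, "/")                 # READY looks like "1/2"
    if (r[1] + 0 < r[2] + 0) print $1, $2, $3
  }'
}

# First data row is from the output above; the second is invented for contrast.
sample='NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
xiuxian-haha 1/2 Error 1 (9s ago) 13s 10.100.140.73 worker233 <none> <none>
healthy-demo 1/1 Running 0 5m 10.100.140.99 worker233 <none> <none>'

printf '%s\n' "$sample" | not_ready_pods
```

Against a real cluster the same helper would be used as: kubectl get pods -o wide | not_ready_pods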
2. Inspect the details with describe
[root@master231 pods]# kubectl describe pod xiuxian-haha
Name: xiuxian-haha
Namespace: default
Priority: 0
Node: worker233/10.0.0.233
Start Time: Mon, 07 Apr 2025 16:26:17 +0800
Labels: <none>
Annotations: cni.projectcalico.org/containerID: 8cddce2ce2ae3ac208fb0c425a07792ec77f55ddc85c609beda975335888e1e6
cni.projectcalico.org/podIP: 10.100.140.73/32
cni.projectcalico.org/podIPs: 10.100.140.73/32
Status: Running
IP: 10.100.140.73
IPs:
IP: 10.100.140.73
Containers:
c1:
Container ID: docker://5c5eee925282946e60a2897402118408ba2d042a5d99cc0879c08f509cd97df4
Image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1
Image ID: docker-pullable://registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps@sha256:3bee216f250cfd2dbda1744d6849e27118845b8f4d55dda3ca3c6c1227cc2e5c
Port: <none>
Host Port: <none>
State: Running
Started: Mon, 07 Apr 2025 16:26:17 +0800
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8mbhb (ro)
c2:
Container ID: docker://45b7828506836596bd6794ecafbd79e1d445e5728fc7c5290351be64f24c9cae
Image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v2
Image ID: docker-pullable://registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps@sha256:3ac38ee6161e11f2341eda32be95dcc6746f587880f923d2d24a54c3a525227e
Port: <none>
Host Port: <none>
State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 07 Apr 2025 16:27:06 +0800
Finished: Mon, 07 Apr 2025 16:27:09 +0800
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 07 Apr 2025 16:26:37 +0800
Finished: Mon, 07 Apr 2025 16:26:40 +0800
Ready: False
Restart Count: 3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8mbhb (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-8mbhb:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 65s kubelet Container image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1" already present on machine
Normal Created 65s kubelet Created container c1
Normal Started 65s kubelet Started container c1
Normal Pulled 16s (x4 over 65s) kubelet Container image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v2" already present on machine
Normal Created 16s (x4 over 65s) kubelet Created container c2
Normal Started 16s (x4 over 65s) kubelet Started container c2
Warning BackOff 12s (x4 over 59s) kubelet Back-off restarting failed container
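3. Analysis
- 1. c1 starts normally and c2 does not, even though each container runs fine when started on its own.
- 2. Both containers run nginx, and all containers in a Pod share one network namespace, so both try to listen on port 80. A port conflict is therefore the most likely cause. (If you do not know which processes these images run, keep reading.)
4. Technique: override the container's startup command
4.1 Override the startup command in the manifest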
[root@master231 pods]# cat 02-pods-multiple-xiuxian.yaml
apiVersion: v1
kind: Pod
metadata:
  name: xiuxian-haha
spec:
  nodeName: worker233
  containers:
  - image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1
    name: c1
  - image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v2
    name: c2
    # command overrides the ENTRYPOINT instruction of the image's Dockerfile.
    # command: ["tail","-f","/etc/hosts"]
    # args overrides the CMD instruction of the image's Dockerfile.
    # args: ["sleep","3600"]
    # When command and args are used together, args is passed to command as its arguments.
    command:
    - tail
    args:
    - -f
    - /etc/hosts
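The interaction between command, args, ENTRYPOINT and CMD described in the manifest comments can be sketched as a small function. This is an illustration of the documented Kubernetes rules, not kubelet code: effective_argv is a hypothetical helper, the ENTRYPOINT and CMD values are placeholders rather than the real settings of the apps image, and arguments are treated as space-separated strings for simplicity.

```shell
#!/bin/sh
# effective_argv (hypothetical helper): print the command line a container
# ends up running, given the image ENTRYPOINT and CMD and the Pod's command and args.
# Each argument is a space-separated string; an empty string means "not set".
effective_argv() {
  entrypoint=$1
  image_cmd=$2
  pod_command=$3
  pod_args=$4
  if [ -n "$pod_command" ]; then
    # command replaces ENTRYPOINT, and the image CMD is ignored.
    set -- $pod_command $pod_args
  elif [ -n "$pod_args" ]; then
    # args alone replaces CMD but keeps the image ENTRYPOINT.
    set -- $entrypoint $pod_args
  else
    # Nothing overridden: the image defaults apply.
    set -- $entrypoint $image_cmd
  fi
  printf '%s\n' "$*"
}

# Placeholder image settings (assumed, not read from the real image).
ep="/entrypoint.sh"
cmd="serve --port 80"

effective_argv "$ep" "$cmd" "" ""                     # image defaults
effective_argv "$ep" "$cmd" "tail" "-f /etc/hosts"    # command + args, as in the manifest
effective_argv "$ep" "$cmd" "" "sleep 3600"           # args only: ENTRYPOINT is kept
```

The last call shows why args on its own may not do what is intended: if the image defines an ENTRYPOINT, the args are handed to it instead of being run as a command.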
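4.2 Run the Pod
command and args cannot be changed on an existing Pod, so the earlier xiuxian-haha Pod has to be deleted before the manifest is created again.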
[root@master231 pods]# kubectl create -f 02-pods-multiple-xiuxian.yaml
pod/xiuxian-haha created
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
xiuxian-haha 2/2 Running 0 4s 10.100.140.76 worker233 <none> <none>
4.3 Enter the containers and check ports and processes
Container 1 (c1)
[root@master231 pods]# kubectl exec -it xiuxian-haha -c c1 -- sh
/ #
/ # netstat -untalp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 1/nginx: master pro
tcp 0 0 :::80 :::* LISTEN 1/nginx: master pro
/ #
/ # ps -ef
PID USER TIME COMMAND
1 root 0:00 nginx: master process nginx -g daemon off;
32 nginx 0:00 nginx: worker process
33 nginx 0:00 nginx: worker process
34 root 0:00 sh
41 root 0:00 ps -ef
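Container 2 (c2). Port 80 already shows as LISTEN here with no owning PID ("-"): the socket belongs to nginx in c1, which shares the Pod's network namespace but is not visible in c2's process list.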
[root@master231 pods]# kubectl exec -it xiuxian-haha -c c2 -- sh
/ #
/ # netstat -untalp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN -
tcp 0 0 :::80 :::* LISTEN -
/ #
/ # ps -ef
PID USER TIME COMMAND
1 root 0:00 tail -f /etc/hosts
7 root 0:00 sh
14 root 0:00 ps -ef
4.4 Change the service port of container 2 and start nginx by hand
[root@master231 pods]# kubectl exec -it xiuxian-haha -c c2 -- sh
/ # sed -i '/listen/s#80#81#g' /etc/nginx/conf.d/default.conf
/ #
/ # grep listen /etc/nginx/conf.d/default.conf
listen 81;
# proxy the PHP scripts to Apache listening on 127.0.0.1:81
# pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
/ #
/ # nginx
2025/04/07 08:43:28 [notice] 22#22: using the "epoll" event method
2025/04/07 08:43:28 [notice] 22#22: nginx/1.20.1
2025/04/07 08:43:28 [notice] 22#22: built by gcc 10.2.1 20201203 (Alpine 10.2.1_pre1)
2025/04/07 08:43:28 [notice] 22#22: OS: Linux 5.15.0-119-generic
2025/04/07 08:43:28 [notice] 22#22: getrlimit(RLIMIT_NOFILE): 524288:524288
/ # 2025/04/07 08:43:28 [notice] 23#23: start worker processes
2025/04/07 08:43:28 [notice] 23#23: start worker process 24
2025/04/07 08:43:28 [notice] 23#23: start worker process 25
/ #
/ #
/ # ps -ef
PID USER TIME COMMAND
1 root 0:00 tail -f /etc/hosts
7 root 0:00 sh
23 root 0:00 nginx: master process nginx
24 nginx 0:00 nginx: worker process
25 nginx 0:00 nginx: worker process
26 root 0:00 ps -ef
/ #
/ # netstat -untalp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:81 0.0.0.0:* LISTEN 23/nginx: master pr
tcp 0 0 :::80 :::* LISTEN -
/ #
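The sed expression used in 4.4 rewrites every "80" on any line that contains "listen", which is why the comment about Apache now reads 127.0.0.1:81 as well. A narrower pattern that touches only listen directives might look like the sketch below. The config excerpt is made up for illustration and is not the real default.conf; the result is written to a second temporary file so the original stays untouched.

```shell
#!/bin/sh
# Made-up excerpt standing in for /etc/nginx/conf.d/default.conf.
conf=$(mktemp)
cat > "$conf" <<'EOF'
server {
    listen       80;
    # proxy the PHP scripts to Apache listening on 127.0.0.1:80
}
EOF

# Rewrite the port only on lines that are listen directives:
# optional indentation, the word "listen", whitespace, then exactly "80;".
# In the replacement, \1 is the captured prefix and "81;" is the new port.
sed -E 's/^([[:space:]]*listen[[:space:]]+)80;/\181;/' "$conf" > "$conf.new"

cat "$conf.new"
```

Only the directive changes; the comment line keeps its original port, so the file still documents what it did before.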
4.5 Check the Pod status
[root@master231 pods]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
xiuxian-haha 2/2 Running 0 4m59s 10.100.140.76 worker233 <none> <none>
4.6 Test access
[root@master231 pods]# curl 10.100.140.76:80
[root@master231 pods]# curl 10.100.140.76:81
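V. Troubleshooting with explain
1. explain
kubectl explain prints the documentation for a given field of a resource manifest. It is useful when writing manifests or looking up what a field means.
Syntax
kubectl explain <type>.<fieldName>[.<fieldName>]
2. Common data types
<string>
A string value. It may be wrapped in double quotes.
The quotes can usually be omitted, but values that YAML would otherwise read as another type must be quoted: anything made up only of digits, and words such as "yes", "no", "true", "false".
<Object>
The field has child fields, so a plain value cannot be written after it. If you do not want to set any child field, use "{}" as a placeholder.
<map[string]string>
A Go map type: key-value pairs where both the key and the value are strings.
In most cases the keys and values are chosen freely by the user.
<boolean>
A boolean. The only valid values are true and false.
<[]Object>
An array of objects. Several sibling objects can be defined, each introduced by "-".
To address a particular element later, use its index: 0, 1, 2, ...
<[]string>
An array of strings: several strings listed side by side.
<integer>
An integer: the value must be a whole number. Most fields expect a non-negative value, although the API itself defines these fields as signed Go integers (int32 or int64), not "uint64".
-required-
A mandatory field that cannot be omitted.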
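To see how these types look in an actual manifest, the sketch below writes a small Pod definition to a temporary file and annotates each field with the type that kubectl explain reports for it. The Pod name, label and environment variable are made up for illustration; the field names themselves are standard Pod fields.

```shell
#!/bin/sh
# Write a small Pod manifest that uses each of the types listed above.
# The Pod name, label and env values are made up for illustration.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
apiVersion: v1                      # <string>
kind: Pod
metadata:                           # <Object>: has child fields
  name: explain-demo
  labels:                           # <map[string]string>: free-form keys and values
    app: demo
spec:
  hostNetwork: false                # <boolean>
  terminationGracePeriodSeconds: 30 # <integer>
  containers:                       # <[]Object>: each "-" starts one element
  - name: c1                        # -required- for every container
    image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1
    command: ["tail", "-f", "/etc/hosts"]   # <[]string>
    env:
    - name: ENABLED
      value: "true"                 # <string>: quoted so YAML does not read it as a boolean
EOF

# Show the line that sets a quoted string value.
grep -n 'value:' "$manifest"
```

Without the quotes around "true", YAML would parse the value as a boolean and the API server would reject it for a field that expects a string.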
3. Examples
[root@master231 service]# kubectl explain svc
[root@master231 service]# kubectl explain svc.spec.ports
[root@master231 service]# kubectl explain po.spec.containers
[root@master231 service]# kubectl explain po.spec.containers.env
This article comes from cnblogs (博客园). Author: 丁志岩. When reposting, please cite the original link: https://www.cnblogs.com/dezyan/p/18813806
