00-Troubleshooting Techniques 1

I. Troubleshooting with describe

1. Find the Pod name

[root@master231 pods]# kubectl get pods -o wide
NAME           READY   STATUS             RESTARTS   AGE   IP              NODE        NOMINATED NODE   READINESS GATES
xiuxian-xixi   0/1     ImagePullBackOff   0          4s    10.100.140.71   worker233   <none>           <none>

2. Get detailed information about the Pod

The Events section at the end of the output is usually the quickest way to diagnose the error.

[root@master231 pods]# kubectl describe pod xiuxian-xixi 
Name:         xiuxian-xixi
Namespace:    default
Priority:     0
Node:         worker233/10.0.0.233
Start Time:   Mon, 07 Apr 2025 16:13:12 +0800
Labels:       <none>
Annotations:  cni.projectcalico.org/containerID: 155fdcb306ff03f86652723716ccca7546f883c0a346428fa7220ddf293704cb
              cni.projectcalico.org/podIP: 10.100.140.71/32
              cni.projectcalico.org/podIPs: 10.100.140.71/32
Status:       Pending
IP:           10.100.140.71
IPs:
  IP:  10.100.140.71
Containers:
  xiuxian:
    Container ID:   
    Image:          registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v5
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bf6j2 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-bf6j2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                From     Message
  ----     ------   ----               ----     -------
  Normal   BackOff  24s (x2 over 25s)  kubelet  Back-off pulling image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v5"
  Warning  Failed   24s (x2 over 25s)  kubelet  Error: ImagePullBackOff
  Normal   Pulling  10s (x2 over 26s)  kubelet  Pulling image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v5"
  Warning  Failed   9s (x2 over 25s)   kubelet  Failed to pull image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v5": rpc error: code = Unknown desc = Error response from daemon: manifest for registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v5 not found: manifest unknown: manifest unknown
  Warning  Failed   9s (x2 over 25s)   kubelet  Error: ErrImagePull
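
When the Events table is long, it can help to print only the Warning rows. The following is a minimal sketch, not part of the original session: `warning_events` is a hypothetical helper name, and it assumes each event row starts with its type (`Normal` or `Warning`) as shown in the output above.

```shell
# Hypothetical helper: keep only the Warning rows of `kubectl describe` output.
# Assumes the event type is the first whitespace-separated field of each row.
warning_events() {
  awk '$1 == "Warning"'
}

# Usage against a live cluster (needs access to the Pod):
#   kubectl describe pod xiuxian-xixi | warning_events
```

An alternative that avoids text filtering is `kubectl get events --field-selector involvedObject.name=xiuxian-xixi`, which lists only the events that belong to that object.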

II. Troubleshooting with logs

1. Follow the logs produced in the last minute

[root@master231 pods]# kubectl logs -f xiuxian-xixi --since=1m
10.0.0.231 - - [07/Apr/2025:08:17:30 +0000] "GET / HTTP/1.1" 200 357 "-" "curl/7.81.0" "-"

2. View the logs from before the container's last restart (this only works if the previous container instance still exists)

[root@master231 pods]# kubectl logs -f xiuxian-xixi  -p
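
For a Pod with more than one container, `kubectl logs` also needs `-c` to select the container. Below is a small wrapper sketch. The function name `prev_logs` and the `KUBECTL` variable are assumptions introduced here (the variable exists only so the command can be dry-run with `echo`); `-c`, `--previous` and `--tail` are standard `kubectl logs` options.

```shell
# KUBECTL is an assumption of this sketch: set KUBECTL=echo to print the
# command instead of running it.
KUBECTL="${KUBECTL:-kubectl}"

# prev_logs POD CONTAINER [LINES]
# Show the last LINES (default 50) log lines of the previous instance of
# CONTAINER, i.e. the one that ran before the most recent restart.
prev_logs() {
  "$KUBECTL" logs "$1" -c "$2" --previous --tail="${3:-50}"
}
```

For example, `prev_logs xiuxian-haha c2 20` would target the c2 container of the two-container Pod used later in this post.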

III. Troubleshooting with exec

1. Run a command in the container from outside

[root@master231 pods]# kubectl exec xiuxian-xixi -- ifconfig
eth0      Link encap:Ethernet  HWaddr 02:B1:30:3E:42:FF  
          inet addr:10.100.140.72  Bcast:0.0.0.0  Mask:255.255.255.255
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:37 errors:0 dropped:0 overruns:0 frame:0
          TX packets:18 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:3657 (3.5 KiB)  TX bytes:2614 (2.5 KiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

2. Open an interactive shell in a running container

[root@master231 pods]# kubectl exec -it xiuxian-xixi -- sh
/ # ifconfig 
eth0      Link encap:Ethernet  HWaddr 02:B1:30:3E:42:FF  
          inet addr:10.100.140.72  Bcast:0.0.0.0  Mask:255.255.255.255
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:37 errors:0 dropped:0 overruns:0 frame:0
          TX packets:18 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:3657 (3.5 KiB)  TX bytes:2614 (2.5 KiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

/ # 

IV. Hands-on troubleshooting

1. Reproduce the failure

Run a Pod with two containers that both run an nginx service.

# Write the manifest
[root@master231 pods]# cat 02-pods-multiple-xiuxian.yaml
apiVersion: v1
kind: Pod
metadata:
  name: xiuxian-haha
spec:
  nodeName: worker233
  containers:
  - image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1
    name: c1
  - image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v2
    name: c2
    
# Run the Pod
[root@master231 pods]# kubectl apply -f 02-pods-multiple-xiuxian.yaml 
pod/xiuxian-haha created

# Check the status
# It looks healthy at first, then switches to Error after a few seconds
[root@master231 pods]# kubectl get pods -o wide
NAME           READY   STATUS    RESTARTS     AGE   IP              NODE        NOMINATED NODE   READINESS GATES
xiuxian-haha   2/2     Running   1 (2s ago)   6s    10.100.140.73   worker233   <none>           <none>
[root@master231 pods]# 
[root@master231 pods]# kubectl get pods -o wide
NAME           READY   STATUS   RESTARTS     AGE   IP              NODE        NOMINATED NODE   READINESS GATES
xiuxian-haha   1/2     Error    1 (9s ago)   13s   10.100.140.73   worker233   <none>           <none>

2. Inspect the details with describe

[root@master231 pods]# kubectl describe pod xiuxian-haha 
Name:         xiuxian-haha
Namespace:    default
Priority:     0
Node:         worker233/10.0.0.233
Start Time:   Mon, 07 Apr 2025 16:26:17 +0800
Labels:       <none>
Annotations:  cni.projectcalico.org/containerID: 8cddce2ce2ae3ac208fb0c425a07792ec77f55ddc85c609beda975335888e1e6
              cni.projectcalico.org/podIP: 10.100.140.73/32
              cni.projectcalico.org/podIPs: 10.100.140.73/32
Status:       Running
IP:           10.100.140.73
IPs:
  IP:  10.100.140.73
Containers:
  c1:
    Container ID:   docker://5c5eee925282946e60a2897402118408ba2d042a5d99cc0879c08f509cd97df4
    Image:          registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1
    Image ID:       docker-pullable://registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps@sha256:3bee216f250cfd2dbda1744d6849e27118845b8f4d55dda3ca3c6c1227cc2e5c
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Mon, 07 Apr 2025 16:26:17 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8mbhb (ro)
  c2:
    Container ID:   docker://45b7828506836596bd6794ecafbd79e1d445e5728fc7c5290351be64f24c9cae
    Image:          registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v2
    Image ID:       docker-pullable://registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps@sha256:3ac38ee6161e11f2341eda32be95dcc6746f587880f923d2d24a54c3a525227e
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 07 Apr 2025 16:27:06 +0800
      Finished:     Mon, 07 Apr 2025 16:27:09 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 07 Apr 2025 16:26:37 +0800
      Finished:     Mon, 07 Apr 2025 16:26:40 +0800
    Ready:          False
    Restart Count:  3
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8mbhb (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-8mbhb:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                From     Message
  ----     ------   ----               ----     -------
  Normal   Pulled   65s                kubelet  Container image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1" already present on machine
  Normal   Created  65s                kubelet  Created container c1
  Normal   Started  65s                kubelet  Started container c1
  Normal   Pulled   16s (x4 over 65s)  kubelet  Container image "registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v2" already present on machine
  Normal   Created  16s (x4 over 65s)  kubelet  Created container c2
  Normal   Started  16s (x4 over 65s)  kubelet  Started container c2
  Warning  BackOff  12s (x4 over 59s)  kubelet  Back-off restarting failed container

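3. Analysis

- 1. c1 starts normally while c2 keeps failing, even though each of the two containers runs fine when started on its own;
- 2. Both containers run nginx, and the containers of a Pod share one network namespace, so the most likely cause is a port conflict: both try to listen on port 80. (If you do not know which process each image runs, read on.)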
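
The port-conflict hypothesis can be checked rather than assumed by reading the log of the crashed c2 instance. This is a sketch under assumptions: `has_bind_error` is a hypothetical helper, and the pattern assumes nginx reports the failure with its usual `bind() to <address> failed` message. The rest of the line may be worded differently from image to image, so it is not matched.

```shell
# Hypothetical helper: succeed if the log text on stdin contains an nginx
# bind failure. Only the "bind() to ... failed" part is matched, because the
# trailing error text can vary with the image's C library.
has_bind_error() {
  grep -q 'bind() to .* failed'
}

# Usage against a live cluster:
#   kubectl logs xiuxian-haha -c c2 --previous | has_bind_error && echo "port conflict"
```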

4. Troubleshooting example: overriding the container's startup command

4.1 Override the startup command in the manifest

[root@master231 pods]# cat 02-pods-multiple-xiuxian.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: xiuxian-haha
spec:
  nodeName: worker233
  containers:
  - image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1
    name: c1
  - image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v2
    name: c2
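    # command overrides the container's startup command; it corresponds to the Dockerfile ENTRYPOINT instruction.
    # command: ["tail","-f","/etc/hosts"]
    # args overrides the startup arguments; it corresponds to the Dockerfile CMD instruction.
    # args: ["sleep","3600"]
    # When command and args are used together, args is passed to command as its arguments.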
    command:
    - tail
    args:
    - -f
    - /etc/hosts
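
How `command` and `args` combine with the image's ENTRYPOINT and CMD follows four rules, which the shell sketch below models. `effective_cmd` is a hypothetical helper written for illustration, not a kubectl feature. It takes the image ENTRYPOINT, the image CMD, the Pod `command` and the Pod `args` as four strings, where an empty string means "not set". The `/entry.sh` and `nginx` values are made-up image defaults, not the real ones of the image used here.

```shell
# effective_cmd ENTRYPOINT CMD COMMAND ARGS
# Prints the command line the container would run. Empty string = not set.
effective_cmd() {
  if [ -n "$3" ]; then
    # command set: it replaces ENTRYPOINT, and the image CMD is ignored too
    printf '%s\n' "$3${4:+ $4}"
  elif [ -n "$4" ]; then
    # only args set: the image ENTRYPOINT runs with the new arguments
    printf '%s\n' "${1:+$1 }$4"
  else
    # neither set: the image defaults apply
    printf '%s\n' "$1${1:+${2:+ }}$2"
  fi
}

# The manifest above sets command=tail and args="-f /etc/hosts":
effective_cmd "/entry.sh" "nginx" "tail" "-f /etc/hosts"   # prints: tail -f /etc/hosts
```

This is why c2 stops crashing once the override is in place: nginx never starts in that container, so nothing in c2 tries to bind port 80.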

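4.2 Run the Pod (delete the failing Pod from step 1 first; a second Pod with the same name cannot be created)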

[root@master231 pods]# kubectl create -f 02-pods-multiple-xiuxian.yaml 
pod/xiuxian-haha created

[root@master231 pods]# kubectl get pods -o wide
NAME           READY   STATUS    RESTARTS   AGE   IP              NODE        NOMINATED NODE   READINESS GATES
xiuxian-haha   2/2     Running   0          4s    10.100.140.76   worker233   <none>           <none>

4.3 Enter each container and check its ports and processes

Container 1

[root@master231 pods]# kubectl exec -it xiuxian-haha -c c1 -- sh
/ # 
/ # netstat -untalp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      1/nginx: master pro
tcp        0      0 :::80                   :::*                    LISTEN      1/nginx: master pro
/ # 
/ # ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 nginx: master process nginx -g daemon off;
   32 nginx     0:00 nginx: worker process
   33 nginx     0:00 nginx: worker process
   34 root      0:00 sh
   41 root      0:00 ps -ef

Container 2

[root@master231 pods]# kubectl exec -it xiuxian-haha -c c2 -- sh
/ # 
/ # netstat -untalp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      -
tcp        0      0 :::80                   :::*                    LISTEN      -
/ # 
/ # ps -ef 
PID   USER     TIME  COMMAND
    1 root      0:00 tail -f /etc/hosts
    7 root      0:00 sh
   14 root      0:00 ps -ef
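
In c2 the port 80 sockets are listed with `-` in the PID/Program name column. The containers of a Pod share a network namespace, so c2 sees the socket, but by default they do not share a PID namespace, so the owning nginx process in c1 is invisible from c2. A minimal sketch for spotting such sockets follows; `foreign_listeners` is a hypothetical helper name, and it assumes the column layout of the `netstat -untalp` output above.

```shell
# Hypothetical helper: print the local address of every LISTEN socket whose
# owning process is not visible from the current container.
# Columns assumed: Proto Recv-Q Send-Q Local Foreign State PID/Program
foreign_listeners() {
  awk '$6 == "LISTEN" && $7 == "-" { print $4 }'
}

# Usage against a live cluster:
#   kubectl exec xiuxian-haha -c c2 -- netstat -untalp | foreign_listeners
```

Any address printed this way is already taken by another container in the Pod, so a service started in the current container has to use a different port.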

4.4 Change the service port of container 2

[root@master231 pods]# kubectl exec -it xiuxian-haha -c c2 -- sh
/ # sed -i '/listen/s#80#81#g' /etc/nginx/conf.d/default.conf 
/ # 
/ # grep listen /etc/nginx/conf.d/default.conf 
    listen       81;
    # proxy the PHP scripts to Apache listening on 127.0.0.1:81
    # pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
/ # 
/ # nginx
2025/04/07 08:43:28 [notice] 22#22: using the "epoll" event method
2025/04/07 08:43:28 [notice] 22#22: nginx/1.20.1
2025/04/07 08:43:28 [notice] 22#22: built by gcc 10.2.1 20201203 (Alpine 10.2.1_pre1) 
2025/04/07 08:43:28 [notice] 22#22: OS: Linux 5.15.0-119-generic
2025/04/07 08:43:28 [notice] 22#22: getrlimit(RLIMIT_NOFILE): 524288:524288
/ # 2025/04/07 08:43:28 [notice] 23#23: start worker processes
2025/04/07 08:43:28 [notice] 23#23: start worker process 24
2025/04/07 08:43:28 [notice] 23#23: start worker process 25

/ # 
/ # 
/ # ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 tail -f /etc/hosts
    7 root      0:00 sh
   23 root      0:00 nginx: master process nginx
   24 nginx     0:00 nginx: worker process
   25 nginx     0:00 nginx: worker process
   26 root      0:00 ps -ef
/ # 
/ # netstat -untalp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:81              0.0.0.0:*               LISTEN      23/nginx: master pr
tcp        0      0 :::80                   :::*                    LISTEN      -
/ # 

4.5 Check the Pod status

[root@master231 pods]# kubectl get pods -o wide
NAME           READY   STATUS    RESTARTS   AGE     IP              NODE        NOMINATED NODE   READINESS GATES
xiuxian-haha   2/2     Running   0          4m59s   10.100.140.76   worker233   <none>           <none>

4.6 Test access

[root@master231 pods]# curl 10.100.140.76:80
[root@master231 pods]# curl 10.100.140.76:81

V. Troubleshooting with explain

1.explain

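explain looks up the documentation for a specific field of a resource manifest. It is very useful when writing manifests or when you need help on a particular field.

Syntax
	kubectl explain   <type>.<fieldName>[.<fieldName>]

2. Common data types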

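<string>
	A string value. It can normally be wrapped in double quotes.
	
	In most cases the double quotes can be omitted, but values that would otherwise be parsed as another type must be quoted: for example digit-only values ("[0-9]") and words such as "yes", "no", "true" and "false".
	
<Object>
	The field has child fields, so a value cannot be written directly after it. To set the field without specifying any of its child fields, use "{}" as a placeholder.
	
<map[string]string>
	A Go map type: key-value data in which both the key and the value are strings.
	
	In most cases the keys and values are chosen freely by the user.
	
<boolean>
	A boolean. Only two values are valid: true and false.
	
<[]Object>
	An array of objects. Several sibling objects can be defined, each one introduced with "-".
	
	To address a particular element later, use its index: 0, 1, 2, ...

<[]string>
	An array of strings: several strings listed side by side.

<integer>
	An integer; the value must be a whole number. It is usually non-negative, and the underlying Go type is typically int32 or int64.

-required-
	A required field; it cannot be omitted.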
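
To make the type names concrete, the sketch below writes a small Pod manifest in which each field is annotated with the type that `kubectl explain` reports for it. The file name `types-demo.yaml`, the Pod name and the `scratch` volume are made up for this illustration; the image and the `tail` command are the ones used earlier in this post.

```shell
# Write an annotated example manifest to the current directory.
cat > types-demo.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: types-demo            # <string>
  labels:                     # <map[string]string>
    app: xiuxian
    version: "1"              # quoted, otherwise it is parsed as a number
spec:
  hostNetwork: false          # <boolean>
  containers:                 # <[]Object>, -required-
  - name: c1                  # <string>, -required-
    image: registry.cn-hangzhou.aliyuncs.com/yinzhengjie-k8s/apps:v1
    command: ["tail", "-f", "/etc/hosts"]   # <[]string>
    ports:                    # <[]Object>
    - containerPort: 80       # <integer>
  volumes:                    # <[]Object>
  - name: scratch
    emptyDir: {}              # <Object> with no child fields set
EOF
```

With access to a cluster, `kubectl apply --dry-run=client -f types-demo.yaml` can be used to check the manifest without creating the Pod.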

3. Examples

[root@master231 service]# kubectl explain svc
[root@master231 service]# kubectl explain svc.spec.ports
[root@master231 service]# kubectl explain po.spec.containers
[root@master231 service]# kubectl explain po.spec.containers.env
posted @ 2025-04-07 23:21  丁志岩