Differences Between Metrics Server and kubelet cAdvisor, and How to Deploy Them
In short, the differences are:
- Metrics Server monitors the nodes and Pods of a Kubernetes cluster and also provides the data that HPA relies on.
- cAdvisor monitors the containers running inside Pods.
- Metrics Server pulls its data from the kubelet, and the kubelet in turn collects Pod and container resource usage from cAdvisor.
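To make this data flow concrete: besides the summarized /metrics/resource endpoint that metrics-server scrapes, the kubelet also exposes the raw cAdvisor metrics on the same port 10250 under /metrics/cadvisor. A quick check you can run from /etc/kubernetes/pki on the control plane, reusing the apiserver-kubelet-client certificate shown in section I.3 (output omitted; the node IP is from this lab cluster):
curl -s -k --key apiserver-kubelet-client.key --cert apiserver-kubelet-client.crt \
  https://10.0.0.231:10250/metrics/cadvisor | head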
I. Deploying metrics-server and a Troubleshooting Case
1. What is metrics-server?
metrics-server supplies the monitoring data behind the "kubectl top" command for a Kubernetes cluster, and it is also what makes "HPA (HorizontalPodAutoscaler)" usable. Without it, the commands below fail:
[root@master231 ~]# kubectl top pods
error: Metrics API not available
[root@master231 ~]#
[root@master231 ~]# kubectl top nodes
error: Metrics API not available
[root@master231 ~]#
Deployment documentation:
https://github.com/kubernetes-sigs/metrics-server
2. What is the difference between HPA and VPA?
- HPA:
  When the existing Pods cannot keep up, it automatically increases the number of Pod replicas to absorb the extra traffic and bring the load back down.
- VPA:
  Dynamically adjusts a container's resource limits. For example, a Pod that starts with 200Mi of memory can have its memory raised once usage reaches a defined threshold, without adding Pod replicas (see the sketch below).
The essential difference is that VPA is bounded by per-node capacity: a Pod is the smallest schedulable unit in Kubernetes and cannot be split, so vertical scaling is ultimately limited by the resources of a single node.
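For comparison, here is a minimal VPA sketch. Note that VPA is not built into Kubernetes: this assumes you have installed the VerticalPodAutoscaler CRDs and controllers from the kubernetes/autoscaler project, and the target Deployment name (stress) is simply the one used later in this article.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: stress-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: stress
  updatePolicy:
    updateMode: "Auto"    # VPA may evict Pods and recreate them with updated resource requests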
3. metrics-server essentially pulls its monitoring data from the kubelet
[root@master231 pki]# pwd
/etc/kubernetes/pki
[root@master231 pki]#
[root@master231 pki]# ll apiserver-kubelet-client.*
-rw-r--r-- 1 root root 1164 Apr 7 11:00 apiserver-kubelet-client.crt
-rw------- 1 root root 1679 Apr 7 11:00 apiserver-kubelet-client.key
[root@master231 pki]#
[root@master231 pki]# curl -s -k --key apiserver-kubelet-client.key --cert apiserver-kubelet-client.crt https://10.0.0.231:10250/metrics/resource | wc -l
102
[root@master231 pki]#
[root@master231 pki]# curl -s -k --key apiserver-kubelet-client.key --cert apiserver-kubelet-client.crt https://10.0.0.232:10250/metrics/resource | wc -l
67
[root@master231 pki]#
[root@master231 pki]# curl -s -k --key apiserver-kubelet-client.key --cert apiserver-kubelet-client.crt https://10.0.0.233:10250/metrics/resource | wc -l
57
[root@master231 pki]#
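The same kubelet endpoint can also be reached through the API server's node proxy, which saves passing the client certificate by hand. A sketch, assuming your kubeconfig is allowed to proxy to nodes (the admin kubeconfig is):
kubectl get --raw "/api/v1/nodes/worker232/proxy/metrics/resource" | wc -l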
4. Deploy the metrics-server component
4.1 Download the manifest
[root@master231 ~]# wget https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/high-availability-1.21+.yaml
4.2 Edit the manifest
[root@master231 ~]# vim high-availability-1.21+.yaml
...
114 apiVersion: apps/v1
115 kind: Deployment
116 metadata:
...
144 - args:
145 - --kubelet-insecure-tls # Do not verify the CA of the serving certificates presented by the kubelets; without this flag metrics-server fails with an x509 error.
...
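For reference, after the edit the container args block should look roughly like the sketch below. The exact flag list and line numbers differ between metrics-server releases, so treat it only as a guide for where the new flag goes:
    spec:
      containers:
      - args:
        - --cert-dir=/tmp
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
        - --kubelet-use-node-status-port
        - --metric-resolution=15s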
4.3 Apply the metrics-server manifest
[root@master231 ~]# kubectl apply -f high-availability-1.21+.yaml
serviceaccount/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created
service/metrics-server created
deployment.apps/metrics-server created
poddisruptionbudget.policy/metrics-server created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
[root@master231 ~]#
4.4 Check that the metrics-server Pods are running
[root@master231 ~]# kubectl get pods -o wide -n kube-system -l k8s-app=metrics-server
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
metrics-server-6b4f784878-gwsf5 1/1 Running 0 27s 10.100.203.150 worker232 <none> <none>
metrics-server-6b4f784878-qjvwr 1/1 Running 0 27s 10.100.140.81 worker233 <none> <none>
4.5 Verify that metrics-server is working
[root@master231 ~]# kubectl top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
master231 136m 6% 2981Mi 78%
worker232 53m 2% 1707Mi 45%
worker233 45m 2% 1507Mi 39%
[root@master231 ~]#
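Under the hood, kubectl top reads the aggregated metrics.k8s.io API, which you can also query directly as a sanity check (JSON output omitted here):
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/kube-system/pods"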
[root@master231 ~]# kubectl top pods -n kube-system
NAME CPU(cores) MEMORY(bytes)
coredns-6d8c4cb4d-bknzr 1m 11Mi
coredns-6d8c4cb4d-cvp9w 1m 31Mi
etcd-master231 10m 75Mi
kube-apiserver-master231 33m 334Mi
kube-controller-manager-master231 8m 56Mi
kube-proxy-29dbp 4m 19Mi
kube-proxy-hxmzb 7m 18Mi
kube-proxy-k92k2 1m 31Mi
kube-scheduler-master231 2m 17Mi
metrics-server-6b4f784878-gwsf5 2m 17Mi
metrics-server-6b4f784878-qjvwr 2m 17Mi
[root@master231 ~]#
[root@master231 ~]# kubectl top pods -A
NAMESPACE NAME CPU(cores) MEMORY(bytes)
calico-apiserver calico-apiserver-64b779ff45-cspxl 4m 28Mi
calico-apiserver calico-apiserver-64b779ff45-fw6pc 3m 29Mi
calico-system calico-kube-controllers-76d5c7cfc-89z7j 3m 16Mi
calico-system calico-node-4cvnj 16m 140Mi
calico-system calico-node-qbxmn 16m 143Mi
calico-system calico-node-scwkd 17m 138Mi
calico-system calico-typha-595f8c6fcb-bhdw6 1m 18Mi
calico-system calico-typha-595f8c6fcb-f2fw6 2m 22Mi
calico-system csi-node-driver-2mzq6 1m 8Mi
calico-system csi-node-driver-7z4hj 1m 8Mi
calico-system csi-node-driver-m66z9 1m 15Mi
default xiuxian-6dffdd86b-m8f2h 1m 33Mi
kube-system coredns-6d8c4cb4d-bknzr 1m 11Mi
kube-system coredns-6d8c4cb4d-cvp9w 1m 31Mi
kube-system etcd-master231 16m 74Mi
kube-system kube-apiserver-master231 35m 334Mi
kube-system kube-controller-manager-master231 9m 57Mi
kube-system kube-proxy-29dbp 4m 19Mi
kube-system kube-proxy-hxmzb 7m 18Mi
kube-system kube-proxy-k92k2 10m 31Mi
kube-system kube-scheduler-master231 2m 17Mi
kube-system metrics-server-6b4f784878-gwsf5 2m 17Mi
kube-system metrics-server-6b4f784878-qjvwr 2m 17Mi
kuboard kuboard-agent-2-6964c46d56-cm589 5m 9Mi
kuboard kuboard-agent-77dd5dcd78-jc4rh 5m 24Mi
kuboard kuboard-etcd-qs5jh 4m 35Mi
kuboard kuboard-v3-685dc9c7b8-2pd2w 36m 353Mi
metallb-system controller-686c7db689-cnj2c 1m 18Mi
metallb-system speaker-srvw8 3m 31Mi
metallb-system speaker-tgwql 3m 17Mi
metallb-system speaker-zpn5c 3m 17Mi
tigera-operator tigera-operator-8d497bb9f-bcj5s 2m 27Mi
II. Horizontal Pod Autoscaling (HPA) in Practice
1. What is HPA?
HPA is a built-in Kubernetes resource; its full name is "HorizontalPodAutoscaler".
It scales Pods horizontally and automatically: during traffic peaks it adds Pod replicas, and during quiet periods it removes them again to reduce waste.
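The scaling decision follows the formula documented for the HPA controller: desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). For example, with 3 replicas averaging 250% of their CPU request against a 95% target, the controller wants ceil(3 * 250 / 95) = ceil(7.9) = 8 replicas, which is then clamped to maxReplicas (5 in the walkthrough below).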
2. HPA walkthrough
2.1 Create the Deployment and the HPA
[root@master231 ~]# cat 01-deploy-hpa.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stress
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stress
  template:
    metadata:
      labels:
        app: stress
    spec:
      containers:
      - image: jasonyin2020/dezyan-linux-tools:v0.1
        name: dezyan-linux-tools
        args:
        - tail
        - -f
        - /etc/hosts
        resources:
          requests:
            cpu: 0.2
            memory: 300Mi
          limits:
            cpu: 0.5
            memory: 500Mi
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: stress-hpa
spec:
  maxReplicas: 5
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: stress
  targetCPUUtilizationPercentage: 95
[root@master231 ~]# kubectl apply -f 01-deploy-hpa.yaml
deployment.apps/stress created
horizontalpodautoscaler.autoscaling/stress-hpa created
[root@master231 ~]#
[root@master231 ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
stress-5585b5ccc-tlf8p 0/1 ContainerCreating 0 7s <none> worker233 <none> <none>
[root@master231 ~]#
[root@master231 ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
stress-5585b5ccc-tlf8p 1/1 Running 0 15s 10.100.140.80 worker233 <none> <none>
[root@master231 ~]#
2.2 Verify the initial state
[root@master231 ~]# kubectl get deployments.apps
NAME READY UP-TO-DATE AVAILABLE AGE
stress 2/2 2 2 94s
[root@master231 ~]#
[root@master231 ~]# kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
stress-hpa Deployment/stress 0%/95% 2 5 2 98s
[root@master231 ~]#
[root@master231 ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
stress-5585b5ccc-6hm85 1/1 Running 0 85s 10.100.203.154 worker232 <none> <none>
stress-5585b5ccc-tlf8p 1/1 Running 0 100s 10.100.140.80 worker233 <none> <none>
[root@master231 ~]#
The HPA can also be created imperatively; with --dry-run=client -o yaml the command below only prints the generated manifest (drop those two flags to actually create it):
[root@master231 horizontalpodautoscalers]# kubectl autoscale deploy stress --min=2 --max=5 --cpu-percent=95 -o yaml --dry-run=client
2.3 Stress test
[root@master231 ~]# kubectl exec stress-5585b5ccc-6hm85 -- stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 10m
stress: info: [7] dispatching hogs: 8 cpu, 4 io, 2 vm, 0 hdd
2.4 Watch the replica count
[root@master231 ~]# kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
stress-hpa Deployment/stress 125%/95% 2 5 3 4m48s
[root@master231 ~]#
[root@master231 ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
stress-5585b5ccc-6hm85 1/1 Running 0 5m34s 10.100.203.154 worker232 <none> <none>
stress-5585b5ccc-b2wdd 1/1 Running 0 78s 10.100.140.83 worker233 <none> <none>
stress-5585b5ccc-tlf8p 1/1 Running 0 5m49s 10.100.140.80 worker233 <none> <none>
[root@master231 ~]#
[root@master231 ~]# kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
stress-hpa Deployment/stress 83%/95% 2 5 3 5m31s
2.5 Apply more load
[root@master231 ~]# kubectl exec stress-5585b5ccc-b2wdd -- stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 10m
stress: info: [6] dispatching hogs: 8 cpu, 4 io, 2 vm, 0 hdd
[root@master231 ~]# kubectl exec stress-5585b5ccc-tlf8p -- stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 10m
stress: info: [7] dispatching hogs: 8 cpu, 4 io, 2 vm, 0 hdd
2.6 The Pod count is capped at 5
[root@master231 ~]# kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
stress-hpa Deployment/stress 177%/95% 2 5 3 7m27s
[root@master231 ~]#
[root@master231 ~]# kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
stress-hpa Deployment/stress 250%/95% 2 5 5 7m33s
[root@master231 ~]#
[root@master231 ~]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
stress-5585b5ccc-6hm85 1/1 Running 0 7m59s 10.100.203.154 worker232 <none> <none>
stress-5585b5ccc-b2wdd 1/1 Running 0 3m43s 10.100.140.83 worker233 <none> <none>
stress-5585b5ccc-l6d97 1/1 Running 0 58s 10.100.203.149 worker232 <none> <none>
stress-5585b5ccc-sqlzz 1/1 Running 0 58s 10.100.140.82 worker233 <none> <none>
stress-5585b5ccc-tlf8p 1/1 Running 0 8m14s 10.100.140.80 worker233 <none> <none>
[root@master231 ~]#
[root@master231 ~]# kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
stress-hpa Deployment/stress 150%/95% 2 5 5 8m26s
[root@master231 ~]#
2.7 After the load stops
Wait about 5 minutes and the HPA scales the Deployment back down to 2 replicas on its own (5 minutes is the controller's default downscale stabilization window).
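If a shorter or longer cool-down is needed, the stabilization window can be set per HPA, but only through the autoscaling/v2 API. A minimal sketch equivalent to the HPA above, with a 60-second scale-down window chosen purely as an example value:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: stress-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: stress
  minReplicas: 2
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 95
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60    # wait 60s of consistently low usage before removing replicas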
3. Troubleshooting case: x509 certificate errors
[root@master231 ~]# kubectl get pods -o wide -n kube-system -l k8s-app=metrics-server
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
metrics-server-6f5b66d8f9-fvbqm 0/1 Running 0 15m 10.100.203.151 worker232 <none> <none>
metrics-server-6f5b66d8f9-n2zxs 0/1 Running 0 15m 10.100.140.77 worker233 <none> <none>
[root@master231 ~]#
[root@master231 ~]# kubectl -n kube-system logs metrics-server-6f5b66d8f9-fvbqm
...
E0414 09:30:03.341444 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.0.0.233:10250/metrics/resource\": tls: failed to verify certificate: x509: cannot validate certificate for 10.0.0.233 because it doesn't contain any IP SANs" node="worker233"
E0414 09:30:03.352008 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.0.0.232:10250/metrics/resource\": tls: failed to verify certificate: x509: cannot validate certificate for 10.0.0.232 because it doesn't contain any IP SANs" node="worker232"
E0414 09:30:03.354140 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.0.0.231:10250/metrics/resource\": tls: failed to verify certificate: x509: cannot validate certificate for 10.0.0.231 because it doesn't contain any IP SANs" node="master231"
Problem analysis:
    metrics-server cannot validate the kubelet serving certificates (they contain no IP SANs), so it fails to scrape every node and its Pods never become Ready.
Fix:
[root@master231 ~]# vim high-availability-1.21+.yaml
...
114 apiVersion: apps/v1
115 kind: Deployment
116 metadata:
...
144 - args:
145 - --kubelet-insecure-tls # Do not verify the CA of the serving certificates presented by the kubelets; without this flag metrics-server fails with an x509 error.
...
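After adding the flag, re-apply the manifest and confirm that the Pods become Ready and that the aggregated API reports Available (commands only; output omitted):
kubectl apply -f high-availability-1.21+.yaml
kubectl -n kube-system get pods -l k8s-app=metrics-server
kubectl get apiservices v1beta1.metrics.k8s.io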
III. Monitoring Containers with Prometheus and cAdvisor
1. Pull the image on the Docker nodes
[root@elk92 ~]# docker pull gcr.io/cadvisor/cadvisor-amd64:v0.52.1
2. Run cAdvisor
[root@elk92 ~]# docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=18080:8080 \
--detach=true \
--name=cadvisor \
--privileged \
--device=/dev/kmsg \
gcr.io/cadvisor/cadvisor-amd64:v0.52.1
[root@elk92 ~]# docker ps -l
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3857cc9ff34e gcr.io/cadvisor/cadvisor-amd64:v0.52.1 "/usr/bin/cadvisor -…" 2 minutes ago Up About a minute (healthy) 0.0.0.0:18080->8080/tcp, :::18080->8080/tcp cadvisor
[root@elk92 ~]#
[root@elk92 ~]# curl -s http://10.0.0.92:18080/metrics | wc -l
4885
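To peek at a few of the container metric families cAdvisor exposes (a quick sanity check; the exact lines will vary by host):
curl -s http://10.0.0.92:18080/metrics | grep -E '^container_(cpu|memory)_' | head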
3. Open the cAdvisor web UI to test
http://10.0.0.92:18080/containers/
4. Configure Prometheus to scrape the cAdvisor endpoints
[root@prometheus-server31 ~]# tail -6 /dezyan/softwares/prometheus-2.53.4.linux-amd64/prometheus.yml
  - job_name: docker-cadVisor-exporter
    static_configs:
      - targets:
          - 10.0.0.92:18080
          - 10.0.0.93:18080
[root@prometheus-server31 ~]# curl -X POST http://10.0.0.31:9090/-/reload
5. Verify that the Prometheus configuration took effect
http://10.0.0.31:9090/targets?search=
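Once both targets show as UP, a quick query against the Prometheus HTTP API (or the same expression pasted into the web UI) confirms that container metrics are flowing. container_cpu_usage_seconds_total is one of the metric families cAdvisor exposes, and the job label matches the scrape config above:
curl -s http://10.0.0.31:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total{job="docker-cadVisor-exporter",name!=""}[5m])) by (name)'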
6. Grafana dashboard template ID to import
11600
This article comes from cnblogs, author: 丁志岩. Please credit the original post when reposting: https://www.cnblogs.com/dezyan/p/18844184
