02 | Basics: How Should We Actually Understand "Load Average"?

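I. The relationship between load average and CPU usage

1. Load average

Load average is the average number of processes in the runnable or uninterruptible state per unit of time. It has no direct relationship with CPU usage.
Runnable state:
Processes that are using a CPU or waiting for one.
Uninterruptible state:
Processes that are in the middle of a critical kernel path that cannot be interrupted. The most common example is waiting for an I/O response from a hardware device, which shows up as the D state in the output of ps aux. The uninterruptible state is in effect a protection mechanism that the system applies for hardware devices.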
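
To make the two states concrete, here is a minimal shell sketch that counts how many processes are in the R (runnable) and D (uninterruptible) states. The function name count_active_states is a made-up helper, not part of any tool. It only reports an instantaneous snapshot, whereas the load average is an average over time, so the two numbers will not match exactly.

```shell
# Count processes in the R (runnable) and D (uninterruptible sleep) states.
# Reads one process-state string per line on stdin, for example the output
# of `ps -eo state=`; only the first character of each line is inspected.
count_active_states() {
    awk '
        { s = substr($1, 1, 1) }
        s == "R" { r++ }
        s == "D" { d++ }
        END { printf "R=%d D=%d active=%d\n", r, d, r + d }
    '
}
```

On a Linux system it could be fed with `ps -eo state= | count_active_states`. The ps process itself is usually counted as one R process.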

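2. What does a load average of 2 mean?

Since load average is the average number of active processes, the ideal case is exactly one process running on each CPU, so that every CPU is fully used. What, then, does a load average of 2 mean?

1. On a system with only 2 CPUs, all CPUs are exactly fully occupied.
2. On a system with 4 CPUs, the CPUs are 50% idle.
3. On a system with 1 CPU, half of the processes cannot get CPU time.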
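
The three cases above are simple arithmetic on the load average and the CPU count. The following sketch reproduces them. describe_load is a hypothetical helper name, the CPU count is assumed to be greater than zero, and the output follows the simplified reading used in the list (it treats every active process as wanting a CPU).

```shell
# Print how a given load average relates to the number of CPUs.
# $1 = load average, $2 = number of CPUs (must be greater than zero).
describe_load() {
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        pct = load / cpus * 100
        if (pct < 100)
            printf "%.0f%% of CPU capacity in use, %.0f%% idle\n", pct, 100 - pct
        else if (pct == 100)
            printf "all CPUs exactly fully occupied\n"
        else
            printf "overloaded: %.0f%% of active processes cannot get a CPU\n", (load - cpus) / load * 100
    }'
}
```

For example, `describe_load 2 4` corresponds to the second case in the list and `describe_load 2 1` to the third.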

II. What load average is reasonable?

Ideally, the load average equals the number of CPUs.

1. Check how many CPUs the system has

grep 'model name' /proc/cpuinfo | wc -l
2

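When the load average exceeds 70% of the number of CPUs, it is time to investigate what is driving the load. Once the load gets too high, processes may respond more slowly, which in turn affects the normal operation of services.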
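
The 70% rule of thumb can be checked with a one-line comparison. This is a minimal sketch: load_is_high is a hypothetical helper, and the optional third argument for a different threshold is an addition for illustration only.

```shell
# Succeed (exit status 0) when the load average exceeds a given percentage
# of the CPU count. $1 = load average, $2 = CPU count, $3 = threshold in
# percent (defaults to the 70% rule of thumb from the text).
load_is_high() {
    awk -v load="$1" -v cpus="$2" -v pct="${3:-70}" \
        'BEGIN { exit !(load > cpus * pct / 100) }'
}
```

On Linux, one way to call it is `load_is_high "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)" && echo "load is high, investigate"`, which compares the 1-minute load average against the number of available CPUs.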

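III. Load average and CPU usage

1. Load average

Load average counts:

1. Processes that are using a CPU
2. Processes that are waiting for a CPU
3. Processes that are waiting for I/O

2. CPU usage

CPU usage measures how busy the CPUs are per unit of time, so it does not always move together with the load average:

1. CPU-intensive processes use a lot of CPU and raise the load average; here the two agree.
2. I/O-intensive processes raise the load average while waiting for I/O, but CPU usage is not necessarily high.
3. A large number of processes waiting to be scheduled on a CPU also raises the load average, and in this case CPU usage is high as well.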
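
The three cases can be turned into a rough decision rule. The sketch below is only an illustration: classify_load is a hypothetical helper, the 50% threshold is an arbitrary assumption rather than an established value, and the CPU busy and iowait percentages are assumed to come from something like the "all" row of mpstat (%usr plus %sys, and %iowait).

```shell
# Roughly classify the source of a high load average.
# $1 = load average, $2 = CPU count, $3 = CPU busy percent (%usr + %sys),
# $4 = %iowait, $5 = threshold in percent (hypothetical default: 50).
classify_load() {
    awk -v load="$1" -v cpus="$2" -v busy="$3" -v iowait="$4" -v t="${5:-50}" 'BEGIN {
        if (iowait >= t) {
            print "io-wait"                 # case 2: processes waiting for I/O
        } else if (busy >= t && load > cpus) {
            print "cpu-contention"          # case 3: more runnable processes than CPUs
        } else if (busy >= t) {
            print "cpu-intensive"           # case 1: CPUs busy, no queueing
        } else {
            print "unclear"
        }
    }'
}
```

A real diagnosis still needs per-process data, so treat the result as a hint about where to look next, not as a verdict.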

IV. Hands-on

1. Environment and test tools

1. Operating system

[root@lhr ~]# cat /etc/redhat-release 
CentOS Linux release 7.6.1810 (Core)

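2. Test tools

# stress is used in scenarios 1 and 3, stress-ng in scenario 2; on CentOS 7 these packages may require the EPEL repository
yum install stress stress-ng sysstat -y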
rpm -qa|grep sysstat

2. Scenario 1: CPU-intensive processes

1. Window 1

# --cpu N: start N worker processes that keep a CPU busy
# --timeout 600: stop automatically after 600 seconds
[root@luoahong ~]# stress --cpu 1 --timeout 600
stress: info: [1307] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd

2. Window 2

# -P ALL monitors all CPUs; the trailing 5 prints one set of data every 5 seconds
[root@lhr ~]# mpstat -P ALL 5
Linux 3.10.0-957.21.3.el7.x86_64 (lhr) 2019年10月29日 _x86_64_    (2 CPU)

18时23分01秒 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
18时23分06秒 all 99.90 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
18时23分06秒 0 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
18时23分06秒 1 99.80 0.00 0.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00

18时23分06秒 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
18时23分11秒 all 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
18时23分11秒 0 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
18时23分11秒 1 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

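CPU usage is at 100% (on both CPUs in this capture), while %iowait is 0. This shows that the rise in load average is caused by CPU usage reaching 100%.

So which process is driving CPU usage to 100%? You can use pidstat to find out.

3. Window 3

[root@lhr ~]# pidstat -u 5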
Linux 3.10.0-957.21.3.el7.x86_64 (lhr) 2019年10月29日 _x86_64_    (2 CPU)

18时24分47秒 UID PID %usr %system %guest %CPU CPU Command
18时24分52秒 0 1027 0.40 0.00 0.00 0.40 1 AliYunDun
18时24分52秒 0 26190 99.80 0.00 0.00 99.80 1 stress
18时24分52秒 0 26191 99.60 0.00 0.00 99.60 0 stress

18时24分52秒 UID PID %usr %system %guest %CPU CPU Command
18时24分57秒 0 1027 0.00 0.20 0.00 0.20 1 AliYunDun
18时24分57秒 0 26132 0.20 0.00 0.00 0.20 0 sshd
18时24分57秒 0 26190 99.60 0.20 0.00 99.80 1 stress
18时24分57秒 0 26191 99.80 0.00 0.00 99.80 0 stress
18时24分57秒 0 26194 0.00 0.20 0.00 0.20 0 pidstat

平均时间: UID PID %usr %system %guest %CPU CPU Command
平均时间: 0 1027 0.20 0.10 0.00 0.30 - AliYunDun
平均时间: 0 26132 0.10 0.00 0.00 0.10 - sshd
平均时间: 0 26190 99.70 0.10 0.00 99.80 - stress
平均时间: 0 26191 99.70 0.00 0.00 99.70 - stress
平均时间: 0 26194 0.00 0.10 0.00 0.10 - pidstat

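The summary rows (labelled 平均时间, that is "Average", in this system's locale) make it clear that the stress processes are using close to 100% CPU (99.80% and 99.70%).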
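
Picking the heaviest process out of the pidstat summary by eye works for a short list, but it can also be scripted. The sketch below is one possible way, not part of sysstat: top_cpu_process is a hypothetical helper, and it assumes English output (rows starting with "Average:") and command names without spaces.

```shell
# Print the command with the highest average %CPU from `pidstat -u` output
# read on stdin. Only the "Average:" summary rows are used. The PID and %CPU
# columns are located from the header row, so extra columns do not matter.
top_cpu_process() {
    awk '
        $1 != "Average:" { next }
        cpucol == 0 {                    # first summary row is the header
            for (i = 1; i <= NF; i++) {
                if ($i == "%CPU") cpucol = i
                if ($i == "PID")  pidcol = i
            }
            next
        }
        $cpucol + 0 > max { max = $cpucol + 0; pid = $pidcol; cmd = $NF }
        END { if (cmd != "") printf "%s (pid %s) %.2f%%\n", cmd, pid, max }
    '
}
```

A possible invocation is `LC_ALL=C pidstat -u 5 1 | top_cpu_process`, where LC_ALL=C is used to request untranslated output.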

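3. Scenario 2: I/O-intensive processes

1. Window 1

# -i 1: start one worker that keeps calling sync(); --hdd 1: start one worker that keeps writing and reading temporary files
stress-ng -i 1 --hdd 1 --timeout 600

2. Window 2

# -d highlights the parts of the output that change between updates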
$ watch -d uptime
..., load average: 2.17, 0.84, 0.40
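
The three numbers printed by uptime are the 1-minute, 5-minute and 15-minute load averages. If they are needed in a script, they can be extracted from the line with a small sed expression. This is a minimal sketch: parse_loadavg is a hypothetical helper and it assumes the "load average: a, b, c" layout with a dot as the decimal separator, as in the output above.

```shell
# Extract the 1-, 5- and 15-minute load averages from an `uptime` line
# read on stdin and print them separated by single spaces.
parse_loadavg() {
    sed -n 's/.*load average: *\([0-9.]*\), *\([0-9.]*\), *\([0-9.]*\).*/\1 \2 \3/p'
}
```

For example, `uptime | parse_loadavg` prints just the three values on one line.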

3. Window 3

[root@lhr ~]# mpstat -P ALL 5 1
Linux 3.10.0-957.21.3.el7.x86_64 (lhr) 2019年10月29日 _x86_64_    (2 CPU)

18时27分49秒 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
18时27分54秒 all 0.51 0.00 12.90 85.06 0.00 0.00 0.00 0.00 0.00 1.54
18时27分54秒 0 0.42 0.00 10.42 87.50 0.00 0.00 0.00 0.00 0.00 1.67
18时27分54秒 1 0.40 0.00 15.52 82.86 0.00 0.00 0.00 0.00 0.00 1.21

平均时间: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
平均时间: all 0.51 0.00 12.90 85.06 0.00 0.00 0.00 0.00 0.00 1.54
平均时间: 0 0.42 0.00 10.42 87.50 0.00 0.00 0.00 0.00 0.00 1.67
平均时间: 1 0.40 0.00 15.52 82.86 0.00 0.00 0.00 0.00 0.00 1.21
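In window 2, the 1-minute load average climbs gradually, reaching 2.17 in the uptime capture above.

On one of the CPUs, system CPU usage (%sys) rose to 10.42%, while %iowait reached 87.50%. This shows that the rise in load average is caused by the rise in iowait.
So which process is causing such high iowait?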

[root@lhr ~]# pidstat -u 5 1
Linux 3.10.0-957.21.3.el7.x86_64 (lhr) 2019年10月29日 _x86_64_    (2 CPU)

18时29分46秒 UID PID %usr %system %guest %CPU CPU Command
18时29分51秒 0 324 0.00 0.80 0.00 0.80 0 kworker/0:1H
18时29分51秒 0 329 0.00 2.20 0.00 2.20 1 jbd2/vda1-8
18时29分51秒 0 895 0.20 0.00 0.00 0.20 0 tuned
18时29分51秒 0 1027 0.40 0.00 0.00 0.40 1 AliYunDun
18时29分51秒 997 21285 0.20 0.00 0.00 0.20 0 mysqld
18时29分51秒 0 21586 0.00 2.00 0.00 2.00 1 kworker/u4:1
18时29分51秒 0 26196 0.40 20.96 0.00 21.36 0 stress-ng-hdd
18时29分51秒 0 26197 0.00 0.20 0.00 0.20 1 stress-ng-io
18时29分51秒 0 26198 0.00 0.20 0.00 0.20 0 watch

平均时间: UID PID %usr %system %guest %CPU CPU Command
平均时间: 0 324 0.00 0.80 0.00 0.80 - kworker/0:1H
平均时间: 0 329 0.00 2.20 0.00 2.20 - jbd2/vda1-8
平均时间: 0 895 0.20 0.00 0.00 0.20 - tuned
平均时间: 0 1027 0.40 0.00 0.00 0.40 - AliYunDun
平均时间: 997 21285 0.20 0.00 0.00 0.20 - mysqld
平均时间: 0 21586 0.00 2.00 0.00 2.00 - kworker/u4:1
平均时间: 0 26196 0.40 20.96 0.00 21.36 - stress-ng-hdd
平均时间: 0 26197 0.00 0.20 0.00 0.20 - stress-ng-io
平均时间: 0 26198 0.00 0.20 0.00 0.20 - watch

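4. Scenario 3: a large number of processes

When the number of running processes exceeds what the CPUs can handle, processes start waiting for a CPU.
For example, we use stress again, but this time simulate 8 processes:

1. Window 1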

[root@luoahong ~]# stress -c 8 --timeout 600
stress: info: [5270] dispatching hogs: 8 cpu, 0 io, 0 vm, 0 hdd

2. Window 2

[root@luoahong ~]# uptime
21:16:07 up 1:43, 3 users, load average: 5.98, 2.14, 1.19

3. Window 3

[root@lhr ~]# pidstat -u 5 1
Linux 3.10.0-957.21.3.el7.x86_64 (lhr) 2019年10月29日 _x86_64_    (2 CPU)

18时31分45秒 UID PID %usr %system %guest %CPU CPU Command
18时31分50秒 0 1027 0.00 0.20 0.00 0.20 1 AliYunDun
18时31分50秒 0 26512 25.30 0.00 0.00 25.30 0 stress
18时31分50秒 0 26513 24.90 0.00 0.00 24.90 1 stress
18时31分50秒 0 26514 24.70 0.00 0.00 24.70 0 stress
18时31分50秒 0 26515 24.90 0.00 0.00 24.90 1 stress
18时31分50秒 0 26516 24.90 0.00 0.00 24.90 1 stress
18时31分50秒 0 26517 25.10 0.00 0.00 25.10 0 stress
18时31分50秒 0 26518 24.90 0.00 0.00 24.90 0 stress
18时31分50秒 0 26519 24.70 0.00 0.00 24.70 1 stress

平均时间: UID PID %usr %system %guest %CPU CPU Command
平均时间: 0 1027 0.00 0.20 0.00 0.20 - AliYunDun
平均时间: 0 26512 25.30 0.00 0.00 25.30 - stress
平均时间: 0 26513 24.90 0.00 0.00 24.90 - stress
平均时间: 0 26514 24.70 0.00 0.00 24.70 - stress
平均时间: 0 26515 24.90 0.00 0.00 24.90 - stress
平均时间: 0 26516 24.90 0.00 0.00 24.90 - stress
平均时间: 0 26517 25.10 0.00 0.00 25.10 - stress
平均时间: 0 26518 24.90 0.00 0.00 24.90 - stress
平均时间: 0 26519 24.70 0.00 0.00 24.70 - stress

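As the output shows, 8 processes are competing for 2 CPUs, so each process gets only about 25% of a CPU and spends roughly 75% of its time waiting for one. (Newer sysstat versions report this waiting time directly in pidstat's %wait column, which the output above does not include.) These processes, which exceed the CPUs' computing capacity, end up overloading the CPUs.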
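
The 25% and 75% figures follow directly from dividing 2 CPUs among 8 equally demanding processes. The sketch below works through that arithmetic. cpu_share is a hypothetical helper, and it assumes identical CPU-bound processes that the scheduler treats equally.

```shell
# Estimate each process's CPU share and waiting time when N identical
# CPU-bound processes compete for C CPUs under equal scheduling.
# $1 = number of processes, $2 = number of CPUs.
cpu_share() {
    awk -v n="$1" -v c="$2" 'BEGIN {
        share = (n <= c) ? 100 : c / n * 100   # no contention when n <= c
        printf "share=%.0f%% wait=%.0f%%\n", share, 100 - share
    }'
}
```

Calling `cpu_share 8 2` gives the values seen in this scenario, which agree with the roughly 25% %CPU that pidstat reports for each stress process.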

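V. Summary

1. A high load average may be caused by CPU-intensive processes.
2. A high load average does not necessarily mean high CPU usage; it may instead mean that I/O has become busier.
3. When you find that the load is high, you can use tools such as mpstat and pidstat to help analyze where the load comes from.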

posted @ 2019-10-29 17:43 pluto2charon
