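Preface
This article summarizes some common production-incident symptoms along with the troubleshooting steps and tools for each. The main goal is to give readers who have had little exposure to production issues a sense of what to expect, so they are not caught scrambling when a real incident hits. After all, the author got here by scrambling too.
One reminder first. During incident response there is only one overall goal: restore service as quickly as possible and eliminate the impact. Whatever stage of the response you are in, recovery comes first. Recovering does not require having located the problem or having a perfect fix; it may come from an experience-based judgment or from flipping a pre-built switch, and either can get you to a fast recovery. After that, preserve part of the scene, then locate the problem, fix it, and hold a postmortem.
In most cases we restore service first, keep the diagnostic information from the time of the incident (heap dump, thread dump, GC log, and so on; in a real emergency you can even skip this and reproduce the problem afterwards), and review the problem only once the service is back to normal.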
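As a rough illustration of "keeping the scene", here is a minimal sketch that writes a thread dump of the current JVM to a file before any restart. It is an assumption-laden example, not the author's procedure: the class name SceneSaver, the file naming, and the idea of triggering this from inside the application are all hypothetical. In practice the same information is usually captured from outside the process with tools such as jstack.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: saves a thread dump of the current JVM for later analysis.
public class SceneSaver {

    // Renders each thread as a header line followed by its stack frames.
    static String formatThreadDump(ThreadInfo[] infos) {
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // thread terminated while the dump was taken
            }
            sb.append('"').append(info.getThreadName()).append("\" id=")
              .append(info.getThreadId())
              .append(" state=").append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    // Writes the dump into dir and returns the created file.
    static Path saveThreadDump(Path dir) throws IOException {
        // false, false: skip lock details so no capability check is needed.
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        Path file = dir.resolve("thread-dump-" + System.currentTimeMillis() + ".txt");
        Files.write(file, formatThreadDump(infos).getBytes(StandardCharsets.UTF_8));
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("incident-");
        System.out.println("thread dump written to " + saveThreadDump(dir));
    }
}
```

The point of the sketch is only the ordering: capture something cheap first, then recover, then analyze the saved file afterwards.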

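All right, let's get to the main topic.
Common symptom: high or spiking CPU utilization
Scenario:
The monitoring system suddenly raises an alert saying the server load is abnormal.
Before we start:
A CPU spike is only a symptom; the underlying problem can be one of many. Here it simply serves as an entry point.
Note: CPU utilization is an important indicator of how busy a system is, but its safe threshold is relative and depends on whether your system is IO-intensive or compute-intensive. Generally, compute-intensive applications show high CPU utilization with relatively low load, while IO-intensive applications show the opposite.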
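To make the distinction between CPU utilization and load concrete, the following minimal sketch reads the 1-minute load average and the core count through the standard OperatingSystemMXBean and normalizes one by the other. The class name LoadCheck and the helper loadPerCore are hypothetical, and the "load per core" ratio is just a common rule of thumb, not a threshold from this article.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper: relates the system load average to the number of cores.
public class LoadCheck {

    // Load average divided by core count; values well above 1.0 mean more
    // tasks are runnable (or, on Linux, blocked on IO) than there are cores.
    static double loadPerCore(double loadAverage, int cores) {
        if (cores <= 0) {
            throw new IllegalArgumentException("cores must be positive");
        }
        return loadAverage / cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        int cores = os.getAvailableProcessors();
        double load = os.getSystemLoadAverage(); // negative when not available
        if (load < 0) {
            System.out.println("load average is not available on this platform");
            return;
        }
        System.out.printf("cores=%d, 1-min load=%.2f, load per core=%.2f%n",
                cores, load, loadPerCore(load, cores));
    }
}
```

On Linux the load average also counts tasks in uninterruptible sleep (typically waiting on disk IO), which is why an IO-bound machine can show a high load while its CPU utilization stays low.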
Common causes:
Frequent GC
Infinite loops, thread blocking, IO wait, etc.
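For the "frequent GC" cause, a quick first check is how much time the JVM has spent collecting. The sketch below is one possible way to read that from inside the process using the standard GarbageCollectorMXBean API; the class name GcCheck and the helper gcTimePercent are hypothetical, and the resulting percentage is only a coarse indicator because collectors report their time differently.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: prints per-collector GC counts and the overall GC time share.
public class GcCheck {

    // Percentage of JVM uptime spent in GC; 0 when uptime is not positive.
    static double gcTimePercent(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return 100.0 * gcMillis / uptimeMillis;
    }

    public static void main(String[] args) {
        long totalGcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if undefined for this collector
            long time = gc.getCollectionTime();   // -1 if undefined for this collector
            System.out.printf("%s: count=%d, time=%dms%n", gc.getName(), count, time);
            if (time > 0) {
                totalGcMillis += time;
            }
        }
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC time share since JVM start: %.2f%%%n",
                gcTimePercent(totalGcMillis, uptime));
    }
}
```

A steadily climbing collection count together with a large time share points toward GC pressure rather than a busy loop in application code.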
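Simulation
To demonstrate, we use the simplest possible busy loop to simulate a CPU spike. The simulation code is below.
Add the CpuReaper class to a minimal Spring Boot web project: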
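import javax.annotation.PostConstruct; // jakarta.annotation.PostConstruct on Spring Boot 3+

import org.springframework.stereotype.Component;

/**
 * Simulates a CPU spike.
 *
 * @author Richard_yyf
 */
@Component
public class CpuReaper {

    // Runs during bean initialization and spins one core for about 1000 seconds.
    @PostConstruct
    public void cpuReaper() {
        int num = 0;
        long start = System.currentTimeMillis() / 1000; // start time in seconds
        while (true) {
            num = num + 1;
            if (num == Integer.MAX_VALUE) {
                System.out.println("reset");
                num = 0;
            }
            // Stop once more than 1000 seconds have elapsed.
            if ((System.currentTimeMillis() / 1000) - start > 1000) {
                return;
            }
        }
    }
}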
After packaging it as a jar, run it on the server: java -jar cpu-reaper.jar &
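Step 1: locate the problematic thread
Method a: the traditional way
1. Use top to find the process with the highest CPU usage
Run the top command to see all processes sorted by CPU usage and identify the culprit. In this example it is our java process. The PID column is the process ID. (If the column meanings are unclear, see the [Appendix].)
2. Use top -Hp pid to find the thread with the highest CPU usage
3. Use printf '0x%x' tid to convert the thread ID to hexadecimal
> printf '0x%x' 12817
> 0x3211
4. Use jstack pid | grep tid to find the thread's stack, where tid is the hexadecimal value from the previous step (it appears as the nid field in the dump)
> jstack 12816 | grep 0x3211 -A 30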
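The last two steps of method a (hex conversion and searching the dump) can be sketched in Java as follows. This is an illustrative sketch only: the class name NidFinder, its method names, and the sample dump text are hypothetical, and it assumes the dump prints nid in hexadecimal as in the example above, which may not hold for every JDK version.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: finds one thread's block in jstack-style text by its OS thread ID.
public class NidFinder {

    // Same conversion as: printf '0x%x' tid
    static String toNid(long tid) {
        return String.format("0x%x", tid);
    }

    // Returns the header line containing nid=<hex> plus the lines that follow,
    // up to (not including) the next blank line.
    static List<String> findThreadBlock(String dump, long tid) {
        // Trailing space prevents 0x3211 from matching 0x32110.
        String marker = "nid=" + toNid(tid) + " ";
        List<String> block = new ArrayList<>();
        boolean inBlock = false;
        for (String line : dump.split("\\R")) {
            if (line.contains(marker)) {
                inBlock = true;
            } else if (inBlock && line.trim().isEmpty()) {
                break;
            }
            if (inBlock) {
                block.add(line);
            }
        }
        return block;
    }

    public static void main(String[] args) {
        // Made-up two-thread dump in the classic jstack layout.
        String dump =
            "\"main\" #1 prio=5 os_prio=0 tid=0x00007f0c0c009800 nid=0x3211 runnable [0x00007f0c15a5e000]\n"
            + "   java.lang.Thread.State: RUNNABLE\n"
            + "\tat CpuReaper.cpuReaper(CpuReaper.java:20)\n"
            + "\n"
            + "\"VM Thread\" os_prio=0 tid=0x00007f0c0c07a000 nid=0x3214 runnable\n";
        System.out.println("looking for nid=" + toNid(12817));
        for (String line : findThreadBlock(dump, 12817)) {
            System.out.println(line);
        }
    }
}
```

Unlike grep -A 30, which prints a fixed number of lines, this stops at the blank line that separates threads, so the output contains exactly one thread's stack.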

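Side note:
For non-Java applications, perf can be used:
perf top -p <pid> (the -p option takes a process ID; use -t <tid> to target a single thread)
It can be installed with yum install perf.
perf is a tool that ships with the Linux kernel (2.6+; it was merged in 2.6.31), and lives under tools/perf in the kernel source tree.
perf builds on the kernel's tracing facilities. It can be used for live tracing and for counting events (perf stat), or for diagnosis by sampling (perf record) and then reporting (perf report|script|annotate).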
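Method b: show-busy-java-threads
This script comes from an open source project on GitHub (oldratlee/useful-scripts) that provides many useful scripts; show-busy-java-threads is one of them. It collapses the tedious steps of method a into a single command, as follows: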
> wget --no-check-certificate https://raw.github.com/oldratlee/useful-scripts/release-2.x/bin/show-busy-java-threads
> chmod +x show-busy-java-threads
> ./show-busy-java-threads
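For comparison, a similar "busiest threads" view can be approximated from inside a JVM with the standard ThreadMXBean API. The sketch below is not how show-busy-java-threads works (the script relies on top or ps together with jstack); it is an alternative under the assumption that you can run code in the target process. The class name BusyThreads, the 500 ms sampling window, and the top-5 limit are illustrative choices.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: ranks this JVM's threads by CPU time used during a short window.
public class BusyThreads {

    // Thread IDs present in both samples, sorted by CPU-time increase (largest first).
    static List<Long> topByDelta(Map<Long, Long> before, Map<Long, Long> after, int n) {
        final Map<Long, Long> delta = new HashMap<>();
        for (Map.Entry<Long, Long> e : after.entrySet()) {
            Long start = before.get(e.getKey());
            if (start != null) {
                delta.put(e.getKey(), e.getValue() - start);
            }
        }
        List<Long> ids = new ArrayList<>(delta.keySet());
        ids.sort((a, b) -> Long.compare(delta.get(b), delta.get(a)));
        return ids.subList(0, Math.min(n, ids.size()));
    }

    // Snapshot of CPU time in nanoseconds per live thread.
    static Map<Long, Long> sample(ThreadMXBean mx) {
        Map<Long, Long> cpu = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            long nanos = mx.getThreadCpuTime(id); // -1 if the thread has died
            if (nanos >= 0) {
                cpu.put(id, nanos);
            }
        }
        return cpu;
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            System.out.println("thread CPU time measurement is not available");
            return;
        }
        Map<Long, Long> before = sample(mx);
        Thread.sleep(500); // sampling window
        Map<Long, Long> after = sample(mx);
        for (long id : topByDelta(before, after, 5)) {
            ThreadInfo info = mx.getThreadInfo(id, 30); // up to 30 stack frames
            if (info == null) {
                continue; // thread ended after sampling
            }
            System.out.println(info.getThreadName() + " cpu delta ns="
                    + (after.get(id) - before.get(id)));
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```

Measuring the increase over a window, rather than total CPU time, mirrors the reason the script prefers top over ps: lifetime totals hide which thread is busy right now.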
The show-busy-java-threads script
#!/bin/bash
# @Function
# Find out the highest cpu consumed threads of java processes, and print the stack of these threads.
#
# @Usage
# $ ./show-busy-java-threads
#
# @online-doc https://github.com/oldratlee/useful-scripts/blob/dev-2.x/docs/java.md#-show-busy-java-threads
# @author Jerry Lee (oldratlee at gmail dot com)
# @author superhj1987 (superhj1987 at 126 dot com)
readonly PROG="`basename $0`"
readonly -a COMMAND_LINE=("$0" "$@")
# Get current user name via whoami command
# See https://www.lifewire.com/current-linux-user-whoami-command-3867579
# Because if run command by `sudo -u`, env var $USER is not rewritten/correct, just inherited from outside!
readonly USER="`whoami`"
################################################################################
# util functions
################################################################################
# NOTE: $'foo' is the escape sequence syntax of bash
readonly ec=$'\033' # escape char
readonly eend=$'\033[0m' # escape end
colorEcho() {
    local color=$1
    shift

    # if stdout is console, turn on color output.
    [ -t 1 ] && echo "$ec[1;${color}m$@$eend" || echo "$@"
}

colorPrint() {
    local color=$1
    shift

    colorEcho "$color" "$@"
    [ -n "$append_file" -a -w "$append_file" ] && echo "$@" >> "$append_file"
    [ -n "$store_dir" -a -w "$store_dir" ] && echo "$@" >> "${store_file_prefix}$PROG"
}

normalPrint() {
    echo "$@"

    [ -n "$append_file" -a -w "$append_file" ] && echo "$@" >> "$append_file"
    [ -n "$store_dir" -a -w "$store_dir" ] && echo "$@" >> "${store_file_prefix}$PROG"
}

redPrint() {
    colorPrint 31 "$@"
}

greenPrint() {
    colorPrint 32 "$@"
}

yellowPrint() {
    colorPrint 33 "$@"
}

bluePrint() {
    colorPrint 36 "$@"
}

die() {
    redPrint "Error: $@" 1>&2
    exit 1
}

logAndRun() {
    echo "$@"
    echo
    "$@"
}

logAndCat() {
    echo "$@"
    echo
    cat
}
usage() {
    local -r exit_code="$1"
    shift
    [ -n "$exit_code" -a "$exit_code" != 0 ] && local -r out=/dev/stderr || local -r out=/dev/stdout

    (( $# > 0 )) && { echo "$@"; echo; } > $out

    > $out cat <<EOF
Usage: ${PROG} [OPTION]... [delay [count]]
Find out the highest cpu consumed threads of java processes,
and print the stack of these threads.

Example:
  ${PROG}       # show busy java threads info
  ${PROG} 1     # update every 1 second, (stop by eg: CTRL+C)
  ${PROG} 3 10  # update every 3 seconds, update 10 times

Output control:
  -p, --pid <java pid>      find out the highest cpu consumed threads from
                            the specified java process.
                            default from all java process.
  -c, --count <num>         set the thread count to show, default is 5.
  -a, --append-file <file>  specifies the file to append output as log.
  -S, --store-dir <dir>     specifies the directory for storing
                            the intermediate files, and keep files.
                            default store intermediate files at tmp dir,
                            and auto remove after run. use this option to keep
                            files so as to review jstack/top/ps output later.
  delay                     the delay between updates in seconds.
  count                     the number of updates.
                            delay/count arguments imitates the style of
                            vmstat command.

jstack control:
  -s, --jstack-path <path>  specifies the path of jstack command.
  -F, --force               set jstack to force a thread dump. use when jstack
                            does not respond (process is hung).
  -m, --mix-native-frames   set jstack to print both java and native frames
                            (mixed mode).
  -l, --lock-info           set jstack with long listing.
                            prints additional information about locks.

CPU usage calculation control:
  -d, --top-delay           specifies the delay between top samples.
                            default is 0.5 (second). get thread cpu percentage
                            during this delay interval.
                            more info see top -d option. eg: -d 1 (1 second).
  -P, --use-ps              use ps command to find busy thread(cpu usage)
                            instead of top command.
                            default use top command, because cpu usage of
                            ps command is expressed as the percentage of
                            time spent running during the *entire lifetime*
                            of a process, this is not ideal in general.

Miscellaneous:
  -h, --help                display this help and exit.
EOF

    exit $exit_code
}
################################################################################
# Check os support
################################################################################
uname | grep '^Linux' -q || die "$PROG only support Linux, not support `uname` yet!"
################################################################################
# parse options
################################################################################
# NOTE: ARGS can not be declared as readonly!!
# readonly declaration make exit code of assignment to be always 0, aka. the exit code of `getopt` in subshell is discarded.
# tested on bash 4.2.46
ARGS=`getopt -n "$PROG" -a -o p:c:a:s:S:Pd:Fmlh -l count:,pid:,append-file:,jstack-path:,store-dir:,use-ps,top-delay:,force,mix-native-frames,lock-info,help --