My Life My Dream!

守信 求实 好学 力行
  博客园  :: 首页  :: 新随笔  :: 联系 :: 订阅 订阅  :: 管理

Java运行时问题分析利器--Arthas入门

Posted on 2022-12-06 14:18  召冠  阅读(178)  评论(0编辑  收藏  举报

Arthas 是一款线上监控诊断产品,通过全局视角实时查看应用 load、内存、gc、线程的状态信息,并能在不修改应用代码的情况下,对业务问题进行诊断,包括查看方法调用的出入参、异常,监测方法执行耗时,类加载信息等,大大提升线上问题排查效率。

https://arthas.aliyun.com/doc/

1. 启动arthas

在命令行下面执行(使用和目标进程一致的用户及jdk启动,否则可能attach失败):

java -jar arthas-boot.jar

  • 执行该程序的用户需要和目标进程具有相同的权限。比如以 admin 用户来执行: sudo su admin && java -jar arthas-boot.jar 或 sudo -u admin -EH java -jar arthas-boot.jar 。

  • 如果attach不上目标进程,可以查看 ~/logs/arthas/ 目录下的日志。

  • java -jar arthas-boot.jar -h 打印更多参数信息。

选择应用java进程:

$ java -jar arthas-boot.jar
[1]: 35542
[2]: 71560 math-game.jar

 

math-game 进程是第2个,则输入2,再输入 回车/enter 。Arthas会attach到目标进程上,并输出日志。

 

2. 通过stack命令来获取指定方法被调用的执行路径、堆栈

$ stack demo.MathGame primeFactors
Press Ctrl+C to abort.
Affect(class-cnt:1 , method-cnt:1) cost in 36 ms.
ts=2018-12-04 01:32:19;thread_name=main;id=1;is_daemon=false;priority=5;TCCL=sun.misc.Launcher$AppClassLoader@3d4eac69
    @demo.MathGame.run()
        at demo.MathGame.main(MathGame.java:16)

 

据条件表达式来过滤
$ stack demo.MathGame primeFactors 'params[0]<0' -n 2
Press Ctrl+C to abort.
Affect(class-cnt:1 , method-cnt:1) cost in 30 ms.
ts=2018-12-04 01:34:27;thread_name=main;id=1;is_daemon=false;priority=5;TCCL=sun.misc.Launcher$AppClassLoader@3d4eac69
    @demo.MathGame.run()
        at demo.MathGame.main(MathGame.java:16)
ts=2018-12-04 01:34:30;thread_name=main;id=1;is_daemon=false;priority=5;TCCL=sun.misc.Launcher$AppClassLoader@3d4eac69
    @demo.MathGame.run()
        at demo.MathGame.main(MathGame.java:16)
Command execution times exceed limit: 2, so command will exit. You can set it with -n option.

 

据执行时间来过滤
$ stack demo.MathGame primeFactors '#cost>5'
Press Ctrl+C to abort.
Affect(class-cnt:1 , method-cnt:1) cost in 35 ms.
ts=2018-12-04 01:35:58;thread_name=main;id=1;is_daemon=false;priority=5;TCCL=sun.misc.Launcher$AppClassLoader@3d4eac69
    @demo.MathGame.run()
        at demo.MathGame.main(MathGame.java:16)

 

3. watch  能观察到的范围为:返回值、抛出异常、入参,支持 OGNL 表达式进行变量的查看

通过watch命令来查看 demo.MathGame#primeFactors 函数的入参、属性值、返回值, x为深度,b调用前,s返回后,n监控次数

$ watch demo.MathGame primeFactors "{params,target,returnObj}" -x 2 -b -s -n 2
Press Ctrl+C to abort.
Affect(class-cnt:1 , method-cnt:1) cost in 46 ms.
ts=2018-12-03 19:29:54; [cost=0.01696ms] result=@ArrayList[
    @Object[][
        @Integer[1],
    ],
    @MathGame[
        random=@Random[java.util.Random@522b408a],
        illegalArgumentCount=@Integer[13038],
    ],
    null,
]ts=2018-12-03 19:29:54; [cost=4.277392ms] result=@ArrayList[
    @Object[][
        @Integer[1],
    ],
    @MathGame[
        random=@Random[java.util.Random@522b408a],
        illegalArgumentCount=@Integer[13038],
    ],
    @ArrayList[
        @Integer[2],
        @Integer[2],
        @Integer[2],
        @Integer[5],
        @Integer[5],
        @Integer[73],
        @Integer[241],
        @Integer[439],
    ],
]

 

4. trace 方法内部调用路径,并输出方法路径上的每个节点上耗时

$ trace demo.MathGame run
Press Q or Ctrl+C to abort.
Affect(class-cnt:1 , method-cnt:1) cost in 28 ms.
`---ts=2019-12-04 00:45:08;thread_name=main;id=1;is_daemon=false;priority=5;TCCL=sun.misc.Launcher$AppClassLoader@3d4eac69
    `---[0.617465ms] demo.MathGame:run()
        `---[0.078946ms] demo.MathGame:primeFactors() #24 [throws Exception]
`---ts=2019-12-04 00:45:09;thread_name=main;id=1;is_daemon=false;priority=5;TCCL=sun.misc.Launcher$AppClassLoader@3d4eac69
    `---[1.276874ms] demo.MathGame:run()
        `---[0.03752ms] demo.MathGame:primeFactors() #24 [throws Exception]

 

注意,trace 命令只会 trace 匹配到的函数里的子调用,并不会向下 trace 多层。因为 trace 是代价比较贵的,多层 trace 可能会导致最终要 trace 的类和函数非常多。

可以用正则表匹配路径上的多个类和函数,一定程度上达到多层 trace 的效果。

$ trace -E com.test.ClassA|org.test.ClassB   method1|method2|method3
[arthas@2236726]$ trace --skipJDKMethod false -E com.example.cloudbus.CloudBusApplication tkkk22|tkkk31|tkkk32|tkkk33
Press Q or Ctrl+C to abort.
Affect(class count: 1 , method count: 4) cost in 180 ms, listenerId: 1
`---ts=2023-07-14 11:41:00;thread_name=main;id=1;is_daemon=false;priority=5;TCCL=org.springframework.boot.loader.LaunchedURLClassLoader@7daf6ecc
    `---[17624.395668ms] com.example.cloudbus.CloudBusApplication:tkkk22()
        +---[0.053865ms] java.lang.System:currentTimeMillis() #30
        +---[min=0.01425ms,max=0.037236ms,total=0.308621ms,count=17] java.util.Random:<init>() #31
        +---[min=0.004275ms,max=0.021353ms,total=0.101463ms,count=17] java.util.Random:nextInt() #57
        +---[min=1087.430621ms,max=1155.758649ms,total=17622.798142ms,count=16] com.example.cloudbus.CloudBusApplication:tkkk31() #32
        |   `---[min=1087.425581ms,max=1155.731992ms,total=17622.68804ms,count=16] com.example.cloudbus.CloudBusApplication:tkkk31()
        |       +---[min=0.058004ms,max=0.170134ms,total=1.453502ms,count=16] java.io.PrintStream:println() #38
        |       `---[min=1087.314568ms,max=1155.563051ms,total=17621.076907ms,count=16] com.example.cloudbus.CloudBusApplication:tkkk32() #39
        |           `---[min=1087.309887ms,max=1155.537419ms,total=17620.966863ms,count=16] com.example.cloudbus.CloudBusApplication:tkkk32()
        |               +---[min=0.015758ms,max=0.056775ms,total=0.4743ms,count=16] java.io.PrintStream:println() #43
        |               `---[min=1.01807ms,max=9.912311ms,total=17578.455042ms,count=15984] com.example.cloudbus.CloudBusApplication:tkkk33() #46
        |                   `---[min=1.014606ms,max=9.900366ms,total=17483.734923ms,count=15984] com.example.cloudbus.CloudBusApplication:tkkk33()
        |                       `---[min=1.012647ms,max=9.895729ms,total=17417.478916ms,count=15984] java.lang.Thread:sleep() #56
        +---[0.008009ms] java.lang.StringBuilder:<init>() #34
        +---[min=0.002087ms,max=0.013849ms,total=0.026167ms,count=3] java.lang.StringBuilder:append() #57
        +---[0.002482ms] java.lang.System:currentTimeMillis() #57
        +---[0.004255ms] java.lang.StringBuilder:toString() #57
        `---[0.067332ms] java.io.PrintStream:println() #57
 

5. monitor  对匹配的类、方法、条件表达式进行监控(非实时)

-c 是统计周期内的执行次数、成功次数、失败次数、平均耗时、失败率, 直到用户输入 Ctrl+C 为止

$ monitor -c 5 demo.MathGame primeFactors
Press Ctrl+C to abort.
Affect(class-cnt:1 , method-cnt:1) cost in 94 ms.
timestamp            class          method        total  success  fail  avg-rt(ms)  fail-rate
-----------------------------------------------------------------------------------------------
2018-12-03 19:06:38  demo.MathGame  primeFactors  5      1        4     1.15        80.00%
timestamp            class          method        total  success  fail  avg-rt(ms)  fail-rate
-----------------------------------------------------------------------------------------------
2018-12-03 19:06:43  demo.MathGame  primeFactors  5      3        2     42.29       40.00%
timestamp            class          method        total  success  fail  avg-rt(ms)  fail-rate
-----------------------------------------------------------------------------------------------
2018-12-03 19:06:48  demo.MathGame  primeFactors  5      3        2     67.92       40.00%
timestamp            class          method        total  success  fail  avg-rt(ms)  fail-rate
-----------------------------------------------------------------------------------------------
2018-12-03 19:06:53  demo.MathGame  primeFactors  5      2        3     0.25        60.00%
timestamp            class          method        total  success  fail  avg-rt(ms)  fail-rate
-----------------------------------------------------------------------------------------------
2018-12-03 19:06:58  demo.MathGame  primeFactors  1      1        0     0.45        0.00%
timestamp            class          method        total  success  fail  avg-rt(ms)  fail-rate
-----------------------------------------------------------------------------------------------
2018-12-03 19:07:03  demo.MathGame  primeFactors  2      2        0     3182.72     0.00%

 

6. profiler  通过对应用不断的采样,然后把收集到的采样结果生成火焰图

$ profiler start
Started [cpu] profiling


$ profiler stop --format html
profiler output file: /tmp/test/arthas-output/20230517-111550.html

另外,profiler start 支持--event 参数,常用的就是cpu、alloc、wall,默认为cpu仅统计cpu消耗的采样, 而wall表示挂钟时间,包括running、sleeping、blocked时间,命令示例:
$ profiler start --event wall
Started [wall] profiling

$ profiler stop --format html

 

7.  stop 关闭 Arthas 服务端,所有 Arthas 客户端全部退出。

[arthas@16960]$ stop

    Resetting all enhanced classes ...
    Affect(class count: 0 , method count: 0) cost in 1 ms, listenerId: 0
    Arthas Server is going to shutdown...
   [arthas@16960]$ session (dd8ad595-1157-4728-8038-f3c29e46a4a8) is closed because server is going to shutdown.

[root@xxxx]$