记一次GC导致线上服务超时问题

1、现象

  2024-12-28 23点左右,线上其他服务请求 content-cache 出现批量超时。

  content-cache-03 机器内存使用率如下:

  机器配置:4核8G

    

  这里因为JVM参数设置为:

-Xms4g -Xmx4g -XX:MaxNewSize=1g

  所以达到42%时,内存的使用率已经达到了3.3G。

2、数据

  查看GC日志如下:

(1)CMS老年代回收期间没有触发新生代GC

2024-12-27T12:30:51.070+0800: 4745.252: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2206547K(3145728K)] 2268778K(4089472K), 0.0037549 secs] [Times: user=0.02 sys=0.00, real=0.00 secs] 
2024-12-27T12:30:51.075+0800: 4745.256: [CMS-concurrent-mark-start]
2024-12-27T12:30:51.206+0800: 4745.387: [CMS-concurrent-mark: 0.131/0.131 secs] [Times: user=0.43 sys=0.07, real=0.13 secs] 
2024-12-27T12:30:51.206+0800: 4745.388: [CMS-concurrent-preclean-start]
2024-12-27T12:30:51.214+0800: 4745.396: [CMS-concurrent-preclean: 0.008/0.008 secs] [Times: user=0.02 sys=0.01, real=0.01 secs] 
2024-12-27T12:30:51.215+0800: 4745.396: [CMS-concurrent-abortable-preclean-start]
2024-12-27T12:30:51.760+0800: 4745.941: [CMS-concurrent-abortable-preclean: 0.544/0.545 secs] [Times: user=1.22 sys=0.16, real=0.55 secs] 
2024-12-27T12:30:51.764+0800: 4745.945: [GC (CMS Final Remark) [YG occupancy: 521124 K (943744 K)]2024-12-27T12:30:51.764+0800: 4745.945: [GC (CMS Final Remark) 2024-12-27T12:30:51.764+0800: 4745.946: [ParNew: 521124K->28400K(943744K), 0.0095777 secs] 2727671K->2234947K(4089472K), 0.0103370 secs] [Times: user=0.07 sys=0.00, real=0.01 secs] 
2024-12-27T12:30:51.775+0800: 4745.956: [Rescan (parallel) , 0.0048798 secs]2024-12-27T12:30:51.779+0800: 4745.961: [weak refs processing, 0.0183285 secs]2024-12-27T12:30:51.798+0800: 4745.979: [class unloading, 0.0379042 secs]2024-12-27T12:30:51.836+0800: 4746.017: [scrub symbol table, 0.0131591 secs]2024-12-27T12:30:51.849+0800: 4746.030: [scrub string table, 0.0014270 secs][1 CMS-remark: 2206547K(3145728K)] 2234947K(4089472K), 0.1033068 secs] [Times: user=0.19 sys=0.00, real=0.10 secs] 
2024-12-27T12:30:51.868+0800: 4746.050: [CMS-concurrent-sweep-start]
2024-12-27T12:30:52.865+0800: 4747.046: [CMS-concurrent-sweep: 0.997/0.997 secs] [Times: user=2.50 sys=0.46, real=1.00 secs] 
2024-12-27T12:30:52.865+0800: 4747.047: [CMS-concurrent-reset-start]
2024-12-27T12:30:52.874+0800: 4747.056: [CMS-concurrent-reset: 0.005/0.009 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]

(2)CMS老年代回收期间发生一次新生代GC

2024-12-27T12:39:03.418+0800: 5237.599: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2221631K(3145728K)] 2326535K(4089472K), 0.0164690 secs] [Times: user=0.12 sys=0.00, real=0.02 secs] 
2024-12-27T12:39:03.435+0800: 5237.617: [CMS-concurrent-mark-start]
2024-12-27T12:39:03.659+0800: 5237.841: [CMS-concurrent-mark: 0.224/0.224 secs] [Times: user=1.14 sys=0.22, real=0.23 secs] 
2024-12-27T12:39:03.660+0800: 5237.842: [CMS-concurrent-preclean-start]
2024-12-27T12:39:03.681+0800: 5237.862: [CMS-concurrent-preclean: 0.020/0.021 secs] [Times: user=0.14 sys=0.01, real=0.02 secs] 
2024-12-27T12:39:03.681+0800: 5237.863: [CMS-concurrent-abortable-preclean-start]
2024-12-27T12:39:04.070+0800: 5238.251: [GC (Allocation Failure) 2024-12-27T12:39:04.070+0800: 5238.252: [ParNew: 943744K->85754K(943744K), 0.0289108 secs] 3165375K->2332721K(4089472K), 0.0298252 secs] [Times: user=0.16 sys=0.01, real=0.03 secs] 
2024-12-27T12:39:04.790+0800: 5238.972: [CMS-concurrent-abortable-preclean: 1.073/1.109 secs] [Times: user=2.95 sys=0.47, real=1.11 secs] 
2024-12-27T12:39:04.795+0800: 5238.976: [GC (CMS Final Remark) [YG occupancy: 556391 K (943744 K)]2024-12-27T12:39:04.795+0800: 5238.976: [GC (CMS Final Remark) 2024-12-27T12:39:04.795+0800: 5238.977: [ParNew: 556391K->33226K(943744K), 0.0142288 secs] 2803358K->2280193K(4089472K), 0.0149618 secs] [Times: user=0.09 sys=0.00, real=0.02 secs] 
2024-12-27T12:39:04.810+0800: 5238.992: [Rescan (parallel) , 0.0071785 secs]2024-12-27T12:39:04.817+0800: 5238.999: [weak refs processing, 0.0170401 secs]2024-12-27T12:39:04.834+0800: 5239.016: [class unloading, 0.0380791 secs]2024-12-27T12:39:04.872+0800: 5239.054: [scrub symbol table, 0.0136271 secs]2024-12-27T12:39:04.886+0800: 5239.068: [scrub string table, 0.0014743 secs][1 CMS-remark: 2246966K(3145728K)] 2280193K(4089472K), 0.1099299 secs] [Times: user=0.22 sys=0.00, real=0.11 secs] 
2024-12-27T12:39:04.906+0800: 5239.087: [CMS-concurrent-sweep-start]
2024-12-27T12:39:05.848+0800: 5240.030: [CMS-concurrent-sweep: 0.943/0.943 secs] [Times: user=2.13 sys=0.24, real=0.94 secs] 
2024-12-27T12:39:05.849+0800: 5240.030: [CMS-concurrent-reset-start]
2024-12-27T12:39:05.854+0800: 5240.035: [CMS-concurrent-reset: 0.005/0.005 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]

(3)CMS老年代回收期间发生两次新生代GC(造成延迟的一次)

2024-12-28T23:01:04.020+0800: 128958.201: [GC (Allocation Failure) 2024-12-28T23:01:04.020+0800: 128958.202: [ParNew: 922124K->104832K(943744K), 0.1259681 secs] 3117264K->2392877K(4089472K), 0.1268294 secs] [Times: user=0.73 sys=0.00, real=0.12 secs] 
2024-12-28T23:01:04.153+0800: 128958.335: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2288045K(3145728K)] 2398447K(4089472K), 0.0167679 secs] [Times: user=0.11 sys=0.00, real=0.02 secs] 
2024-12-28T23:01:04.171+0800: 128958.352: [CMS-concurrent-mark-start]
2024-12-28T23:01:04.480+0800: 128958.661: [GC (Allocation Failure) 2024-12-28T23:01:04.480+0800: 128958.662: [ParNew: 943744K->104832K(943744K), 0.3016805 secs] 3231789K->2686743K(4089472K), 0.3025824 secs] [Times: user=1.74 sys=0.00, real=0.30 secs] 
2024-12-28T23:01:05.050+0800: 128959.231: [GC (Allocation Failure) 2024-12-28T23:01:05.050+0800: 128959.232: [ParNew: 943744K->104832K(943744K), 0.2612079 secs] 3525655K->2930593K(4089472K), 0.2621099 secs] [Times: user=1.51 sys=0.00, real=0.26 secs] 
2024-12-28T23:01:05.505+0800: 128959.687: [CMS-concurrent-mark: 0.757/1.334 secs] [Times: user=8.24 sys=0.75, real=1.33 secs] 
2024-12-28T23:01:05.506+0800: 128959.687: [CMS-concurrent-preclean-start]
2024-12-28T23:01:06.108+0800: 128960.289: [CMS-concurrent-preclean: 0.589/0.602 secs] [Times: user=2.02 sys=0.60, real=0.60 secs] 
2024-12-28T23:01:06.108+0800: 128960.290: [CMS-concurrent-abortable-preclean-start]
2024-12-28T23:01:06.108+0800: 128960.290: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2024-12-28T23:01:06.113+0800: 128960.295: [GC (CMS Final Remark) [YG occupancy: 895656 K (943744 K)]2024-12-28T23:01:06.113+0800: 128960.295: [GC (CMS Final Remark) 2024-12-28T23:01:06.114+0800: 128960.295: [ParNew: 895656K->895656K(943744K), 0.0000377 secs] 3721417K->3721417K(4089472K), 0.0006760 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2024-12-28T23:01:06.114+0800: 128960.296: [Rescan (parallel) , 0.6955777 secs]2024-12-28T23:01:06.810+0800: 128960.991: [weak refs processing, 0.0174431 secs]2024-12-28T23:01:06.827+0800: 128961.009: [class unloading, 0.0391055 secs]2024-12-28T23:01:06.867+0800: 128961.048: [scrub symbol table, 0.0128982 secs]2024-12-28T23:01:06.879+0800: 128961.061: [scrub string table, 0.0014971 secs][1 CMS-remark: 2825761K(3145728K)] 3721417K(4089472K), 0.7845246 secs] [Times: user=5.51 sys=0.00, real=0.78 secs] 
2024-12-28T23:01:06.899+0800: 128961.080: [CMS-concurrent-sweep-start]
2024-12-28T23:01:11.163+0800: 128965.345: [CMS-concurrent-sweep: 2.170/4.264 secs] [Times: user=13.16 sys=1.91, real=4.27 secs] 
2024-12-28T23:01:11.173+0800: 128965.354: [GC (Allocation Failure) 2024-12-28T23:01:11.173+0800: 128965.355: [ParNew: 943744K->103600K(943744K), 0.0749256 secs] 2090310K->1292644K(4089472K), 0.0760357 secs] [Times: user=0.50 sys=0.00, real=0.07 secs] 
2024-12-28T23:01:13.551+0800: 128967.733: [CMS-concurrent-reset-start]
2024-12-28T23:01:13.692+0800: 128967.874: [CMS-concurrent-reset: 0.005/0.141 secs] [Times: user=0.19 sys=0.00, real=0.14 secs] 

ParNew:处理新生代GC

GC (Allocation Failure):触发新生代GC的原因是由新生代内存分配失败导致

943744K->104832K(943744K):新生代GC前内存大小943744K,回收后104832K,回收量 943744K - 104832K(注意,清空的这些一部分被销毁,一部分被转往了老年代)

3231789K->2686743K(4089472K):堆内存总体情况,回收前 3231789K,回收后 2686743K

0.3025824:耗时 0.3秒

 

CMS:CMS只处理老年代的垃圾回收

• Step1:初始标记(有STW)

  CMS Initial Mark:标记 GC Roots 可以直接到达的对象。

  2288045K(3145728K):当前老年代中存活对象的大小为2288045K / 当前老年代最大可用内存容量3145728K

  2398447K(4089472K):垃圾回收前总的内存使用量(老年代+新生代)/堆内存的总大小(包括年轻代和老年代)

  0.0167679 secs:此阶段耗时0.016毫秒

  Times: user=0.11 sys=0.00, real=0.02 secs:用户态、内核态、实际耗时

• Step2:并发标记

  CMS-concurrent-mark-start:并发标记,回收线程与用户线程并行执行。

• Step3:并发预处理与可中断的并发预处理

  -XX:-CMSPrecleaningEnabled 可通过参数设置是否开启,默认开启。

  这一步的主要目的是处理新生代进入老年代的对象,尽量减少最终标记的耗时。

  CMS-concurrent-preclean-start:并发预处理开始

  CMS-concurrent-preclean: 0.008/0.008 secs:并发预处理结束,耗时 0.008 秒

  CMS-concurrent-abortable-preclean-start:可中断的并发预处理开始

  CMS-concurrent-abortable-preclean: 0.000/0.000 secs:可中断的并发预处理结束,耗时为0

• Step4:最终标记(STW)

  CMS Final Remark) [YG occupancy: 895656 K (943744 K):最终标记阶段,此时新生代内存占用 895656 K,最大可用量 943744 K

  Rescan (parallel):并行扫描存活对象,确保标记准确性,耗时 0.6955777

  weak refs processing:处理弱引用,耗时 0.0174431 秒。

  class unloading:卸载未使用的类,释放元数据空间,耗时 0.0391055 秒。

  scrub symbol table:清理符号表,释放相关内存,耗时 0.0128982 秒。

  scrub string table:清理字符串表,释放无用的字符串,耗时 0.0014971

  CMS-remark: 2825761K(3145728K):当前老年代使用量 2825761K,最大可用量 3145728K

  堆内存使用总量:3721417 KB,总容量:4089472 KB

  3721417K(4089472K):当前总内存使用量 3721417K,最大可用量 4089472K

  0.7845246 secs:重新标记阶段总耗时(0.78 秒)

• Step5:并发清理

  CMS-concurrent-sweep-start:CMS 并发清理阶段开始。

  CMS-concurrent-sweep: 2.170/4.264 secs(清理任务的耗时/总耗时,包括线程调度等系统开销)

• Step6:并发重置

  重置CMS算法相关的内部数据,为下一次GC循环做准备。

  CMS-concurrent-reset-start:CMS 并发重置阶段开始。

  [CMS-concurrent-reset: 0.005/0.141 secs]:重置结束,耗时

3、分析

  GC对系统的影响主要发生在 【初始标记】和【最终标记】两个阶段,因为这两个阶段是 STW 的。

  数据1两阶段耗时:0.0037549 + 0.1033068 = 0.107秒

  数据2两阶段耗时:0.0164690 + 0.1099299 = 0.126秒

  数据3两阶段耗时:0.1268294 + 0.7845246 = 0.911秒

问题总结:在数据3中,可以明显看到,在并发标记阶段由于新生代内存使用量增长过快,导致发生了两次新生代GC,这就造成重复标记阶段耗时增加,达到了0.78秒。

  可能造成【最终标记】阶段耗时长的原因:

a. ​晋升失败(Promotion Failure)​

  • ​定义:在Young GC过程中,如果新生代中的存活对象无法晋升到老年代(可能因为老年代空间不足),就会发生晋升失败。
  • ​后果:晋升失败会导致Young GC无法完成,进而触发一次Full GC。这实际上意味着CMS Full GC需要处理更多的存活对象,因为新生代中本应晋升的对象仍然需要被标记和处理。

b. ​引用关系变化

  • ​并发标记期间的引用变更:在并发标记阶段,应用线程仍在运行,可能会修改对象的引用关系。CMS通过写屏障(Write Barrier)记录这些变更,以便在重新标记阶段处理。
  • ​Young GC引发的额外引用变更:如果在并发标记完成后发生Young GC,这可能会引入新的引用关系变化,特别是涉及新生代与老年代之间的引用。这些变化需要在重新标记阶段被处理,增加了重新标记的工作量。

c. ​对象存活率增加

  • ​存活对象增多:由于新生代中的对象未能被清理(因为Young GC触发了Full GC),更多的对象存活下来,需要在重新标记阶段被标记为存活。这增加了重新标记阶段需要处理的对象数量,从而延长了耗时。

d. ​Full GC的全面性

  • ​整个堆的扫描:Full GC需要扫描整个堆(包括新生代和老年代),而不仅仅是老年代。这意味着重新标记阶段需要处理更多的对象,增加了整体的耗时。
  • ​分代优势的丧失:分代垃圾收集器(如CMS)通过分代策略减少了每次GC需要处理的对象数量。然而,当Full GC触发时,这种分代优势被削弱,导致处理时间增加。

4、解决思路

  解决思路就是增加新生代的大小,同时为避免新生代一次回收量过大对老年代造成压力,老年代也增加,同时降低CMS的触发门槛(由70%—>60%)

  启动参数由:

java -jar -Xms4g -Xmx4g -XX:MaxNewSize=1g -XX:MaxPermSize=1g -XX:+AggressiveOpts -XX:+UseBiasedLocking -XX:+DisableExplicitGC -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -Xloggc:/home/glory/dz-content-cache/logs/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseCMSCompactAtFullCollection -XX:+CMSClassUnloadingEnabled -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSScavengeBeforeRemark -Dnetworkaddress.cache.ttl=60 dz-content-cache.jar --spring.profiles.active=online

  更新为:

java -jar -Xms6g -Xmx6g -XX:MaxNewSize=2g -XX:MaxPermSize=1g -XX:+AggressiveOpts -XX:+UseBiasedLocking -XX:+DisableExplicitGC -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=60 -Xloggc:/home/glory/dz-content-cache/logs/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseCMSCompactAtFullCollection -XX:+CMSClassUnloadingEnabled -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSScavengeBeforeRemark -Dnetworkaddress.cache.ttl=60 dz-content-cache.jar --spring.profiles.active=online

  主要调整:

  -Xms6g -Xmx6g -XX:MaxNewSize=2g XX:CMSInitiatingOccupancyFraction=60

 

posted @ 2025-03-31 21:01  jingyi_up  阅读(45)  评论(0)    收藏  举报