故障处理:troubleshooting gc buffer busy acquire

故障处理:troubleshooting gc buffer busy acquire

我们的文章会在微信公众号IT民工的龙马人生博客网站( www.htz.pw )同步更新 ,欢迎关注收藏,也欢迎大家转载,但是请在文章开始地方标注文章出处,谢谢!
由于博客中有大量代码,通过页面浏览效果更佳。

本文转自朋友的真实案例分享。

案例:troubleshooting gc buffer busy acquire

本案例来自西区某电力客户的3节点rac一体机,频繁的出现业务性能缓慢的问题。

awr基本信息如下:

Host Name	Platform	CPUs	Cores	Sockets	Memory (GB)
xxxxx	Linux x86 64-bit	192	96	8	504.63
Snap Id	Snap Time	Sessions	Cursors/Session	Instances
Begin Snap:	62201	02-Apr-22 08:00:08	1824	5.6	3
End Snap:	62205	02-Apr-22 10:00:07	2963	6.5	3
Elapsed:		119.99 (mins)			
DB Time:		45,761.81 (mins)			
Top 10 Foreground Events by Total Wait Time


Event	Waits	Total Wait Time (sec)	Wait Avg(ms)	% DB time	Wait Class
gc buffer busy acquire	13,897,498	589.2K	42	21.5	Cluster
enq: US – contention	175,549	425K	2421	15.5	Other
gc cr grant 2-way	14,038,191	337.2K	24	12.3	Cluster
enq: TX – index contention	87,093	172.4K	1979	6.3	Concurrency
row cache lock	473,144	157K	332	5.7	Concurrency
gc cr multi block request	1,855,834	138.6K	75	5.0	Cluster
DB CPU		133.2K		4.9	
enq: TA – contention	968	111.9K	115571	4.1	Other
buffer busy waits	259,371	103.3K	398	3.8	Concurrency
gc buffer busy release	210,472	100.3K	476	3.7	Cluster

可以看到大量的gc相关event等待,并且延迟也很高。通常分析gc问题我都习惯于去看看awr的RAC Statistics

Global Cache Load Profile

Per Second	Per Transaction
Global Cache blocks received:	706.43	17.77
Global Cache blocks served:	895.22	22.52
GCS/GES messages received:	12,705.21	319.57
GCS/GES messages sent:	15,198.13	382.27
DBWR Fusion writes:	43.44	1.09
Estd Interconnect traffic (KB)	18,263.06	
Global Cache Efficiency Percentages (Target local+remote 100%)

Buffer access – local cache %:	99.10
Buffer access – remote cache %:	0.07
Buffer access – disk %:	0.83
Global Cache and Enqueue Services – Workload Characteristics

Avg global enqueue get time (ms):	19.9
Avg global cache cr block receive time (ms):	25.2
Avg global cache current block receive time (ms):	71.3
Avg global cache cr block build time (ms):	0.0
Avg global cache cr block send time (ms):	0.0
Global cache log flushes for cr blocks served %:	7.4
Avg global cache cr block flush time (ms):	2.5
Avg global cache current block pin time (ms):	1.7
Avg global cache current block send time (ms):	0.0
Global cache log flushes for current blocks served %:	6.0
Avg global cache current block flush time (ms):	2.8
Global Cache and Enqueue Services – Messaging Statistics

Avg message sent queue time (ms):	0.0
Avg message sent queue time on ksxp (ms):	111.0
Avg message received queue time (ms):	0.0
Avg GCS message process time (ms):	0.0
Avg GES message process time (ms):	0.1
% of direct sent messages:	33.27
% of indirect sent messages:	62.49
% of flow controlled messages:	4.24

私网流量18M/s,对于一体机的ib来说并不算什么,看到Avg message sent queue time on ksxp (ms)延迟非常高,awr就基本不用看了可以定位私网性能出现了严重的问题。

Avg message sent queue time on ksxp The time when the peer receives the message and returns ACK. This indicator is very important and directly reflects the network delay, generally less than 1ms

Interconnect Device Statistics
Throughput and errors of interconnect devices (at OS level)
All throughput numbers are megabytes per second
Device Name	IP Address	Public	Source	Send Mbytes/sec	Send Errors	Send Dropped	Send Buffer Overrun	Send Carrier Lost	Receive Mbytes/sec	Receive Errors	Receive Dropped	Receive Buffer Overrun	Receive Frame Errors
bond0	10.xxx.xxx.x		OS dependent software	16.28	0	0	0	0	12.75	0	0	37,782	0

通过awr私网设备的统计信息可以看到Receive Buffer Overrun很高,本来以为又是udp receive buffer不足的问题,正准备结案时,再仔细一看发现了一个奇怪的现象,db层面的私网设备居然是bond0,并不是我们正常的一体机的ib,ip也不是私网ip,而是public network。

这是怎么回事呢?

继续排查发现,集群层面通信使用的都是ib网卡上的haip,都是正确的,但是db实例使用的是public network,在排除db实例设置了参数cluster_interconnects之后,可以去看看alert日志,当db实例启动时会输出私网的ip和协议,从中我们发现故事得从20年的8月29日说起。。。

Sat Aug 29 00:42:55 2020
Cluster communication is configured to use the following interface(s) for this instance
  169.254.104.231
  169.254.209.215
cluster interconnect IPC version:Oracle UDP/IP (generic)
 
Sat Aug 29 01:06:19 2020
Cluster communication is configured to use the following interface(s) for this instance
  10.185.241.176
cluster interconnect IPC version:Oracle UDP/IP (generic)

可以看到20年8月29日短时间内重启了两次db实例,第二次启动时,私网的ip发生了改变。继续分析

2020-08-29 01:06:12.798: 
[USER(74999)]CRS-2316:Fatal error: cannot initialize GPnP, CLSGPNP_ERR (Generic GPnP error). 
kggpnpInit: failed to init gpnp
  WARNING: No cluster interconnect has been specified. Depending on
           the communication driver configured Oracle cluster traffic
           may be directed to the public interface of this machine.
           Oracle recommends that RAC clustered databases be configured
           with a private interconnect for enhanced security and
           performance.

看到了一段核心错误,无法初始化gpnp从而导致从gpnp profile中获取公网私网信息失败。

在查看db的patch时,发现db打patch的时间与私网第一次发生异常的时间点非常吻合

Interim patches (2) :
 
Patch  31103343     : applied on Sat Aug 29 01:05:03 CST 2020
Unique Patch ID:  23687560
Patch description:  "Database Patch Set Update : 11.2.0.4.200714 (31103343)"
查阅mos文档,有一篇比较吻合,11.2.0.4 RAC Database Instance Fails to Start with “No connectivity to other instances in the cluster during startup” After Applying 11.2.0.4 DB PSU (Doc ID 2471441.1)

CHANGES
DB PSU only was applied to 11.2.0.4 db home
 
CAUSE
For a database instance running in a RAC cluster, both DB PSU component and OCW PSU component of the GI PSU must be applied.
 
In this case, only the DB PSU component was applied to the DB home without applying the OCW PSU component of the GI PSU.
 
Verify using "opatch lsniventory" command on the database home.

很巧,本案例也是只打了db patch,并没有打gi的patch。

在实际的生产维护中,经常会看到gi和dbpsu不一致、gi和db版本不一致等等奇葩操作,强烈建议使用同版本同patch的gi和db软件,这会减少很多不必要的故障和麻烦。

------------------作者介绍-----------------------
姓名:黄廷忠
现就职:Oracle中国高级服务团队
曾就职:OceanBase、云和恩墨、东方龙马等
电话、微信、QQ:18081072613
个人博客: (http://www.htz.pw)
CSDN地址: (https://blog.csdn.net/wwwhtzpw)
博客园地址: (https://www.cnblogs.com/www-htz-pw)

posted @ 2025-07-15 17:26  认真就输  阅读(28)  评论(0)    收藏  举报