Troubleshooting: A Hung CSSD Process Causes a RAC Node Reboot

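Our articles are published in parallel on the WeChat official account IT民工的龙马人生 and on the blog site ( www.htz.pw ). You are welcome to follow, bookmark, and repost them, but please credit the source at the top of any repost. Thank you!
This post contains a lot of code and log output, so it reads best on the web page.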

The test below simulates a hung ocssd.bin process on a host and shows how that leads to a host reboot.

1. Environment

[root@cisser2 ~]# crsctl query crs activeversion 
CRS active version on the cluster is [10.2.0.5.0] 
[root@cisser2 ~]# lsb_release -a 
LSB Version:    :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch 
Distributor ID: RedHatEnterpriseServer 
Description:    Red Hat Enterprise Linux Server release 5.11 (Tikanga) 
Release:        5.11 
Codename:       Tikanga 

2. Manually suspending the OCSSD process

[root@cisser1 ~]# ps -ef|grep d.bin 
oracle    4666  4665  0 11:23 ?        00:00:00 /oracle/app/oracle/product/10.2.0/crs_1/bin/evmd.bin 
root      4746  4063  0 11:23 ?        00:00:00 /oracle/app/oracle/product/10.2.0/crs_1/bin/crsd.bin reboot 
root      5258  4874  0 11:23 ?        00:00:00 /oracle/app/oracle/product/10.2.0/crs_1/bin/oprocd.bin run -t 1000 -m 500 -f 
oracle    5348  4903  0 11:23 ?        00:00:00 /oracle/app/oracle/product/10.2.0/crs_1/bin/ocssd.bin 
root     14621 13718  0 11:44 pts/1    00:00:00 grep d.bin 
[root@cisser1 ~]# kill -19 5348 
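
On Linux x86, signal 19 is SIGSTOP, so `kill -19` freezes the process without terminating it, which is what makes it look hung to its peers. As a minimal sketch, the same effect can be tried safely on a throwaway `sleep` process instead of `ocssd.bin`. The helper name `proc_state` is made up for this illustration, and the sketch assumes a Linux `/proc` filesystem.

```shell
# Print the one-letter state of a PID from /proc (S = sleeping, T = stopped).
# Assumes a Linux /proc filesystem; proc_state is a hypothetical helper.
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

sleep 300 &                # harmless stand-in for ocssd.bin
pid=$!

kill -STOP "$pid"          # same effect as "kill -19" on Linux x86
sleep 1                    # give the kernel a moment to apply the stop
stopped=$(proc_state "$pid")

kill -CONT "$pid"          # resume the process
sleep 1
resumed=$(proc_state "$pid")

kill "$pid"                # clean up the stand-in process
wait "$pid" 2>/dev/null || true

echo "stopped=$stopped resumed=$resumed"
```

Using the signal name (`-STOP`) rather than the number avoids depending on platform-specific signal numbering. Never run this against `ocssd.bin` on a production cluster; as the rest of this post shows, the node reboots.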

About 30 seconds later, the following log entries start to appear.

3. Host messages log

Mar 29 11:45:23 cisser1 logger: Oracle clsomon failed with fatal status 13. 

The status reported here is 13. Different status codes point to different failure causes; the common codes are listed below.

  /* 10-39 are reserved for various kinds of steady state errors 
   * i.e. anything that comes after the group registration. 
   */ 
  clssomonretMEM      = 11,  /* memory allocation failure */ 
  clssomonretCSS      = 12,  /* misc error in CSS layer */ 
  clssomonretFATAL    = 13,  /* failure in CSS layer that should cause a reboot*/ 
  clssomonretOCR      = 14,  /* misc error in OCR layer */ 
  clssomonretOSD      = 15,  /* error in OSD layer used by generic code*/ 
  /* 40-69 are reserved for various kinds of initialization errors 
   * i.e. anything that comes before the group registration. 
   */ 
  clssomonretCRSHOME  = 40,  /* CRS home is unavailable. */ 
  clssomonretHOSTNAME = 42,  /* unable to fetch hostname */ 
  clssomonretSTDERR   = 43,  /* failure redirecting stderr */ 
  clssomonretSTDOUT   = 44,  /* failure redirecting stdout */ 
  clssomonretCHDIR    = 45,  /* failure redirecting corefile */ 
  clssomonretARGS     = 50,  /* error processing arguments */ 
  clssomonretCSSINIT  = 51,  /* failure initializing CSS-objects/APIs */ 
  clssomonretOCRINIT  = 52,  /* failure initializing OCR-objects/APIs */ 
  clssomonretOSDINIT  = 53,  /* error in OSD layer used by generic code*/ 
  clssomonretMEMINIT  = 54,  /* unable to allocate memory during init */ 
  clssomonretREINIT   = 55,  /* exceeded the CSS context reinit limit */ 
  clssomonretINUSE    = 56,  /* duplicate oclsomon found */ 
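
To tie the syslog line back to this listing, here is a small sketch that extracts the status number from a messages entry and looks up its meaning. The function names `clsomon_status_desc` and `extract_clsomon_status` are hypothetical helpers written for this post, not Oracle tools, and only the steady-state codes 11 to 15 are mapped; the initialization codes could be added the same way.

```shell
# Map an oclsomon status code to the description from the listing above.
# Only the steady-state codes (11-15) are covered in this sketch.
clsomon_status_desc() {
    case "$1" in
        11) echo "memory allocation failure" ;;
        12) echo "misc error in CSS layer" ;;
        13) echo "failure in CSS layer that should cause a reboot" ;;
        14) echo "misc error in OCR layer" ;;
        15) echo "error in OSD layer used by generic code" ;;
        *)  echo "not mapped in this sketch" ;;
    esac
}

# Read syslog lines on stdin and print the status number, if present.
extract_clsomon_status() {
    sed -n 's/.*clsomon failed with fatal status \([0-9][0-9]*\)\..*/\1/p'
}

# The messages entry from section 3.
line='Mar 29 11:45:23 cisser1 logger: Oracle clsomon failed with fatal status 13. '
code=$(printf '%s\n' "$line" | extract_clsomon_status)
echo "status $code: $(clsomon_status_desc "$code")"
```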

4. ocssd and oclsomon logs

Because the ocssd process is suspended, it writes nothing to its own log.
The oclsomon log shows the following:

[root@cisser1 ~]#  tail -f  /oracle/app/oracle/product/10.2.0/crs_1/log/cisser1/cssd/oclsomon/oclsomon.log 
2015-03-29 11:23:05.781: clsc_connect: (0x8467d70) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_cisser1_)) 
2015-03-29 11:23:08.435: clsc_connect: (0x8466450) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_cisser1_)) 
2015-03-29 11:23:15.684 clssomon: end of cssinit, status 0 
2015-03-29 11:23:15.685 Reconfig event. (1/1/1) 
2015-03-29 11:23:16.186 Reconfig event. (2/2/1) 
2015-03-29 11:45:23.250 clssomon: Timeout waiting for CSS response. 

5. Logs on node 2

5.1 ocssd log

[    CSSD]2015-03-29 11:44:55.852 [633092416] >WARNING: clssnmPollingThread: node cisser1 (1) at 50% heartbeat fatal, eviction in 29.810 seconds seedhbimpd 0 
[    CSSD]2015-03-29 11:44:55.852 [633092416] >TRACE:   clssnmPollingThread: node cisser1 (1) is impending reconfig, flag 1039, misstime 30190 
[    CSSD]2015-03-29 11:44:55.852 [633092416] >TRACE:   clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1) 
[    CSSD]2015-03-29 11:44:56.854 [633092416] >WARNING: clssnmPollingThread: node cisser1 (1) at 50% heartbeat fatal, eviction in 28.810 seconds seedhbimpd 1 
[    CSSD]2015-03-29 11:44:58.709 [643582272] >TRACE:   clssnmSendingThread: sending status msg to all nodes 
[    CSSD]2015-03-29 11:44:58.709 [643582272] >TRACE:   clssnmSendingThread: sent 5 status msgs to all nodes 
[    CSSD]2015-03-29 11:45:02.716 [643582272] >TRACE:   clssnmSendingThread: sending status msg to all nodes 
[    CSSD]2015-03-29 11:45:02.716 [643582272] >TRACE:   clssnmSendingThread: sent 4 status msgs to all nodes 
[    CSSD]2015-03-29 11:45:07.725 [643582272] >TRACE:   clssnmSendingThread: sending status msg to all nodes 
[    CSSD]2015-03-29 11:45:07.725 [643582272] >TRACE:   clssnmSendingThread: sent 5 status msgs to all nodes 
[    CSSD]2015-03-29 11:45:10.855 [633092416] >WARNING: clssnmPollingThread: node cisser1 (1) at 75% heartbeat fatal, eviction in 14.810 seconds seedhbimpd 1 
[    CSSD]2015-03-29 11:45:11.857 [633092416] >WARNING: clssnmPollingThread: node cisser1 (1) at 75% heartbeat fatal, eviction in 13.810 seconds seedhbimpd 1 
[    CSSD]2015-03-29 11:45:12.733 [643582272] >TRACE:   clssnmSendingThread: sending status msg to all nodes 
[    CSSD]2015-03-29 11:45:12.733 [643582272] >TRACE:   clssnmSendingThread: sent 5 status msgs to all nodes 
[    CSSD]2015-03-29 11:45:16.738 [643582272] >TRACE:   clssnmSendingThread: sending status msg to all nodes 
[    CSSD]2015-03-29 11:45:16.738 [643582272] >TRACE:   clssnmSendingThread: sent 4 status msgs to all nodes 
[    CSSD]2015-03-29 11:45:19.858 [633092416] >WARNING: clssnmPollingThread: node cisser1 (1) at 90% heartbeat fatal, eviction in 5.810 seconds seedhbimpd 1 
[    CSSD]2015-03-29 11:45:20.744 [643582272] >TRACE:   clssnmSendingThread: sending status msg to all nodes 
[    CSSD]2015-03-29 11:45:20.744 [643582272] >TRACE:   clssnmSendingThread: sent 4 status msgs to all nodes 
[    CSSD]2015-03-29 11:45:20.859 [633092416] >WARNING: clssnmPollingThread: node cisser1 (1) at 90% heartbeat fatal, eviction in 4.810 seconds seedhbimpd 1 
[    CSSD]2015-03-29 11:45:21.860 [633092416] >WARNING: clssnmPollingThread: node cisser1 (1) at 90% heartbeat fatal, eviction in 3.810 seconds seedhbimpd 1 
[    CSSD]2015-03-29 11:45:22.673 [579823936] >TRACE:   clssgmAllocateRPCIndex: allocated rpc 326 (0x2b7b1f5a1310) 
[    CSSD]2015-03-29 11:45:22.673 [579823936] >TRACE:   clssgmRPC: rpc 0x2b7b1f5a1310 (RPC#326) tag(146002a) sent to node 1 
[    CSSD]2015-03-29 11:45:22.861 [633092416] >WARNING: clssnmPollingThread: node cisser1 (1) at 90% heartbeat fatal, eviction in 2.810 seconds seedhbimpd 1 
[    CSSD]2015-03-29 11:45:23.863 [633092416] >WARNING: clssnmPollingThread: node cisser1 (1) at 90% heartbeat fatal, eviction in 1.800 seconds seedhbimpd 1 
[    CSSD]2015-03-29 11:45:24.855 [633092416] >WARNING: clssnmPollingThread: node cisser1 (1) at 90% heartbeat fatal, eviction in 0.810 seconds seedhbimpd 1 
[    CSSD]2015-03-29 11:45:25.667 [633092416] >TRACE:   clssnmPollingThread: Eviction started for node cisser1 (1), flags 0x040f, state 3, wt4c 0 seedhbimpd 1 
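
The numbers in these warnings can be cross-checked with a little arithmetic. In the first WARNING/TRACE pair, the misstime of 30190 ms plus the announced 29.810 seconds to eviction adds up to 60 seconds, which would be consistent with a heartbeat timeout (CSS misscount) of 60 seconds. That value is inferred from the log here, not read from the cluster configuration. The sketch below, using hypothetical helper names, also shows that the 50%, 75%, and 90% warnings line up with roughly 30, 15, and 6 seconds remaining.

```shell
# misstime (ms) plus the announced time to eviction (s) gives the total timeout.
total_timeout() {
    awk -v ms="$1" -v left="$2" 'BEGIN { printf "%.3f\n", ms / 1000 + left }'
}

# Seconds left before eviction once a given percentage of the timeout has elapsed.
# The timeout passed in is an assumed value inferred from the log above.
remaining_at_pct() {
    awk -v m="$1" -v p="$2" 'BEGIN { printf "%.1f\n", m * (100 - p) / 100 }'
}

total_timeout 30190 29.810       # values from the 11:44:55.852 entries
for pct in 50 75 90; do
    remaining_at_pct 60 "$pct"   # compare with 29.810, 14.810 and 5.810 in the log
done
```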

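The node 2 log shows that eviction of node 1 started at 11:45:25, but node 1 had already begun rebooting at 11:45:23. The reboot was therefore triggered by the oclsomon process on node 1, not by node eviction.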
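
The ordering of the two events can be checked mechanically. The sketch below is only an illustration: it takes the oclsomon timeout line from node 1 and the eviction line from node 2 (shortened here), converts their timestamps to milliseconds since midnight, and compares them. The helper names `log_time` and `to_ms` are made up for this example, and it assumes both nodes' clocks are synchronized and both events fall on the same day.

```shell
# Extract "HH:MM:SS.mmm" from a line containing "YYYY-MM-DD HH:MM:SS.mmm".
log_time() {
    sed -n 's/.*[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\} \([0-9:]\{8\}\.[0-9]\{3\}\).*/\1/p'
}

# Convert "HH:MM:SS.mmm" on stdin to milliseconds since midnight.
to_ms() {
    awk -F'[:.]' '{ print (($1 * 60 + $2) * 60 + $3) * 1000 + $4 }'
}

# Lines taken from the oclsomon log (node 1) and the ocssd log (node 2).
timeout_line='2015-03-29 11:45:23.250 clssomon: Timeout waiting for CSS response.'
evict_line='[    CSSD]2015-03-29 11:45:25.667 [633092416] >TRACE:   clssnmPollingThread: Eviction started for node cisser1 (1)'

timeout_ms=$(printf '%s\n' "$timeout_line" | log_time | to_ms)
evict_ms=$(printf '%s\n' "$evict_line" | log_time | to_ms)
gap=$((evict_ms - timeout_ms))

if [ "$gap" -gt 0 ]; then
    echo "oclsomon timeout came ${gap} ms before eviction started"
else
    echo "eviction started first"
fi
```

A positive gap means the local oclsomon timeout fired before the remote node began eviction, which supports the conclusion above.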

------------------About the author-----------------------
Name: 黄廷忠
Currently at: Oracle China, advanced services team
Previously at: OceanBase, 云和恩墨, 东方龙马, and others
Phone, WeChat, QQ: 18081072613
Personal blog: (http://www.htz.pw)
CSDN: (https://blog.csdn.net/wwwhtzpw)
cnblogs (博客园): (https://www.cnblogs.com/www-htz-pw)

posted @ 2025-07-02 14:15  认真就输