A GI PSU upgrade that failed because of an operator error - Ivo落班



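When a GI PSU is installed with opatch auto, Oracle requires patching one node at a time. During last night's upgrade I mistakenly ran the GI PSU installation on both nodes at the same time, and after the patch finished CRS could not be started.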

I chose to roll the patch back, only to discover that the rollback requires CRS to be running. With no other option, I had to restore from the backup files.

However, through carelessness I had not used the root user when archiving the $oracle_home and $grid_home directories, so some files were missing from the backup.

For future PSUs of this kind, there are two points to remember:

First, always patch one node at a time.

Second, always archive $grid_home, $oracle_home, and the oraInventory directory as root. These backups are rarely needed, but when they are, they are the last resort.
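The second point can be sketched as a small shell helper. This is only a minimal sketch: `backup_home`, `backup_all_as_root`, the `ORA_INVENTORY` variable, and the archive names are hypothetical and not part of any Oracle tooling, and it assumes GNU tar on Linux.

```shell
#!/bin/sh
# backup_home (hypothetical helper): archive one software home into a tar.gz.
# tar stores each file's mode bits in the archive; files the calling user
# cannot read are skipped with an error, which is why root matters here.
backup_home() {
    src=$1      # directory to archive, e.g. "$GRID_HOME"
    dest=$2     # archive to create, e.g. /backup/grid_home.tar.gz
    [ -d "$src" ] || { echo "not a directory: $src" >&2; return 1; }
    # -C makes the archive relative to the parent directory, so a restore
    # recreates the home under its original directory name.
    tar -czf "$dest" -C "$(dirname "$src")" "$(basename "$src")"
}

# backup_all_as_root (hypothetical helper): refuse to run unless root, then
# archive the grid home, the database home, and the inventory.
# ORA_INVENTORY is an assumed variable holding the oraInventory path.
backup_all_as_root() {
    target_dir=$1
    [ "$(id -u)" -eq 0 ] || { echo "run this as root" >&2; return 1; }
    backup_home "$GRID_HOME" "$target_dir/grid_home.tar.gz" &&
    backup_home "$ORACLE_HOME" "$target_dir/oracle_home.tar.gz" &&
    backup_home "$ORA_INVENTORY" "$target_dir/oraInventory.tar.gz"
}
```

When restoring, extracting as root is what brings back ownership and the setuid/setgid bits; an extraction by an ordinary user may not.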

Partial notes follow.

At 23:20 I noticed that node two had started installing the GI PSU while node one had not yet finished; this was most likely my own mistake.

Around 23:40, node one and then node two finished patching, and the following message appeared:

Oracle Grid Infrastructure stack start initiated but failed to complete at /tmp/p17735354_112030_Linux-x86-64/17592127/files/crs/install/crsconfig_lib.pm line 11645.

opatch failed while running the following script: rootcrs.pl -patch

CRS failed to start. The corresponding ohasd log showed the errors below, and searching on them led to these documents:

Unable to start OHASD after apply of PSU patch. CLSU-00103: error location: usrgetgrp12 [1562797.1]

Unable to start CRS after applying Grid Infrastructure Patch [1200582.1]

The problems described there are similar, and the recommendation is to roll back using the software backup.

Error log:

ESOURCES] to : []

2014-06-18 23:43:59.284: [ CRSOCR][1143286080] {0:0:2} Multi Write Batch processing...

2014-06-18 23:43:59.284: [ AGFW][1141184832] {0:0:2} Agfw Proxy Server received the message: CMD_COMPLETED[Proxy] ID 20482:71

2014-06-18 23:43:59.284: [ AGFW][1141184832] {0:0:2} Agfw Proxy Server replying to the message: CMD_COMPLETED[Proxy] ID 20482:71

2014-06-18 23:43:59.286: [UiServer][1153792320] {0:0:4} processMessage called

2014-06-18 23:43:59.286: [UiServer][1153792320] {0:0:4} Sending message to PE. ctx= 0x2aaab00297c0, Client PID: 27058

2014-06-18 23:43:59.286: [UiServer][1153792320] {0:0:4} Sending command to PE: 2

2014-06-18 23:43:59.287: [ CRSPE][1151691072] {0:0:4} Processing PE command id=103. Description: [Stat Resource : 0x2aaaac0ec830]

2014-06-18 23:43:59.287: [ CRSPE][1151691072] {0:0:4} Expression Filter : (((NAME == ora.crsd) OR (NAME == ora.cssd)) OR (NAME == ora.evmd))

2014-06-18 23:43:59.289: [UiServer][1153792320] {0:0:4} Done for ctx=0x2aaab00297c0

2014-06-18 23:43:59.297: [UiServer][1155893568] Closed: remote end failed/disc.

2014-06-18 23:43:59.400: [ CRSOCR][1143286080] {0:0:2} Multi Write Batch done.

2014-06-18 23:43:59.400: [ CRSPE][1151691072] {0:0:2} Resource Autostart completed for gdgz-ps-tszc-db04-x3950

2014-06-18 23:43:59.455: [UiServer][1155893568] CS(0x2aaab002ba70)set Properties ( root,0x7f43170)

2014-06-18 23:43:59.455: [UiServer][1155893568] SS(0x7ebdc60)Accepted client connection: saddr =(ADDRESS=(PROTOCOL=ipc)(DEV=639)(KEY=OHASD_UI_SOCKET))daddr = (ADDRESS=(PROTOCOL=ipc)(KEY=OHASD_UI_SOCKET))

2014-06-18 23:43:59.466: [UiServer][1153792320] {0:0:5} processMessage called

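At 01:10 I started restoring from the software backup. By 01:40 the software on node one was back to normal, while node two was responding slowly.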

Starting the database on node one produced the following errors:

startup

ORA-01078: failure in processing system parameters

ORA-01565: error in identifying file '+DATA/oltp/spfileoltp.ora'

ORA-17503: ksfdopn:2 failed to open file +DATA/oltp/spfileoltp.ora

ORA-12547: TNS: lost contact

ORA-12537 / ORA-12547 or TNS-12518 if Listener (including SCAN Listener) and Database are Owned by Different OS User (Doc ID 1069517.1)

Based on that document, the cause was wrong permissions on $GRID_HOME/bin/oracle.

The correct mode is -rwsr-s--x, but on this system it was -rwxr-x--x.

After fixing the permissions with chmod 6751 oracle, the startup succeeded.
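A quick way to spot this kind of drift after a restore is to compare the octal mode of the binary with the expected value (-rwsr-s--x is 6751, -rwxr-x--x is 751). The sketch below assumes GNU stat on Linux; `has_mode` is a hypothetical helper name.

```shell
#!/bin/sh
# has_mode (hypothetical helper): succeed when FILE's octal mode, as printed
# by GNU stat, equals EXPECTED. Returns 2 if the file cannot be examined.
has_mode() {
    file=$1
    expected=$2
    actual=$(stat -c '%a' "$file") || return 2
    [ "$actual" = "$expected" ]
}

# Example call, using the path and mode from this post:
#   has_mode "$GRID_HOME/bin/oracle" 6751 || echo "unexpected mode on the oracle binary"
```

The same check applies to $ORACLE_HOME/bin/oracle after a restore.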

At about 03:30 the application team tried to connect and ran into two errors:

ORA-03113 and ORA-12516

I then restarted node one, and it would not come back up.

The corresponding ohasd log:

2014-06-19 04:41:40.708: [ default][616989264] OHASD Daemon Starting. Command string :restart

2014-06-19 04:41:40.713: [ default][616989264] Initializing OLR

2014-06-19 04:41:40.714: [ OCROSD][616989264]utopen:6m': failed in stat OCR file/disk /oracle/app/11.2.0.3/grid/cdata/gdgz-ps-tszc-db03-x3950.olr, errno=2, os err string=No such file or directory

2014-06-19 04:41:40.714: [ OCROSD][616989264]utopen:7: failed to open any OCR file/disk, errno=2, os err string=No such file or directory

2014-06-19 04:41:40.714: [ OCRRAW][616989264]proprinit: Could not open raw device

2014-06-19 04:41:40.714: [ OCRAPI][616989264]a_init:16!: Backend init unsuccessful : [26]

2014-06-19 04:41:40.714: [ CRSOCR][616989264] OCR context init failure. Error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]

2014-06-19 04:41:40.715: [ default][616989264] Created alert : (:OHAS00106:) : OLR initialization failed, error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]

2014-06-19 04:41:40.715: [ default][616989264][PANIC] OHASD exiting; Could not init OLR

2014-06-19 04:41:40.715: [ default][616989264] Done

At 04:30 the software restore on node two was complete, but it hit the same error:

2014-06-19 04:00:22.064

[ohasd(28201)]CRS-0704:Oracle High Availability Service aborted due to Oracle Local Registry error [PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]]. Details at (:OHAS00106:) in /oracle/app/11.2.0.3/grid/log/gdgz-ps-tszc-db04-x3950/ohasd/ohasd.log.

The cause turned out to be the missing OLR file.

That file is owned by root, which is why it had not been backed up.

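Fortunately, node two's old directory had only been renamed and still existed. I took the file from node two's old software directory and put it in the corresponding directory, and node two started successfully.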

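I also copied the file to the corresponding directory on node one and renamed it, after which node one started successfully as well.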
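Before restarting the stack after a restore, it may be worth confirming that the OLR file is actually in place. The sketch below assumes the usual Linux layout, where /etc/oracle/olr.loc contains an `olrconfig_loc=` line pointing at the .olr file; `olr_path` and `olr_present` are hypothetical helper names.

```shell
#!/bin/sh
# olr_path (hypothetical helper): print the olrconfig_loc value from an
# olr.loc file. The default location is an assumption about Linux installs.
olr_path() {
    loc_file=${1:-/etc/oracle/olr.loc}
    sed -n 's/^olrconfig_loc=//p' "$loc_file"
}

# olr_present (hypothetical helper): succeed only when the file named by
# olrconfig_loc exists and is not empty.
olr_present() {
    path=$(olr_path "$1")
    [ -n "$path" ] && [ -s "$path" ]
}

# Example call as root before starting the stack:
#   olr_present || echo "OLR file is missing; ohasd will fail with PROCL-26"
```

This only checks that the file exists. `ocrcheck -local`, run as root, is the Oracle utility that checks the OLR itself.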

After that, the application still could not connect.

As described above, after correcting the permissions on $ORACLE_HOME/bin/oracle and restarting the database, everything was back to normal.

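This incident left a deep impression on me. Anyone who works on databases needs the composure to stay calm even when a mountain collapses in front of them. From two to four in the morning my mind was mostly blank and I did not know what I was doing. It is quite possible that the registry file on node one (the .olr file named in the log above) was deleted by me during that period, which is what caused the error when node one was restarted.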

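The road ahead is long. Compared with mid-level and senior engineers I am only a junior, and it is not without reason that I earn a junior's salary.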

Keep going.

posted on 2014-06-19 20:48 Ivo落班
