lock_sga这个雷子炸了
接通知某系统1号机宕机,且尝试重启5分钟又再次宕机
通过VPN登录查看数据库的告警日志
1 Fri Feb 8 20:53:19 2013 2 Errors in file /oracle/db/admin/ahsmdb/bdump/ahsmdb1_lmon_495646.trc: 3 ORA-27091: unable to queue I/O 4 ORA-27072: File I/O error 5 IBM AIX RISC System/6000 Error: 12: Not enough space 6 Additional information: 7 7 Additional information: 2224224 8 Additional information: -1 9 Fri Feb 8 20:53:19 2013 10 Errors in file /oracle/db/admin/ahsmdb/bdump/ahsmdb1_lmon_495646.trc: 11 ORA-00202: control file: '+DATA/ahsmdb/controlfile/current.297.702063033' 12 ORA-15080: synchronous I/O operation to a disk failed 13 Fri Feb 8 20:53:21 2013 14 Errors in file /oracle/db/admin/ahsmdb/bdump/ahsmdb1_ckpt_552974.trc: 15 ORA-00206: error in writing (block 3, # blocks 1) of control file 16 ORA-00202: control file: '+DATA/ahsmdb/controlfile/current.296.702063033' 17 ORA-27070: async read/write failed 18 ORA-00206: error in writing (block 3, # blocks 1) of control file 19 ORA-00202: control file: '+DATA/ahsmdb/controlfile/current.297.702063033' 20 ORA-27070: async read/write failed 21 Fri Feb 8 20:53:22 2013 22 Errors in file /oracle/db/admin/ahsmdb/bdump/ahsmdb1_ckpt_552974.trc: 23 ORA-00221: error on write to control file 24 ORA-00206: error in writing (block 3, # blocks 1) of control file 25 ORA-00202: control file: '+DATA/ahsmdb/controlfile/current.296.702063033' 26 ORA-27070: async read/write failed 27 ORA-00206: error in writing (block 3, # blocks 1) of control file 28 ORA-00202: control file: '+DATA/ahsmdb/controlfile/current.297.702063033' 29 ORA-27070: async read/write failed 30 Fri Feb 8 20:53:22 2013 31 CKPT: terminating instance due to error 221
从告警日志查看刚开始判断是盘阵的空间不足导致,通过一下命令可以查看ASM管理的盘阵使用情况:
1 $ ps -ef|grep pmon 2 oracle 249868 1 0 Jan 22 - 3:56 asm_pmon_+ASM1 3 oracle 1618094 1 0 15:15:24 - 0:00 ora_pmon_ahsmdb1 4 oracle 1622138 1646698 0 15:42:37 pts/1 0:00 grep pmon 5 $ export ORACLE_SID=+ASM1 6 $ asmcmd 7 ASMCMD> lsdg 8 State Type Rebal Unbal Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Name 9 MOUNTED EXTERN N N 512 4096 1048576 409600 290397 0 290397 0 DATA/
可以查看到盘阵的剩余空间足够,排除该原因之后再看内存使用情况
将数据库起起来后,通过topas观察内存使用变化发现,物理内存使用率慢慢涨到99%,交换区涨到10之后库就挂掉了
所以觉得很奇怪,交换区没有使用完,不应该是因为内存不足原因导致数据库宕机,此时思路陷入僵局。
继续观察多次起库的后台日志,发现在第二次重启库宕机之前有个报错如下:
Sat Feb 9 13:09:48 2013 Errors in file /oracle/db/admin/ahsmdb/bdump/ahsmdb1_psp0_495796.trc: ORA-27300: OS system dependent operation:fork failed with status: 0 ORA-27301: OS failure message: Error 0 ORA-27302: failure occurred at: skgpspawn5 ORA-27303: additional information: skgpspawn5
该报错是由于申请交换区失败导致,但之前查看交换区使用率是正常的。
扭头查看2号机的数据库参数和内存使用情况,一切正常,但sga的一个属性lock_sga=true引起了我的注意
这个属性的意思是想把sga内存锁在物理内存中,这样可以保证数据库缓存的数据都是用的物理内存,而不用交换空间,这样能大大提高数据库的速度,
但这也同样存在一个风险,就是物理内存不足的时候,原有的进程想继续申请物理内存的时候就会报错,提示Not enough space,然后再尝试把1号机数据库起起来看下属性,发现同样是true,再结合之前起库之后不就,物理内存的使用率是99%,也就说明问题了
通过修改pfile,将sga和pga都改小后,重启数据库,观察正常。
思:lock_sga=true这个属性只有在物理内存比较充裕的时候才能设,慎用!

浙公网安备 33010602011771号