Exadata X8 and later models: watch out for the EX80 defect (bug 35241309)

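1. Case overview

A customer's Exadata X8M-2 suffered a database instance crash, and at the same time its ACFS file system was unmounted. The logs showed that the whole incident traced back to an abnormal dismount of +DG_ARCH, a Normal redundancy disk group. This article describes how +DG_ARCH came to be dismounted and how the problem was finally resolved.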

 

2. Case analysis

2.1 Examine the ASM alert logs on all compute nodes, noting when each instance started reporting errors and what the errors were.

2025-04-23T03:11:02.865201+08:00
NOTE: process _user200710_+asm5 (200710) initiating offline of disk 72.4042297380 (DG_ARCH_CD_00_DM04CELADM07) with mask 0x7e in group 1 (DG_ARCH) with client assisting
NOTE: initiating PST update: grp 1 (DG_ARCH), dsk = 72/0xf0f09024, mask = 0x6a, op = clear mandatory
2025-04-23T03:11:02.905214+08:00
NOTE: updating disk modes to 0x15 from 0x7f for disk 83 (DG_ARCH_CD_11_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x4
GMON updating disk modes for group 1 at 128 for pid 36, osid 200710
2025-04-23T03:11:02.968507+08:00
NOTE: PST update grp = 1 completed successfully
2025-04-23T03:11:02.972220+08:00
NOTE: updating disk modes to 0x15 from 0x7f for disk 78 (DG_ARCH_CD_06_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x4
NOTE: updating disk modes to 0x15 from 0x7f for disk 80 (DG_ARCH_CD_08_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x4
NOTE: updating disk modes to 0x15 from 0x7f for disk 75 (DG_ARCH_CD_03_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x4
NOTE: updating disk modes to 0x15 from 0x7f for disk 79 (DG_ARCH_CD_07_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x4
NOTE: updating disk modes to 0x15 from 0x7f for disk 76 (DG_ARCH_CD_04_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x4
NOTE: updating disk modes to 0x15 from 0x7f for disk 73 (DG_ARCH_CD_01_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x4
2025-04-23T03:11:05.879352+08:00
NOTE: updating disk modes to 0x1 from 0x15 for disk 72 (DG_ARCH_CD_00_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x4
NOTE: updating disk modes to 0x1 from 0x15 for disk 73 (DG_ARCH_CD_01_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x4
NOTE: updating disk modes to 0x1 from 0x15 for disk 75 (DG_ARCH_CD_03_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x4
NOTE: updating disk modes to 0x1 from 0x15 for disk 76 (DG_ARCH_CD_04_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x4
NOTE: updating disk modes to 0x1 from 0x15 for disk 78 (DG_ARCH_CD_06_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x4
NOTE: updating disk modes to 0x1 from 0x15 for disk 79 (DG_ARCH_CD_07_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x4
NOTE: updating disk modes to 0x1 from 0x15 for disk 80 (DG_ARCH_CD_08_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x4
NOTE: updating disk modes to 0x1 from 0x15 for disk 83 (DG_ARCH_CD_11_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x4
2025-04-23T03:11:05.879929+08:00
NOTE: cache closing disk 72 of grp 1: DG_ARCH_CD_00_DM04CELADM07
2025-04-23T03:11:05.880824+08:00
NOTE: cache closing disk 73 of grp 1: DG_ARCH_CD_01_DM04CELADM07
2025-04-23T03:11:05.881560+08:00
NOTE: cache closing disk 75 of grp 1: DG_ARCH_CD_03_DM04CELADM07
2025-04-23T03:11:05.881833+08:00
NOTE: cache closing disk 76 of grp 1: DG_ARCH_CD_04_DM04CELADM07
2025-04-23T03:11:05.882055+08:00
NOTE: cache closing disk 78 of grp 1: DG_ARCH_CD_06_DM04CELADM07
2025-04-23T03:11:05.882275+08:00
NOTE: cache closing disk 79 of grp 1: DG_ARCH_CD_07_DM04CELADM07
2025-04-23T03:11:05.882498+08:00
NOTE: cache closing disk 80 of grp 1: DG_ARCH_CD_08_DM04CELADM07
2025-04-23T03:11:05.882714+08:00
NOTE: cache closing disk 83 of grp 1: DG_ARCH_CD_11_DM04CELADM07
2025-04-23T03:11:08.830804+08:00
NOTE: cache closing disk 72 of grp 1: (not open) DG_ARCH_CD_00_DM04CELADM07
2025-04-23T03:11:08.830866+08:00
NOTE: cache closing disk 73 of grp 1: (not open) DG_ARCH_CD_01_DM04CELADM07
2025-04-23T03:11:08.830910+08:00
NOTE: cache closing disk 75 of grp 1: (not open) DG_ARCH_CD_03_DM04CELADM07
2025-04-23T03:11:08.830972+08:00
NOTE: cache closing disk 76 of grp 1: (not open) DG_ARCH_CD_04_DM04CELADM07
2025-04-23T03:11:08.831016+08:00
NOTE: cache closing disk 78 of grp 1: (not open) DG_ARCH_CD_06_DM04CELADM07
2025-04-23T03:11:08.831058+08:00
NOTE: cache closing disk 79 of grp 1: (not open) DG_ARCH_CD_07_DM04CELADM07
2025-04-23T03:11:08.831102+08:00
NOTE: cache closing disk 80 of grp 1: (not open) DG_ARCH_CD_08_DM04CELADM07
2025-04-23T03:11:08.831145+08:00
NOTE: cache closing disk 83 of grp 1: (not open) DG_ARCH_CD_11_DM04CELADM07

2025-04-23T03:11:11.466956+08:00
SQL> /* Exadata Auto Mgmt: ONLINE ASM Disk */
alter diskgroup DG_ARCH online disk DG_ARCH_CD_09_DM04CELADM07
, DG_ARCH_CD_02_DM04CELADM07
, DG_ARCH_CD_00_DM04CELADM07
, DG_ARCH_CD_01_DM04CELADM07
, DG_ARCH_CD_03_DM04CELADM07
, DG_ARCH_CD_04_DM04CELADM07
, DG_ARCH_CD_06_DM04CELADM07
, DG_ARCH_CD_07_DM04CELADM07
, DG_ARCH_CD_08_DM04CELADM07
, DG_ARCH_CD_11_DM04CELADM07
nowait

2025-04-23T03:11:11.581249+08:00
NOTE: cache closing disk 83 of grp 1: (not open) DG_ARCH_CD_11_DM04CELADM07
NOTE: updating disk modes to 0x11 from 0x1 for disk 72 (DG_ARCH_CD_00_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x0
NOTE: updating disk modes to 0x11 from 0x1 for disk 73 (DG_ARCH_CD_01_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x0
NOTE: updating disk modes to 0x11 from 0x1 for disk 74 (DG_ARCH_CD_02_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x0
NOTE: updating disk modes to 0x11 from 0x1 for disk 75 (DG_ARCH_CD_03_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x0
NOTE: updating disk modes to 0x11 from 0x1 for disk 76 (DG_ARCH_CD_04_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x0
NOTE: updating disk modes to 0x11 from 0x1 for disk 78 (DG_ARCH_CD_06_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x0
NOTE: updating disk modes to 0x11 from 0x1 for disk 79 (DG_ARCH_CD_07_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x0
NOTE: updating disk modes to 0x11 from 0x1 for disk 80 (DG_ARCH_CD_08_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x0
NOTE: updating disk modes to 0x11 from 0x1 for disk 81 (DG_ARCH_CD_09_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x0
NOTE: updating disk modes to 0x11 from 0x1 for disk 83 (DG_ARCH_CD_11_DM04CELADM07) in group 1 (DG_ARCH): lflags 0x0

2025-04-23T03:11:11.595321+08:00
NOTE: disk validation pending for 10 disks in group 1/0x59107fd8 (DG_ARCH)
NOTE: Found o/12.56.130.125;12.56.130.126/DG_ARCH_CD_00_dm04celadm07 for disk DG_ARCH_CD_00_DM04CELADM07
NOTE: Found o/12.56.130.125;12.56.130.126/DG_ARCH_CD_02_dm04celadm07 for disk DG_ARCH_CD_02_DM04CELADM07
NOTE: Found o/12.56.130.125;12.56.130.126/DG_ARCH_CD_06_dm04celadm07 for disk DG_ARCH_CD_06_DM04CELADM07
NOTE: Found o/12.56.130.125;12.56.130.126/DG_ARCH_CD_04_dm04celadm07 for disk DG_ARCH_CD_04_DM04CELADM07
NOTE: Found o/12.56.130.125;12.56.130.126/DG_ARCH_CD_11_dm04celadm07 for disk DG_ARCH_CD_11_DM04CELADM07
NOTE: Found o/12.56.130.125;12.56.130.126/DG_ARCH_CD_01_dm04celadm07 for disk DG_ARCH_CD_01_DM04CELADM07
NOTE: Found o/12.56.130.125;12.56.130.126/DG_ARCH_CD_07_dm04celadm07 for disk DG_ARCH_CD_07_DM04CELADM07
NOTE: Found o/12.56.130.125;12.56.130.126/DG_ARCH_CD_08_dm04celadm07 for disk DG_ARCH_CD_08_DM04CELADM07
NOTE: Found o/12.56.130.125;12.56.130.126/DG_ARCH_CD_03_dm04celadm07 for disk DG_ARCH_CD_03_DM04CELADM07
NOTE: Found o/12.56.130.125;12.56.130.126/DG_ARCH_CD_09_dm04celadm07 for disk DG_ARCH_CD_09_DM04CELADM07
NOTE: completed disk validation for 1/0x59107fd8 (DG_ARCH)
2025-04-23T03:11:12.255977+08:00
NOTE: initiating client [sdswjcdb1:sdswjcdb:cluster-dm04] discovery for group 1 (reqid:6418231962720382668)
2025-04-23T03:11:12.256026+08:00
NOTE: initiating client [+APX5:+APX:cluster-dm04] discovery for group 1 (reqid:6418231962720382668)
NOTE: client [+APX5:+APX:cluster-dm04] completed disk validation (reqid:6418231962720382668 KFNPDRFLG=1)
2025-04-23T03:11:12.969339+08:00
NOTE: client [sdswjcdb1:sdswjcdb:cluster-dm04] completed disk validation (reqid:6418231962720382668 KFNPDRFLG=1)
2025-04-23T03:11:14.902681+08:00
NOTE: reopening 10 disks for group 1

2025-04-23T03:11:15.077164+08:00
NOTE: client sdswjcdb1:sdswjcdb:cluster-dm04 mounted group 1 (DG_ARCH)

2025-04-23T03:11:15.686606+08:00
GMON querying group 1 at 130 for pid 28, osid 68394
NOTE: cache opening disk 72 of grp 1: DG_ARCH_CD_00_DM04CELADM07 path:o/12.56.130.125;12.56.130.126/DG_ARCH_CD_00_dm04celadm07
NOTE: cache opening disk 73 of grp 1: DG_ARCH_CD_01_DM04CELADM07 path:o/12.56.130.125;12.56.130.126/DG_ARCH_CD_01_dm04celadm07
NOTE: cache opening disk 74 of grp 1: DG_ARCH_CD_02_DM04CELADM07 path:o/12.56.130.125;12.56.130.126/DG_ARCH_CD_02_dm04celadm07
NOTE: cache opening disk 75 of grp 1: DG_ARCH_CD_03_DM04CELADM07 path:o/12.56.130.125;12.56.130.126/DG_ARCH_CD_03_dm04celadm07
NOTE: cache opening disk 76 of grp 1: DG_ARCH_CD_04_DM04CELADM07 path:o/12.56.130.125;12.56.130.126/DG_ARCH_CD_04_dm04celadm07
NOTE: cache opening disk 78 of grp 1: DG_ARCH_CD_06_DM04CELADM07 path:o/12.56.130.125;12.56.130.126/DG_ARCH_CD_06_dm04celadm07
NOTE: cache opening disk 79 of grp 1: DG_ARCH_CD_07_DM04CELADM07 path:o/12.56.130.125;12.56.130.126/DG_ARCH_CD_07_dm04celadm07
NOTE: cache opening disk 80 of grp 1: DG_ARCH_CD_08_DM04CELADM07 path:o/12.56.130.125;12.56.130.126/DG_ARCH_CD_08_dm04celadm07
NOTE: cache opening disk 81 of grp 1: DG_ARCH_CD_09_DM04CELADM07 path:o/12.56.130.125;12.56.130.126/DG_ARCH_CD_09_dm04celadm07
NOTE: cache opening disk 83 of grp 1: DG_ARCH_CD_11_DM04CELADM07 path:o/12.56.130.125;12.56.130.126/DG_ARCH_CD_11_dm04celadm07

2025-04-23T03:11:18.701713+08:00
NOTE: starting rebalance of group 1/0x591223f7 (DG_ARCH) at power 11
NOTE: starting process ARBA
Starting background process ARBA
2025-04-23T03:11:18.801346+08:00
ARB0 started with pid=77, OS id=103295
NOTE: assigning ARBA to group 1/0x591223f7 (DG_ARCH) to compute estimates
NOTE: assigning ARB0 to group 1/0x591223f7 (DG_ARCH) with 11 parallel I/Os

2025-04-23T03:41:48.219432+08:00
NOTE: process _user83008_+asm5 (83008) initiating offline of disk 17.4042297442 (DG_ARCH_CD_05_DM04CELADM02) with mask 0x7e in group 1 (DG_ARCH) with client assisting
NOTE: initiating PST update: grp 1 (DG_ARCH), dsk = 17/0xf0f09062, mask = 0x6a, op = clear mandatory
2025-04-23T03:41:48.230032+08:00
GMON updating disk modes for group 1 at 131 for pid 65, osid 83008
ERROR: disk 17 (DG_ARCH_CD_05_DM04CELADM02) in group 1 (DG_ARCH) cannot be offlined because all disks [17(DG_ARCH_CD_05_DM04CELADM02), 72(DG_ARCH_CD_00_DM04CELADM07)] with mirrored data would be offline.
2025-04-23T03:41:48.249361+08:00
ERROR: too many offline disks in PST (grp 1)
2025-04-23T03:41:48.293440+08:00
NOTE: halting all I/Os to diskgroup 1 (DG_ARCH)
2025-04-23T03:41:49.170741+08:00
SQL> alter diskgroup DG_ARCH dismount force /* ASM SERVER:1494253528 */
2025-04-23T03:41:49.190989+08:00
NOTE: client +ASM5:+ASM:cluster-dm04 no longer has group 1 (DG_ARCH) mounted

 

 

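A brief explanation of the ASM log above:

    At 03:11:02, an offline was initiated for the CD_00 disk on storage cell 7 (disk 72 in the DG_ARCH disk group). Disks 72, 73, 75, 76, 78, 79, 80 and 83, all of which belong to storage cell 7, then changed mode and were closed.

    At 03:11:11, the storage management software issued an online for these disks (the statement lists 10 disks, which also include disks 74 and 81). By 03:11:15 the disks had been reopened, and at 03:11:18 ASM started a rebalance.

    At 03:41:48, half an hour after the rebalance started and while it was still running, an offline was initiated for the CD_05 disk on storage cell 2 (disk 17 in the DG_ARCH disk group). Disk 17 and disk 72 are partners, and ASM reported that offlining disk 17 would leave every disk holding the mirrored data offline, so the DG_ARCH disk group was force dismounted.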
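As a rough aid for this kind of review, the following Python sketch pulls the key events out of alert-log text and measures the gap between the rebalance start and the second offline. The sample lines are copied from the excerpt above; the event labels and the `key_events` helper are names made up for this sketch, and the patterns only cover the messages seen in this incident.

```python
import re
from datetime import datetime

# Lines copied verbatim from the ASM alert log excerpt above.
SAMPLE = """\
2025-04-23T03:11:02.865201+08:00
NOTE: process _user200710_+asm5 (200710) initiating offline of disk 72.4042297380 (DG_ARCH_CD_00_DM04CELADM07) with mask 0x7e in group 1 (DG_ARCH) with client assisting
2025-04-23T03:11:18.701713+08:00
NOTE: starting rebalance of group 1/0x591223f7 (DG_ARCH) at power 11
2025-04-23T03:41:48.219432+08:00
NOTE: process _user83008_+asm5 (83008) initiating offline of disk 17.4042297442 (DG_ARCH_CD_05_DM04CELADM02) with mask 0x7e in group 1 (DG_ARCH) with client assisting
2025-04-23T03:41:48.249361+08:00
ERROR: too many offline disks in PST (grp 1)
2025-04-23T03:41:49.170741+08:00
SQL> alter diskgroup DG_ARCH dismount force /* ASM SERVER:1494253528 */
"""

# A timestamp line stands alone and applies to the messages that follow it.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[+-]\d{2}:\d{2}$")

# Label -> pattern. The labels are illustrative names, not Oracle terminology.
PATTERNS = {
    "offline": re.compile(r"initiating offline of disk (\d+)\.\d+ \((\S+)\)"),
    "rebalance": re.compile(r"starting rebalance of group"),
    "too_many_offline": re.compile(r"too many offline disks in PST"),
    "force_dismount": re.compile(r"dismount force"),
}


def key_events(text):
    """Return (timestamp, label, line) for every line matching a known pattern."""
    events = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if TIMESTAMP.match(line):
            current = datetime.fromisoformat(line)
            continue
        if current is None:
            continue
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                events.append((current, label, line))
                break
    return events


events = key_events(SAMPLE)
for when, label, _ in events:
    print(when.time(), label)

rebalance_start = next(t for t, label, _ in events if label == "rebalance")
second_offline = [t for t, label, _ in events if label == "offline"][1]
gap = second_offline - rebalance_start
print("rebalance start to second offline:", gap)
```

On the sample lines the gap comes out at roughly 30 minutes 29.5 seconds, which matches the "half an hour" noted above.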

 

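2.2 At this point the reason DG_ARCH was dismounted is clear. But we need to dig further: why did 10 disks on storage cell 7 suddenly go offline at about 03:11:02?

Check the storage software log on storage cell 7.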

2025-04-23T03:11:02.362468+08:00
NO IO COMPLETION ON DISK /dev/sdg FOR 5000 MILLISECONDS: CD - CD_06_dm04celadm07 TIME - Wed Apr 23 03:11:02 2025 192 msec.
LAST IO SUBMITTED ON DISK /dev/sdg AT: Wed Apr 23 03:11:02 2025 242 msec
LAST IO COMPLETED ON DISK /dev/sdg AT: Wed Apr 23 03:10:56 2025 580 msec
FIRST IO SUBMITTED SINCE LAST COMPLETION ON DISK /dev/sdg AT: Wed Apr 23 03:10:57 2025 36 msec
NO IO COMPLETION ON DISK /dev/sdj FOR 5000 MILLISECONDS: CD - CD_09_dm04celadm07 TIME - Wed Apr 23 03:11:02 2025 192 msec.
LAST IO SUBMITTED ON DISK /dev/sdj AT: Wed Apr 23 03:11:02 2025 242 msec
LAST IO COMPLETED ON DISK /dev/sdj AT: Wed Apr 23 03:10:56 2025 735 msec
FIRST IO SUBMITTED SINCE LAST COMPLETION ON DISK /dev/sdj AT: Wed Apr 23 03:10:57 2025 101 msec
NO IO COMPLETION ON DISK /dev/sdi FOR 5000 MILLISECONDS: CD - CD_08_dm04celadm07 TIME - Wed Apr 23 03:11:02 2025 192 msec.
LAST IO SUBMITTED ON DISK /dev/sdi AT: Wed Apr 23 03:11:02 2025 242 msec
LAST IO COMPLETED ON DISK /dev/sdi AT: Wed Apr 23 03:10:56 2025 730 msec
FIRST IO SUBMITTED SINCE LAST COMPLETION ON DISK /dev/sdi AT: Wed Apr 23 03:10:57 2025 36 msec
NO IO COMPLETION ON DISK /dev/sdd FOR 5000 MILLISECONDS: CD - CD_03_dm04celadm07 TIME - Wed Apr 23 03:11:02 2025 192 msec.
LAST IO SUBMITTED ON DISK /dev/sdd AT: Wed Apr 23 03:11:02 2025 242 msec
LAST IO COMPLETED ON DISK /dev/sdd AT: Wed Apr 23 03:10:56 2025 215 msec
FIRST IO SUBMITTED SINCE LAST COMPLETION ON DISK /dev/sdd AT: Wed Apr 23 03:10:57 2025 186 msec
NO IO COMPLETION ON DISK /dev/sdh FOR 5000 MILLISECONDS: CD - CD_07_dm04celadm07 TIME - Wed Apr 23 03:11:02 2025 192 msec.
LAST IO SUBMITTED ON DISK /dev/sdh AT: Wed Apr 23 03:11:02 2025 242 msec
LAST IO COMPLETED ON DISK /dev/sdh AT: Wed Apr 23 03:10:56 2025 415 msec
FIRST IO SUBMITTED SINCE LAST COMPLETION ON DISK /dev/sdh AT: Wed Apr 23 03:10:57 2025 101 msec
NO IO COMPLETION ON DISK /dev/sde FOR 5000 MILLISECONDS: CD - CD_04_dm04celadm07 TIME - Wed Apr 23 03:11:02 2025 192 msec.
LAST IO SUBMITTED ON DISK /dev/sde AT: Wed Apr 23 03:11:02 2025 242 msec
LAST IO COMPLETED ON DISK /dev/sde AT: Wed Apr 23 03:10:56 2025 760 msec
FIRST IO SUBMITTED SINCE LAST COMPLETION ON DISK /dev/sde AT: Wed Apr 23 03:10:57 2025 141 msec
NO IO COMPLETION ON DISK /dev/sdl FOR 5000 MILLISECONDS: CD - CD_11_dm04celadm07 TIME - Wed Apr 23 03:11:02 2025 192 msec.
LAST IO SUBMITTED ON DISK /dev/sdl AT: Wed Apr 23 03:11:02 2025 242 msec
LAST IO COMPLETED ON DISK /dev/sdl AT: Wed Apr 23 03:10:56 2025 590 msec
FIRST IO SUBMITTED SINCE LAST COMPLETION ON DISK /dev/sdl AT: Wed Apr 23 03:10:57 2025 16 msec
NO IO COMPLETION ON DISK /dev/sdb FOR 5000 MILLISECONDS: CD - CD_01_dm04celadm07 TIME - Wed Apr 23 03:11:02 2025 192 msec.
LAST IO SUBMITTED ON DISK /dev/sdb AT: Wed Apr 23 03:11:02 2025 297 msec
LAST IO COMPLETED ON DISK /dev/sdb AT: Wed Apr 23 03:10:56 2025 500 msec
FIRST IO SUBMITTED SINCE LAST COMPLETION ON DISK /dev/sdb AT: Wed Apr 23 03:10:57 2025 166 msec
NO IO COMPLETION ON DISK /dev/sda FOR 5000 MILLISECONDS: CD - CD_00_dm04celadm07 TIME - Wed Apr 23 03:11:02 2025 192 msec.
LAST IO SUBMITTED ON DISK /dev/sda AT: Wed Apr 23 03:11:02 2025 242 msec
LAST IO COMPLETED ON DISK /dev/sda AT: Wed Apr 23 03:10:56 2025 190 msec
FIRST IO SUBMITTED SINCE LAST COMPLETION ON DISK /dev/sda AT: Wed Apr 23 03:10:57 2025 81 msec

2025-04-23T03:11:02.804779+08:00
Redo log write error 2203 (IO cancelled due to failed disk) on griddisk DG_ARCH_CD_03_dm04celadm07: use of Flash Log for this device has been disabled
Redo log write error 2203 (IO cancelled due to failed disk) on griddisk DG_ARCH_CD_08_dm04celadm07: use of Flash Log for this device has been disabled
Redo log write error 2203 (IO cancelled due to failed disk) on griddisk DG_ARCH_CD_04_dm04celadm07: use of Flash Log for this device has been disabled
Redo log write error 2203 (IO cancelled due to failed disk) on griddisk DG_ARCH_CD_07_dm04celadm07: use of Flash Log for this device has been disabled
2025-04-23T03:11:02.833501+08:00
Redo log write error 2203 (IO cancelled due to failed disk) on griddisk DG_ARCH_CD_00_dm04celadm07: use of Flash Log for this device has been disabled
2025-04-23T03:11:02.844771+08:00
Redo log write error 2203 (IO cancelled due to failed disk) on griddisk DG_ARCH_CD_11_dm04celadm07: use of Flash Log for this device has been disabled
......

2025-04-23T03:13:26.062469+08:00
Application of all saved redo for griddisk DG_ARCH_CD_01_dm04celadm07: use of Flash Log for this device has been re-enabled
2025-04-23T03:15:00.917210+08:00
Application of all saved redo for griddisk DG_ARCH_CD_03_dm04celadm07: use of Flash Log for this device has been re-enabled
2025-04-23T03:15:02.315461+08:00
Application of all saved redo for griddisk DG_ARCH_CD_10_dm04celadm07: use of Flash Log for this device has been re-enabled
2025-04-23T03:15:03.029802+08:00
Application of all saved redo for griddisk DG_ARCH_CD_04_dm04celadm07: use of Flash Log for this device has been re-enabled
2025-04-23T03:15:03.589683+08:00
Application of all saved redo for griddisk DG_ARCH_CD_08_dm04celadm07: use of Flash Log for this device has been re-enabled
2025-04-23T03:15:56.966866+08:00
Application of all saved redo for griddisk DG_ARCH_CD_11_dm04celadm07: use of Flash Log for this device has been re-enabled
2025-04-23T03:15:58.281695+08:00
Application of all saved redo for griddisk DG_ARCH_CD_00_dm04celadm07: use of Flash Log for this device has been re-enabled
2025-04-23T03:15:58.616238+08:00
Application of all saved redo for griddisk DG_ARCH_CD_07_dm04celadm07: use of Flash Log for this device has been re-enabled
2025-04-23T03:15:59.177626+08:00
Application of all saved redo for griddisk DG_ARCH_CD_06_dm04celadm07: use of Flash Log for this device has been re-enabled
......

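The storage software log shows that at 03:11:02 a large number of disks on storage cell 7 stopped responding to I/O and writes to those disks failed. From about 03:13 onward, Flash Log was re-enabled for the affected grid disks, which shows that disk I/O had returned to normal.

The log on storage cell 2 shows that at about 03:41:47 a large number of disks likewise stopped responding to I/O. The entries are similar to those on storage cell 7.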
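To see how long each disk had gone without a completed I/O when the cell software flagged it, a short Python sketch can pair each "NO IO COMPLETION" line with the matching "LAST IO COMPLETED" line. The sample lines are copied from the storage cell 7 log above; `stall_ms` and `parse_time` are helper names invented here, and the date parsing assumes an English (C) locale.

```python
import re
from datetime import datetime, timedelta

# Lines copied verbatim from the storage cell 7 log excerpt above.
SAMPLE = """\
NO IO COMPLETION ON DISK /dev/sdg FOR 5000 MILLISECONDS: CD - CD_06_dm04celadm07 TIME - Wed Apr 23 03:11:02 2025 192 msec.
LAST IO SUBMITTED ON DISK /dev/sdg AT: Wed Apr 23 03:11:02 2025 242 msec
LAST IO COMPLETED ON DISK /dev/sdg AT: Wed Apr 23 03:10:56 2025 580 msec
NO IO COMPLETION ON DISK /dev/sda FOR 5000 MILLISECONDS: CD - CD_00_dm04celadm07 TIME - Wed Apr 23 03:11:02 2025 192 msec.
LAST IO SUBMITTED ON DISK /dev/sda AT: Wed Apr 23 03:11:02 2025 242 msec
LAST IO COMPLETED ON DISK /dev/sda AT: Wed Apr 23 03:10:56 2025 190 msec
"""

DETECTED = re.compile(
    r"NO IO COMPLETION ON DISK (\S+) FOR \d+ MILLISECONDS: "
    r"CD - (\S+) TIME - (.+?) (\d+) msec"
)
COMPLETED = re.compile(r"LAST IO COMPLETED ON DISK (\S+) AT: (.+?) (\d+) msec")


def parse_time(stamp, msec):
    """Combine 'Wed Apr 23 03:11:02 2025' and a millisecond field into a datetime."""
    base = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    return base + timedelta(milliseconds=int(msec))


def stall_ms(text):
    """Map device -> (cell disk, ms between last completed I/O and detection)."""
    detected, completed = {}, {}
    for line in text.splitlines():
        m = DETECTED.search(line)
        if m:
            detected[m.group(1)] = (m.group(2), parse_time(m.group(3), m.group(4)))
            continue
        m = COMPLETED.search(line)
        if m:
            completed[m.group(1)] = parse_time(m.group(2), m.group(3))
    result = {}
    for dev, (celldisk, when) in detected.items():
        if dev in completed:
            delta = when - completed[dev]
            result[dev] = (celldisk, delta // timedelta(milliseconds=1))
    return result


stalls = stall_ms(SAMPLE)
for dev, (celldisk, ms) in stalls.items():
    print(f"{dev} ({celldisk}): no completed I/O for {ms} ms")
```

Running the same function over the full cell log would also show whether the stalls on all disks started within the same second, which is what points away from a single failing drive.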

 

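2.3 The overall timeline of the failure is therefore as follows. At 03:11:02, a large number of disks on storage cell 7 stopped responding to I/O and writes to them failed, which triggered the disk offline. The disks were brought back online within seconds and ASM started a rebalance at 03:11:18; disk I/O on the cell was back to normal a few minutes later. While the rebalance was still running, a large number of disks on storage cell 2 also stopped responding to I/O. Because some disks on storage cell 2 are partners of disks on storage cell 7, the Normal redundancy DG_ARCH disk group was finally dismounted.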
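The partner reasoning can be illustrated with a deliberately simplified Python model. This is not how ASM implements the check (ASM works on extents and the PST); it only assumes that, in a Normal redundancy group, a disk cannot be offlined while one of its partners is still unavailable. The 17/72 partner relation comes from the ASM error message; the assumption that disk 72 still counted as unavailable at 03:41:48 is inferred from that same message, and `blocking_partners` is a hypothetical helper.

```python
def blocking_partners(disk, partners, unavailable):
    """Return the partners of `disk` that are currently unavailable.

    A non-empty result means that offlining `disk` could leave some
    mirrored data with no available copy (simplified model).
    """
    return sorted(partners.get(disk, set()) & unavailable)


# Only the 17 <-> 72 relation is known from the alert log; real disks have more partners.
partners = {17: {72}, 72: {17}}

# Assumption: disk 72 had not finished resynchronizing when disk 17 was hit.
unavailable = {72}

blockers = blocking_partners(17, partners, unavailable)
if blockers:
    print(f"disk 17 cannot be offlined safely, unavailable partners: {blockers}")
else:
    print("disk 17 can be offlined")
```

In this model, had the second stall occurred after disk 72 was fully available again, `unavailable` would be empty and the offline of disk 17 would not have taken the disk group down.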

 

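2.4 The key question now is: why would a large number of disks on a single storage cell stop responding to I/O at the same time? The OSWatcher (OSW) logs for the failure window show very low I/O utilization, so the stall was not caused by heavy I/O load.

2.5 A search of My Oracle Support (MOS) turned up the note "(EX80) Storage servers experience high rates of hard disk drive failure, which can cause ASM disk group loss (Doc ID 2974254.1)", whose description matches this failure.

2.6 In the end, the storage management software was upgraded, which resolved the problem.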

 

3. Case summary

The impact of this bug is severe, and Oracle strongly recommends addressing it as soon as possible.

 

posted @ 2025-08-06 09:52  石云华