Alert notification
HostRaidDiskFailure
Severity: warning
Start time: 2025-11-28T22:24:22.663Z
Failed host IP: xxxx:9100
Description: At least one device in RAID array on xxxx:9100 failed. Array needs attention and possibly a disk swap
VALUE = 1
LABELS = map[__name__:node_md_disks cluster:kafka createdate:20231208 device:md126 envir:生产 hostname:xxxxx instance:xxxx:9100 job:zhongtai-node module:性能/采控 servername:性能/采控 state:failed type:os]
Details:
- alertname: HostRaidDiskFailure
- envir: 生产
- job: zhongtai-node
- severity: warning
- state: failed
- type: os
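The alert fires on the node_md_disks metric with state="failed". As a rough way to confirm it at the source, the exporter output can be filtered directly. This is a minimal sketch: it assumes the exporter serves the usual text exposition format with `device` and `state` labels (as in the LABELS line above), and `HOST` is a placeholder, not a real address.

```shell
#!/bin/sh
# Read node_exporter text-format metrics on stdin and print every
# node_md_disks sample with state="failed" whose value is above zero.
failed_md_disks() {
    awk '/^node_md_disks[{]/ && /state="failed"/ && $NF + 0 > 0 { print }'
}

# Hypothetical usage (HOST is a placeholder for the exporter address):
#   curl -s "http://HOST:9100/metrics" | failed_md_disks
```

An empty result would suggest the array no longer reports failed members.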
Check RAID status
cat /proc/mdstat
Personalities : [raid10] [raid1]
md1 : active raid1 nvme1n1[1] nvme0n1[0]
3125484864 blocks super 1.2 [2/2] [UU]
bitmap: 3/24 pages [12KB], 65536KB chunk
md125 : active raid10 sdb[0] sdc[1] sde[3] sdd[2]
15627788288 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
bitmap: 38/117 pages [152KB], 65536KB chunk
md126 : active raid10 sdl[2] sdk[1] sdj[0](F) sdm[3]
15627788288 blocks super 1.2 512K chunks 2 near-copies [4/3] [_UUU]
bitmap: 41/117 pages [164KB], 65536KB chunk
md127 : active raid10 sdg[1] sdf[0] sdh[2] sdi[3]
15627788288 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
bitmap: 38/117 pages [152KB], 65536KB chunk
unused devices: <none>
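In the output above, md126 is the only array whose member map contains an underscore ([_UUU], 3 of 4 members up). On a host with many arrays it may help to pick these out automatically. The helper below is a sketch that assumes the /proc/mdstat layout shown here; the function name is made up for illustration.

```shell
#!/bin/sh
# Read /proc/mdstat-style text on stdin and print the name of every array
# whose member map (for example [_UUU]) contains an underscore, which marks
# a missing or failed member slot.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        name != "" && /\[[U_]+\]/ {
            if (match($0, /\[[U_]+\]/) && index(substr($0, RSTART, RLENGTH), "_"))
                print name
            name = ""
        }
    '
}

# Usage:
#   degraded_arrays < /proc/mdstat
```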
Check md126 status
mdadm --detail /dev/md126
/dev/md126:
Version : 1.2
Creation Time : Sat Dec 2 00:47:01 2023
Raid Level : raid10
Array Size : 15627788288 (14903.82 GiB 16002.86 GB)
Used Dev Size : 7813894144 (7451.91 GiB 8001.43 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Dec 24 14:18:01 2025
State : active, degraded
Active Devices : 3
Working Devices : 3
Failed Devices : 1
Spare Devices : 0
Layout : near=2
Chunk Size : 512K
Consistency Policy : bitmap
Name : nmhs-pp-v2mw080047:103
UUID : ff0b12ae:f767d785:96312351:76c52424
Events : 3775086
Number Major Minor RaidDevice State
- 0 0 0 removed
1 8 160 1 active sync set-B /dev/sdk
2 8 176 2 active sync set-A /dev/sdl
3 8 192 3 active sync set-B /dev/sdm
0 8 144 - faulty /dev/sdj
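The last row of the member table is the one that matters here: /dev/sdj is listed as faulty. A small filter can pull faulty members out of the same output. This is a sketch that assumes the member-table layout shown above, where the device path is the last column and the state word precedes it.

```shell
#!/bin/sh
# Read `mdadm --detail` output on stdin and print the device path of every
# member row whose state column is "faulty".
faulty_members() {
    awk 'NF >= 2 && $(NF - 1) == "faulty" { print $NF }'
}

# Usage:
#   mdadm --detail /dev/md126 | faulty_members
```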
Confirm disk information
ls -l /dev/sdj
brw-rw---- 1 root disk 8, 144 Nov 28 09:48 /dev/sdj
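The device node reports major 8, minor 144, which matches the Major and Minor columns of the faulty row in the mdadm output, so /dev/sdj is the same device the array is complaining about. The comparison can be sketched as below; both helper names are hypothetical and the parsing assumes the `ls -l` and `mdadm --detail` layouts shown in these notes.

```shell
#!/bin/sh
# Extract "major:minor" from one line of `ls -l` output for a device node.
ls_majmin() {
    awk '$1 ~ /^[bc]/ { sub(/,$/, "", $5); print $5 ":" $6 }'
}

# Extract "major:minor" for member $1 from `mdadm --detail` output on stdin.
md_majmin() {
    awk -v dev="$1" 'NF >= 4 && $NF == dev { print $2 ":" $3 }'
}

# Usage:
#   [ "$(ls -l /dev/sdj | ls_majmin)" = \
#     "$(mdadm --detail /dev/md126 | md_majmin /dev/sdj)" ] && echo match
```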
Remove the failed disk
Mark the disk as faulty
mdadm /dev/md126 --fail /dev/sdj
mdadm: set /dev/sdj faulty in /dev/md126
Remove the disk from the array
mdadm /dev/md126 --remove /dev/sdj
mdadm: hot removed /dev/sdj from /dev/md126
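The two commands above change the array, so a typo in the device name could take out a healthy member. One possible guard is to run them only when mdadm already reports the member as faulty. The wrapper below is a sketch of that idea, not part of the original procedure; the function name and the MDADM override variable are assumptions made for illustration.

```shell
#!/bin/sh
# Command used to talk to md; overridable so the logic can be dry-run.
MDADM=${MDADM:-mdadm}

# remove_failed_member ARRAY MEMBER
# Fail and hot-remove MEMBER only if `mdadm --detail ARRAY` lists it as faulty.
remove_failed_member() {
    if ! "$MDADM" --detail "$1" | awk -v dev="$2" '
            NF >= 2 && $NF == dev && $(NF - 1) == "faulty" { found = 1 }
            END { exit !found }'; then
        echo "refusing: $2 is not reported faulty in $1" >&2
        return 1
    fi
    "$MDADM" "$1" --fail "$2" && "$MDADM" "$1" --remove "$2"
}

# Usage:
#   remove_failed_member /dev/md126 /dev/sdj
```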
Confirm RAID information
mdadm --detail /dev/md126
/dev/md126:
Version : 1.2
Creation Time : Sat Dec 2 00:47:01 2023
Raid Level : raid10
Array Size : 15627788288 (14903.82 GiB 16002.86 GB)
Used Dev Size : 7813894144 (7451.91 GiB 8001.43 GB)
Raid Devices : 4
Total Devices : 3
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Dec 24 14:59:02 2025
State : clean, degraded
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : near=2
Chunk Size : 512K
Consistency Policy : bitmap
Name : nmhs-pp-v2mw080047:103
UUID : ff0b12ae:f767d785:96312351:76c52424
Events : 3777482
Number Major Minor RaidDevice State
- 0 0 0 removed
1 8 160 1 active sync set-B /dev/sdk
2 8 176 2 active sync set-A /dev/sdl
3 8 192 3 active sync set-B /dev/sdm
Add the new disk
mdadm --add /dev/md126 /dev/sdj
mdadm: added /dev/sdj
Check the md array recovery progress
cat /proc/mdstat
md126 : active raid10 sdj[4] sdl[2] sdk[1] sdm[3]
15627788288 blocks super 1.2 512K chunks 2 near-copies [4/3] [_UUU]
[>....................] recovery = 0.1% (12079104/7813894144) finish=649.1min speed=200321K/sec
bitmap: 41/117 pages [164KB], 65536KB chunk
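The finish estimate on the recovery line can be reproduced from the other numbers on it: (7813894144 - 12079104) blocks remaining at 200321K/sec is about 38947 seconds, or 649.1 minutes. The sketch below does the same arithmetic on whatever /proc/mdstat currently shows; it assumes the line format above, and the function name is made up.

```shell
#!/bin/sh
# Read /proc/mdstat-style text on stdin and, for each recovery line, print
# the remaining time in minutes computed from (done/total) and speed=...K/sec.
recovery_eta_minutes() {
    awk '
        /recovery *=/ {
            if (!match($0, /\([0-9]+\/[0-9]+\)/)) next
            split(substr($0, RSTART + 1, RLENGTH - 2), blocks, "/")
            if (!match($0, /speed=[0-9]+K\/sec/)) next
            speed = substr($0, RSTART + 6, RLENGTH - 11) + 0
            if (speed > 0)
                printf "%.1f\n", (blocks[2] - blocks[1]) / speed / 60
        }
    '
}

# Usage:
#   recovery_eta_minutes < /proc/mdstat
```

For the line captured above this prints 649.1, the same value the kernel reports in finish=649.1min.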
Check the md array status
mdadm --detail /dev/md126
....
Number Major Minor RaidDevice State
4 8 144 0 active sync set-A /dev/sdj
1 8 160 1 active sync set-B /dev/sdk
2 8 176 2 active sync set-A /dev/sdl
3 8 192 3 active sync set-B /dev/sdm
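Once all four members show active sync, the array should no longer be degraded. A final check could look at the State and Failed Devices lines of the same output. The helper below is a sketch under that assumption; the name is hypothetical, and treating "recovering" and "resyncing" as not yet healthy is a choice made here.

```shell
#!/bin/sh
# Read `mdadm --detail` output on stdin. Exit 0 only if a State line was seen,
# it mentions neither degraded, recovering nor resyncing, and Failed Devices is 0.
array_is_healthy() {
    awk '
        /^ *State :/ {
            seen = 1
            if ($0 ~ /degraded|recovering|resyncing/) bad = 1
        }
        /^ *Failed Devices :/ { if ($NF + 0 > 0) bad = 1 }
        END { exit ((seen && !bad) ? 0 : 1) }
    '
}

# Usage:
#   mdadm --detail /dev/md126 | array_is_healthy && echo "md126 healthy"
```

To block until the rebuild is done before running the check, `mdadm --wait /dev/md126` can be used; it returns once any resync or recovery on the array has finished.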