mdadm Fault Handling

Alert notification

HostRaidDiskFailure
Severity: warning
Start time: 2025-11-28T22:24:22.663Z
Failed host IP: xxxx:9100
Description: At least one device in RAID array on xxxx:9100 failed. Array  needs attention and possibly a disk swap
 VALUE = 1
 LABELS = map[__name__:node_md_disks cluster:kafka createdate:20231208 device:md126 envir:生产 hostname:xxxxx instance:xxxx:9100 job:zhongtai-node module:性能/采控 servername:性能/采控 state:failed type:os]
Details:
- alertname: HostRaidDiskFailure
- envir: 生产
- job: zhongtai-node
- severity: warning
- state: failed
- type: os

Check RAID status

cat /proc/mdstat
Personalities : [raid10] [raid1] 
md1 : active raid1 nvme1n1[1] nvme0n1[0]
      3125484864 blocks super 1.2 [2/2] [UU]
      bitmap: 3/24 pages [12KB], 65536KB chunk

md125 : active raid10 sdb[0] sdc[1] sde[3] sdd[2]
      15627788288 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 38/117 pages [152KB], 65536KB chunk

md126 : active raid10 sdl[2] sdk[1] sdj[0](F) sdm[3]
      15627788288 blocks super 1.2 512K chunks 2 near-copies [4/3] [_UUU]
      bitmap: 41/117 pages [164KB], 65536KB chunk

md127 : active raid10 sdg[1] sdf[0] sdh[2] sdi[3]
      15627788288 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 38/117 pages [152KB], 65536KB chunk

unused devices: <none>
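
With several arrays on one host, the degraded one can be picked out of /proc/mdstat mechanically. The following is a minimal sketch, assuming the status line of each array ends with a member map such as [UUUU] or [_UUU], as in the output above; the function name degraded_arrays is hypothetical.

```shell
# List md arrays whose member map (e.g. [_UUU]) contains "_",
# which marks a missing or failed member slot.
# Reads /proc/mdstat-formatted text on stdin.
degraded_arrays() {
  awk '
    /^md[^ ]* :/ { name = $1; next }          # remember the current array name
    name != "" && $NF ~ /^\[[U_]+\]$/ {       # status line ending in the member map
      if (index($NF, "_")) print name
      name = ""
    }
  '
}
```

Run as degraded_arrays < /proc/mdstat; for the output above it prints md126.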

Check md126 status

mdadm --detail /dev/md126
/dev/md126:
           Version : 1.2
     Creation Time : Sat Dec  2 00:47:01 2023
        Raid Level : raid10
        Array Size : 15627788288 (14903.82 GiB 16002.86 GB)
     Used Dev Size : 7813894144 (7451.91 GiB 8001.43 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Dec 24 14:18:01 2025
             State : active, degraded 
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 1
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : nmhs-pp-v2mw080047:103
              UUID : ff0b12ae:f767d785:96312351:76c52424
            Events : 3775086

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8      160        1      active sync set-B   /dev/sdk
       2       8      176        2      active sync set-A   /dev/sdl
       3       8      192        3      active sync set-B   /dev/sdm

       0       8      144        -      faulty   /dev/sdj
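
The faulty member can also be extracted from the device table instead of read by eye. This is a rough sketch, assuming the table layout shown above, where a faulty row contains the word "faulty" and ends with the device path; faulty_members is a hypothetical helper name.

```shell
# Print the device path of every member that mdadm --detail reports as faulty.
# Reads the output of "mdadm --detail <array>" on stdin.
faulty_members() {
  awk '$NF ~ /^\/dev\// && /faulty/ { print $NF }'
}
```

Run as mdadm --detail /dev/md126 | faulty_members; for the output above it prints /dev/sdj.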

Confirm disk information

ls -l /dev/sdj
brw-rw---- 1 root disk 8, 144 Nov 28 09:48 /dev/sdj
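
The major and minor numbers (8, 144) match the faulty row in the mdadm --detail output. Kernel names such as sdj can change across reboots, so before pulling a drive it may help to note its persistent name as well. The sketch below assumes the host exposes udev symlinks under /dev/disk/by-id and that readlink supports -f (GNU coreutils or busybox); disk_ids is a hypothetical helper.

```shell
# Print the persistent names in a by-id style directory that resolve to
# the given block device. The directory defaults to /dev/disk/by-id.
disk_ids() {
  _dev=$(readlink -f "$1")
  _dir=${2:-/dev/disk/by-id}
  for _link in "$_dir"/*; do
    [ -L "$_link" ] || continue               # skip anything that is not a symlink
    if [ "$(readlink -f "$_link")" = "$_dev" ]; then
      basename "$_link"
    fi
  done
}
```

Run as disk_ids /dev/sdj. The names it prints usually embed the drive model and serial number or WWN, which can be compared with the label on the physical drive.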

Remove the failed disk

Mark the disk as faulty

mdadm /dev/md126 --fail /dev/sdj
mdadm: set /dev/sdj faulty in /dev/md126

Remove the failed disk from the array

mdadm /dev/md126 --remove  /dev/sdj
mdadm: hot removed /dev/sdj from /dev/md126
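
mdadm also accepts --fail and --remove for the same member in a single invocation, so the two steps above can be combined. The wrapper below is only a sketch: the function name and the DRY_RUN variable are hypothetical, and it defaults to printing the command so nothing is changed by accident.

```shell
# Fail and hot-remove one member of an array in a single mdadm call.
# With DRY_RUN=1 (the default in this sketch) the command is only printed.
fail_and_remove() {
  _array=$1
  _member=$2
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "mdadm $_array --fail $_member --remove $_member"
  else
    mdadm "$_array" --fail "$_member" --remove "$_member"
  fi
}
```

Run fail_and_remove /dev/md126 /dev/sdj to preview the command, then repeat it with DRY_RUN=0 to execute it.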

Confirm RAID information

mdadm --detail /dev/md126
/dev/md126:
           Version : 1.2
     Creation Time : Sat Dec  2 00:47:01 2023
        Raid Level : raid10
        Array Size : 15627788288 (14903.82 GiB 16002.86 GB)
     Used Dev Size : 7813894144 (7451.91 GiB 8001.43 GB)
      Raid Devices : 4
     Total Devices : 3
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Dec 24 14:59:02 2025
             State : clean, degraded 
    Active Devices : 3
   Working Devices : 3
    Failed Devices : 0
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : nmhs-pp-v2mw080047:103
              UUID : ff0b12ae:f767d785:96312351:76c52424
            Events : 3777482

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8      160        1      active sync set-B   /dev/sdk
       2       8      176        2      active sync set-A   /dev/sdl
       3       8      192        3      active sync set-B   /dev/sdm

Add the new disk

mdadm --add /dev/md126 /dev/sdj
mdadm: added /dev/sdj

Check md126 recovery progress

cat /proc/mdstat
md126 : active raid10 sdj[4] sdl[2] sdk[1] sdm[3]
      15627788288 blocks super 1.2 512K chunks 2 near-copies [4/3] [_UUU]
      [>....................]  recovery =  0.1% (12079104/7813894144) finish=649.1min speed=200321K/sec
      bitmap: 41/117 pages [164KB], 65536KB chunk
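
The estimate on the recovery line is consistent with its own numbers: (7813894144 - 12079104) KiB remaining at 200321 K/sec is about 38947 seconds, or 649.1 minutes. For repeated checks, the progress fields can be pulled out of /proc/mdstat with a small helper. This is a sketch that assumes the recovery line format shown above; recovery_progress is a hypothetical name.

```shell
# Print "<array> <percent> <time-to-finish>" for every array that is
# currently recovering or resyncing.
# Reads /proc/mdstat-formatted text on stdin.
recovery_progress() {
  awk '
    /^md[^ ]* :/ { name = $1 }                 # remember the current array name
    /(recovery|resync) *=/ {
      pct = ""; eta = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /%$/)       pct = $i          # e.g. 0.1%
        if ($i ~ /^finish=/) eta = substr($i, 8)   # text after "finish="
      }
      print name, pct, eta
    }
  '
}
```

Run as recovery_progress < /proc/mdstat. To block until the rebuild is done rather than poll, mdadm --wait /dev/md126 is an alternative.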

Check md126 status

mdadm --detail /dev/md126
....
    Number   Major   Minor   RaidDevice State
       4       8      144        0      active sync set-A   /dev/sdj
       1       8      160        1      active sync set-B   /dev/sdk
       2       8      176        2      active sync set-A   /dev/sdl
       3       8      192        3      active sync set-B   /dev/sdm
posted @ 2026-01-08 17:44  小吉猫