[ Linux ] Replacing a Hard Disk in a Server
Server model: ThinkServer RD650
Operating system: Red Hat 6.7
Business purpose: production monitoring host
If the MegaCli rpm packages still need to be installed, download them from: https://raw.githubusercontent.com/crazy-zhangcong/tools/master/MegaCli8.07.10.tar.gz
Both rpm packages must be installed.
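The data center sent an email reporting a disk alarm on one server, along with the device's SN and rack position. Use that information to find the device's IP address and determine what the device is used for.
Log in to the device and confirm that the SN matches: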
    # dmidecode | egrep -A10 -i 'system infor'
    System Information
            Manufacturer: LENOVO
            Product Name: ThinkServer RD650
            Version: XXXXX
            Serial Number: xxxxx    # S/N
            UUID: xxxxx
            Wake-up Type: Power Switch
            SKU Number: LENOVO_MT_ND
            Family: ThinkServer

    Handle 0x0002, DMI type 2, 15 bytes
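The manual comparison above can also be scripted. The following is a minimal sketch, assuming the usual dmidecode layout in which the "Serial Number" line sits inside the "System Information" section; the function name `extract_serial` and the variable `EXPECTED_SN` are made up for illustration.

```shell
# Read dmidecode output on stdin and print the serial number found in the
# "System Information" section (and only that section).
extract_serial() {
    awk -F': *' '
        /^System Information/ { in_sys = 1; next }
        /^Handle /            { in_sys = 0 }
        in_sys && /Serial Number/ { print $2; exit }
    '
}

# Usage (needs root). EXPECTED_SN is a placeholder for the SN from the alert:
#   sn=$(dmidecode | extract_serial)
#   if [ "$sn" = "$EXPECTED_SN" ]; then echo "serial matches"; else echo "MISMATCH: $sn"; fi
```

On systems where it is available, `dmidecode -s system-serial-number` prints the same field directly and avoids the parsing step.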
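The S/N confirms that this is the right server. It is a monitoring host, so first make sure the business has not been affected. Then use the MegaCli64 command to check the state of the physical disks: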
    # ./MegaCli64 -PDList -aALL | grep "Firmware state"
    Firmware state: Online, Spun Up
    Firmware state: Online, Spun Up
    Firmware state: Offline
    Firmware state: Online, Spun Up
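Grepping only for "Firmware state" shows the states but not which slot each one belongs to. A small sketch that pairs each non-Online state with its enclosure and slot follows, assuming the `-PDList` output contains "Enclosure Device ID", "Slot Number" and "Firmware state" lines for every drive; `list_unhealthy` is a hypothetical helper name.

```shell
# Read `MegaCli64 -PDList -aALL` output on stdin and print one line per
# drive whose firmware state does not start with "Online".
list_unhealthy() {
    awk -F': *' '
        /^Enclosure Device ID/ { enc = $2 }
        /^Slot Number/         { slot = $2 }
        /^Firmware state/ && $2 !~ /^Online/ {
            printf "[%s:%s] %s\n", enc, slot, $2
        }
    '
}

# Usage:
#   ./MegaCli64 -PDList -aALL | list_unhealthy
```

Note that a drive in the Rebuild state is also listed, since it is not yet Online.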
The third disk is already in the Offline state and needs to be dealt with as soon as possible. Check the RAID level again:
    # ./MegaCli64 -LDInfo -Lall -aALL

    Adapter 0 -- Virtual Drive Information:
    Virtual Drive: 0 (Target Id: 0)
    Name                :1
    RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0    # primary level: RAID 1
    Size                : 3.635 TB
    Mirror Data         : 3.635 TB
    State               : Degraded
    Strip Size          : 64 KB
    Number Of Drives per span:2
    Span Depth          : 2
    Default Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
    Current Cache Policy: WriteThrough, ReadAhead, Direct, No Write Cache if Bad BBU
    Default Access Policy: Read/Write
    Current Access Policy: Read/Write
    Disk Cache Policy   : Disabled
    Encryption Type     : None
    Is VD Cached: No

    Exit Code: 0x00
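The output reports the four disks with a primary level of RAID 1. Note, however, that Span Depth is 2 with 2 drives per span, which is how MegaCli commonly reports a mirror spanned across two pairs, that is, RAID 10. Even for a monitoring host, RAID 10 is the better choice for four disks.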
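Reading the RAID level from this output takes two fields, the primary level and the span depth. The sketch below combines them, assuming a single virtual drive and assuming the common MegaRAID convention that a span depth greater than 1 turns level N into level N0 (1 becomes 10, 5 becomes 50). The helper name `raid_summary` is made up for illustration.

```shell
# Read `MegaCli64 -LDInfo -Lall -aALL` output on stdin and print the
# effective RAID level and the virtual drive state.
raid_summary() {
    awk -F': *' '
        /^RAID Level/ {
            # "Primary-1, Secondary-0, ..." -> keep only the primary level
            p = $2; sub(/^Primary-/, "", p); sub(/,.*/, "", p)
        }
        /^Span Depth/ { depth = $2 + 0 }
        /^State/      { state = $2 }
        END {
            # Spanned arrays are reported by primary level plus span depth.
            level = (depth > 1) ? (p "0") : p
            printf "RAID %s, state %s\n", level, state
        }
    '
}

# Usage:
#   ./MegaCli64 -LDInfo -Lall -aALL | raid_summary
```

With more than one virtual drive, only the last one would be reported, so the output would need to be split per "Virtual Drive:" block first.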
Go to the data center, swap in a spare disk of the same model and capacity, then use the MegaCli64 command again to check the disk states:
    # ./MegaCli64 -PDList -aALL | grep "Firmware state"
    Firmware state: Online, Spun Up
    Firmware state: Online, Spun Up
    Firmware state: Rebuild
    Firmware state: Online, Spun Up
Check the rebuild progress:
    ### Find the Enclosure Device ID and Slot Number
    megacli -PDList -aAll -NoLog | egrep -B 20 -i rebuild
    Port's Linkspeed: 12.0Gb/s
    Drive has flagged a S.M.A.R.T alert : No
    Enclosure Device ID: 21
    Slot Number: 11
    Drive's postion: DiskGroup: 1, Span: 2, Arm: 1
    Enclosure position: 0
    Device Id: 18
    WWN: 5000039B88D28D36
    Sequence Number: 4
    Media Error Count: 0
    Other Error Count: 0
    Predictive Failure Count: 0
    Last Predictive Failure Event Seq Number: 0
    PD Type: SATA
    Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
    Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
    Coerced Size: 7.276 TB [0x3a3710000 Sectors]
    Firmware state: Rebuild

    ### Check the rebuild progress
    megacli -PDRbld -ShowProg -PhysDrv[21:11] -aAll
    Rebuild Progress on Device at Enclosure 21, Slot 11 Completed 4% in 5 Minutes.

    ### Show the rebuild progress as a progress bar
    ### Note: the argument is [Enclosure Device ID:Slot Number]
    megacli -pdrbld -progdsply -physdrv[21:11] -aALL
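The two lookups above (find the rebuilding drive, then read its progress) can be chained. The following is a rough sketch with two hypothetical helpers: `rebuild_target` prints the `[Enclosure:Slot]` of any drive in the Rebuild state, and `rebuild_eta` turns a "Completed P% in M Minutes" line into an estimate of the minutes remaining. The estimate assumes a constant rebuild rate, which real rebuilds under load often do not have.

```shell
# Read -PDList output on stdin; print "[enclosure:slot]" for each drive
# whose firmware state starts with "Rebuild".
rebuild_target() {
    awk -F': *' '
        /^Enclosure Device ID/ { enc = $2 }
        /^Slot Number/         { slot = $2 }
        /^Firmware state/ && $2 ~ /^Rebuild/ { printf "[%s:%s]\n", enc, slot }
    '
}

# Read -PDRbld -ShowProg output on stdin; print the estimated minutes left
# as elapsed * (100 - percent) / percent, truncated to a whole number.
rebuild_eta() {
    awk '
        /Completed [0-9]+% in [0-9]+ Minutes/ {
            for (i = 1; i < NF; i++) {
                if ($i == "Completed") pct = $(i + 1) + 0   # "4%" -> 4
                if ($i == "in")        min = $(i + 1) + 0
            }
            if (pct > 0) printf "%d\n", min * (100 - pct) / pct
        }
    '
}

# Usage:
#   drv=$(megacli -PDList -aAll -NoLog | rebuild_target)
#   megacli -PDRbld -ShowProg -PhysDrv"$drv" -aAll | rebuild_eta
```

For the sample output above (4% in 5 minutes) this gives 5 × 96 / 4 = 120 minutes remaining.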
This confirms that the new disk has been detected and the RAID group is rebuilding. After some time, check the rebuild progress again:
    # ./MegaCli64 -PDRbld -ProgDsply -PhysDrv [:2] -a0

Note: in [:2], disk indexes start at 0, so the third disk has index 2.
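The command shows a live, visual display of the rebuild progress:
[Figure: live rebuild progress display]
Alternatively, check the rebuild status with: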
megacli -AdpAllinfo -aALL | grep -i rebuild
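The rebuild of the disk at index 2 has reached 99% after 5 hours and 9 minutes. Once the rebuild has finished, check the state of the physical disks again: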
    # ./MegaCli64 -PDList -aALL | grep "Firmware state"
    Firmware state: Online, Spun Up
    Firmware state: Online, Spun Up
    Firmware state: Online, Spun Up
    Firmware state: Online, Spun Up
All disks are now Online. The disk replacement is complete, and no data was affected.
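This final check lends itself to a pass/fail test that a cron job or monitoring agent could call. A minimal sketch follows; `all_online` is a hypothetical helper that counts "Firmware state" lines and returns a non-zero exit status unless every drive is Online.

```shell
# Read `MegaCli64 -PDList -aALL` output on stdin, print "<ok>/<total> online",
# and exit 0 only if at least one drive was seen and all of them are Online.
all_online() {
    awk -F': *' '
        /^Firmware state/ { total++; if ($2 ~ /^Online/) ok++ }
        END {
            printf "%d/%d online\n", ok, total
            rc = (total > 0 && ok == total) ? 0 : 1
            exit rc
        }
    '
}

# Usage:
#   ./MegaCli64 -PDList -aALL | all_online || echo "RAID needs attention"
```

Treating an empty listing as a failure is a deliberate choice here: if MegaCli returns nothing, the check should not report success.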
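Question 1:
Is it true that data cannot be written to the logical drive while the RAID is rebuilding?
Answer: Testing in a real environment shows that a RAID rebuild does not prevent data from being written.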
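One simple way to repeat such a test is to write a small file to the affected filesystem while the rebuild is running and confirm it reaches the disk. This is only a sketch: `write_probe` is a made-up helper, the 1 MiB size is arbitrary, and `conv=fsync` assumes a dd implementation that supports it (GNU dd on Red Hat does).

```shell
# Write a 1 MiB probe file into the given directory, force it to disk,
# verify its size, and clean up. Returns 0 if the write succeeded.
write_probe() {
    dir=${1:-.}
    f="$dir/.raid_write_probe.$$"
    # conv=fsync makes dd flush the data to disk before it exits.
    dd if=/dev/zero of="$f" bs=1024 count=1024 conv=fsync 2>/dev/null || return 1
    size=$(wc -c < "$f")
    rm -f "$f"
    [ "$size" -eq 1048576 ]
}

# Usage: point it at a directory on the logical drive being rebuilt, e.g.
#   write_probe /data && echo "write OK during rebuild"
```

The path `/data` in the usage line is a placeholder for whichever mount point lives on the rebuilding array.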
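In this case the disk was hot-swapped and the RAID group rebuilt directly only because the server is a monitoring host whose data is not critical. On a business server, back up, back up, and back up again before replacing a disk. Do not place too much trust in redundancy and disaster recovery: once something uncontrollable happens, the data may be impossible to recover. Proceed with caution!
For more on using the MegaCli64 command, see: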
http://www.ttlsa.com/tools/megacli-monitor-raid-status/