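[pve] Checking RAID array status on Huawei servers
Last week a disk failed in one of our Dell servers. The fault reached icinga2 through Dell's bundled OpenManage software. After replacing the disk, I remembered that another PVE cluster runs on Huawei servers, and Huawei has no comparable hardware management software. So I installed the RAID controller vendor's tool and wrote a simple script to detect faults and raise alerts.
Installing the RAID vendor's monitoring tool
Identify the RAID controller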
# lspci  | grep -i raid
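01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)
The RAID controller is an "LSI Logic MegaRAID SAS-3 3108".
Download and install MegaRAID Storage Manager (MSM)
LSI is now part of Broadcom, so the download is on Broadcom's support site:
https://www.broadcom.cn/support
The downloaded archive contains only RPM packages, while PVE is based on Debian, so the RPMs have to be converted to deb packages with alien before they can be installed.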
apt install alien
tar zxvf 17.05.02.01_MSM_linux-x86.tar.gz
cd disk
alien --scripts *.rpm
dpkg --install lib-utils2_1.00-3_all.deb
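dpkg --install megaraid-storage-manager_17.05.02-2_all.deb
By default StorCLI is installed under /usr/local/MegaRAID Storage Manager/StorCLI/
Testing the tool
Show all adapter information. The output is long. Because the path contains spaces, it has to be quoted.
# "/usr/local/MegaRAID Storage Manager/StorCLI/storcli64" -AdpAllInfo -aALL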
Adapter #0
=====================================
                    Versions
                ================
Product Name    : SAS3108
Serial No       : 
FW Package Build: 24.16.0-0106
                    Mfg. Data
                ================
Mfg. Date       : 00/00/00
......
                Image Versions in Flash:
                ================
BIOS Version       : 6.32.02.0_4.17.08.00_0x06150500
......
                Pending Images in Flash
                ================
None
                PCI Info
                ================
Controller Id   : 0000
......
                HW Configuration
                ================
......
ROC temperature : 47  degree Celcius
                Settings
                ================
Current Time                     : 3:37:33 11/4, 2020
Predictive Fail Poll Interval    : 300sec
......
                Capabilities
                ================
RAID Level Supported             : RAID0, RAID1, RAID5, RAID6, RAID00, RAID10, RAID50, RAID60, PRL 11, PRL 11 with spanning, SRL 3 supported, PRL11-RLQ0 DDF layout with no span, PRL11-RLQ0 DDF layout with span
......
                Status
                ================
ECC Bucket Count                 : 0
                Limitations
                ================
Max Arms Per VD          : 32
......
                Device Present
                ================
Virtual Drives    : 1
  Degraded        : 0
Offline         : 0
Physical Devices  : 3
  Disks           : 2
Critical Disks  : 0
Failed Disks    : 0
                Supported Adapter Operations
                ================
Rebuild Rate                    : Yes
......
                Supported VD Operations
                ================
Read Policy          : Yes
Write Policy         : Yes
......
                Supported PD Operations
                ================
Force Online                            : Yes
......
T10 Power State                         : No
                Error Counters
                ================
Memory Correctable Errors   : 0
Memory Uncorrectable Errors : 0
                High Availability Properties
                ================
Topology Type                 : None
                Cluster Information
                ================
Cluster Permitted     : No
Cluster Active        : No
                Default Settings
                ================
Phy Polarity                     : 0
Phy PolaritySplit                : 0
Background Rate                  : 30
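......
Only the "Device Present" section matters here. If "Degraded", "Offline", "Critical Disks" and "Failed Disks" are all "0", the array and its disks are considered healthy; otherwise something has failed.
The "Device Present" heading is followed by 8 lines (the separator plus seven counters). If exactly 4 of those lines contain a 0, everything is OK. This is a rough heuristic: grep 0 matches any line that contains the digit 0, so a value such as 10 on one of the other lines would skew the count.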
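The zero-counting check can be made less fragile by reading the four counters by name instead of counting lines that happen to contain a 0. The following is only a sketch: the function name raid_fault_count is hypothetical, and it assumes the report uses exactly the labels and section headings shown in the sample output above.

```shell
#!/bin/bash
# Sum the four fault counters found in the "Device Present" section of an
# adapter report read from stdin. Prints the total (0 means healthy), or -1
# when fewer than four counters were found, so that a missing or changed
# report is not mistaken for a healthy array.
raid_fault_count() {
    awk -F: '
        /Device Present/ { in_section = 1; next }
        # The next section heading ends the block we care about.
        in_section && /Supported Adapter Operations/ { in_section = 0 }
        in_section && /^[ \t]*(Degraded|Offline|Critical Disks|Failed Disks)[ \t]*:/ {
            gsub(/[ \t\r]/, "", $2)   # strip padding around the number
            total += $2
            seen++
        }
        END {
            if (seen >= 4) print total + 0
            else print -1
        }
    '
}

# Intended use (assumes the install path from this post):
# "/usr/local/MegaRAID Storage Manager/StorCLI/storcli64" -AdpAllInfo -aALL | raid_fault_count
```

With this approach a "Physical Devices : 10" line no longer affects the result, because only the four named counters are read.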
Getting the status
A simple pipeline does the job:
"/usr/local/MegaRAID Storage Manager/StorCLI/storcli64" -AdpAllInfo -aALL | grep -A 8 'Device Present' | grep 0 | wc -l
Putting it into a script
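#!/bin/bash
# Count the lines in the "Device Present" section that contain a 0.
# The path contains spaces, so it must be quoted.
PRESENT=$("/usr/local/MegaRAID Storage Manager/StorCLI/storcli64" -AdpAllInfo -aALL | grep -A 8 "Device Present" | grep 0 | wc -l)
if [[ $PRESENT -eq 4 ]]; then
    echo 'All are OK'
    exit 0
else
    # Any other count means at least one counter is non-zero: exit 2 (critical).
    echo 'RAID fault detected'
    exit 2
fi
Testing the script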
# bash /mnt/pve/nfs199/pve/check_MegaRAID.sh
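All are OK
The script can now be combined with the DingTalk alert script described in an earlier post to send a warning through DingTalk when a fault occurs, or integrated into a monitoring platform such as nagios, zabbix or icinga.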
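For integration with nagios or icinga, plugins are expected to print a single status line and signal the state through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below maps a fault count onto that convention. The function name raid_plugin_status and the message wording are hypothetical, and it assumes the number of faulty devices has already been obtained by some other step.

```shell
#!/bin/bash
# Map a fault count to a Nagios/Icinga-style status line and return code.
# Return codes follow the plugin convention: 0 OK, 2 CRITICAL, 3 UNKNOWN.
raid_plugin_status() {
    faults=$1
    case "$faults" in
        0)
            echo "RAID OK - no degraded, offline, critical or failed devices"
            return 0 ;;
        ''|*[!0-9]*)
            # Empty or non-numeric input: the controller could not be read.
            echo "RAID UNKNOWN - could not read controller status"
            return 3 ;;
        *)
            echo "RAID CRITICAL - $faults device(s) report a fault"
            return 2 ;;
    esac
}
```

Returning UNKNOWN for unreadable input keeps a broken or missing tool from being reported as a healthy array.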