Controller cache pinned for missing or offline VD

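Background

An OSD in a Ceph cluster suddenly went down, and storcli showed the disk behind it as offline (drive state Failed, virtual drive state OfLn).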

Troubleshooting steps

  • Check the state of every drive on the RAID controller
$ sudo storcli64 /c0/eall/sall show
CLI Version = 007.2309.0000.0000 Sep 16, 2022
Operating system = Linux 5.4.0-137-generic
Controller = 0
Status = Failure
Description = Show Drive Information Failed.

Detailed Status :
===============

--------------------------------
Drive      Status  ErrCd ErrMsg 
--------------------------------
/c0/e0/s1  Success     0 -      
/c0/e0/s2  Success     0 -      
/c0/e0/s4  Success     0 -      
/c0/e0/s5  Success     0 -      
/c0/e0/s7  Success     0 -      
/c0/e0/s8  Success     0 -      
/c0/e0/s10 Success     0 -      
/c0/e0/s11 Failure    46 -      
/c0/e0/s14 Success     0 -      
/c0/e0/s15 Success     0 -      
--------------------------------



Drive Information :
=================

----------------------------------------------------------------------------------
EID:Slt DID State  DG       Size Intf Med SED PI SeSz Model               Sp Type 
----------------------------------------------------------------------------------
0:1       7 Onln    1   7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -    
0:2      14 Onln    2   7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -    
0:4      13 Onln    3   7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -    
0:5       4 Onln    4   7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -    
0:7       3 Onln    5   7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -    
0:8       8 Onln    6   7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -    
0:10      9 Onln    7   7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -    
0:11     11 Failed  8   7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM    U  -    
0:14      1 Onln    0 138.766 GB SATA SSD N   N  512B INTEL SSDSC2BB150G7 U  -    
0:15      2 Onln    0 138.766 GB SATA SSD N   N  512B INTEL SSDSC2BB150G7 U  -    
----------------------------------------------------------------------------------

EID=Enclosure Device ID|Slt=Slot No|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=Protection Info
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild

/c0/e0/s11 returns error code 46, and its drive state is Failed.
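On a controller with many slots, the drive table can be filtered rather than read by eye. The following is a minimal sketch, assuming the text layout shown above (first field `EID:Slt`, third field `State`); the function name `non_online_drives` is hypothetical and not part of storcli.

```shell
# Hypothetical helper: read `storcli64 /cN/eall/sall show` output on stdin
# and print "EID:Slt State" for every drive whose State is not "Onln".
# Assumption: drive rows start with an EID:Slt field such as "0:11".
non_online_drives() {
    awk '$1 ~ /^[0-9]*:[0-9]+$/ && $3 != "Onln" { print $1, $3 }'
}

# Usage against a live controller (needs storcli64 and root):
#   sudo storcli64 /c0/eall/sall show | non_online_drives
```

Rows of the "Detailed Status" table start with a path such as `/c0/e0/s11`, so they do not match the pattern and are skipped.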

  • Check the state of the RAID virtual drives (VDs)
$ sudo storcli64 /c0/vall show

CLI Version = 007.2309.0000.0000 Sep 16, 2022
Operating system = Linux 5.4.0-137-generic
Controller = 0
Status = Success
Description = None


Virtual Drives :
==============

---------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC       Size Name 
---------------------------------------------------------------
0/0   RAID1 Optl  RW     Yes     NRWTD -   ON  138.766 GB      
1/1   RAID0 Optl  RW     Yes     NRWTD -   ON    7.276 TB      
2/2   RAID0 Optl  RW     Yes     NRWTD -   ON    7.276 TB      
3/3   RAID0 Optl  RW     Yes     NRWTD -   ON    7.276 TB      
4/4   RAID0 Optl  RW     Yes     NRWTD -   ON    7.276 TB      
5/5   RAID0 Optl  RW     Yes     NRWTD -   ON    7.276 TB      
6/6   RAID0 Optl  RW     Yes     NRWTD -   ON    7.276 TB      
7/7   RAID0 Optl  RW     Yes     NRWTD -   ON    7.276 TB      
8/8   RAID0 OfLn  RW     No      NRWTD -   ON    7.276 TB      
---------------------------------------------------------------

VD=Virtual Drive| DG=Drive Group|Rec=Recovery
Cac=CacheCade|OfLn=OffLine|Pdgd=Partially Degraded|Dgrd=Degraded
Optl=Optimal|dflt=Default|RO=Read Only|RW=Read Write|HD=Hidden|TRANS=TransportReady
B=Blocked|Consist=Consistent|R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack
AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency

At this point VD 8 is already offline (State OfLn).
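The same kind of filter works for the virtual drive table. This is a sketch under the assumption that VD rows begin with a `DG/VD` field such as `8/8` and that the third field is `State`; `non_optimal_vds` is a hypothetical name.

```shell
# Hypothetical helper: read `storcli64 /cN/vall show` output on stdin
# and print "DG/VD TYPE State" for every virtual drive that is not "Optl".
non_optimal_vds() {
    awk '$1 ~ /^[0-9]+\/[0-9]+$/ && $3 != "Optl" { print $1, $2, $3 }'
}

# Usage against a live controller (needs storcli64 and root):
#   sudo storcli64 /c0/vall show | non_optimal_vds
```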

  • Check the RAID controller events
$ sudo storcli64 /c0 show events filter=fatal

seqNum: 0x0000510a
Time: Sun Feb 12 18:11:01 2023

Code: 0x00000143
Class: 3
Locale: 0x21
Event Description: Controller cache pinned for missing or offline VD 08/8
Event Data:
===========
Target Id: 8


seqNum: 0x0000510b
Time: Sun Feb 12 18:11:01 2023

Code: 0x000000fc
Class: 3
Locale: 0x01
Event Description: VD 08/8 is now OFFLINE
Event Data:
===========
Target Id: 8
CLI Version = 007.2309.0000.0000 Sep 16, 2022
Operating system = Linux 5.4.0-137-generic
Controller = 0
Status = Success
Description = None

Events = GETEVENTS

Controller Properties :
=====================

------------------------------------
Ctrl Status  Method          Value  
------------------------------------
   0 Success handleSuboption Events 
------------------------------------

Trigger of the failure: Controller cache pinned for missing or offline VD
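When the event log is long, it may help to reduce each record to one line. The sketch below assumes the record layout shown above, where every event has a `Time:` line followed later by an `Event Description:` line; `event_summary` is a hypothetical name.

```shell
# Hypothetical helper: read `storcli64 /cN show events ...` output on stdin
# and print one "time | description" line per event record.
event_summary() {
    awk '
        /^Time:/              { sub(/^Time:[ \t]*/, ""); when = $0 }
        /^Event Description:/ { sub(/^Event Description:[ \t]*/, ""); print when " | " $0 }
    '
}

# Usage against a live controller (needs storcli64 and root):
#   sudo storcli64 /c0 show events filter=fatal | event_summary
```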

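Cause

The disk lost its connection for an unknown reason while the controller cache still held data that had not been flushed to it. The controller pins that data as preserved cache for the now-offline VD.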

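Solution

List the preserved cache, delete it for VD 8, set the drive back online, confirm that every drive reports Onln, then restart the affected OSD (ceph-osd@71 here). Deleting the preserved cache discards the writes that were never flushed to the disk.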

$ sudo storcli64 /c0 show preservedcache
CLI Version = 007.2309.0000.0000 Sep 16, 2022
Operating system = Linux 5.4.0-137-generic
Controller = 0
Status = Success
Description = None


--------------------
VD     Size State   
--------------------
 8 7.276 TB Offline 
--------------------
$ sudo storcli64 /c0/v8 delete preservedcache
CLI Version = 007.2309.0000.0000 Sep 16, 2022
Operating system = Linux 5.4.0-137-generic
Controller = 0
Status = Success
Description = Virtual Drive preserved Cache Data Cleared.
$ sudo storcli64 /c0/e0/s11 set online
CLI Version = 007.2309.0000.0000 Sep 16, 2022
Operating system = Linux 5.4.0-137-generic
Controller = 0
Status = Success
Description = Set Drive Online Succeeded.
$ sudo storcli64 /c0/eall/sall show 
CLI Version = 007.2309.0000.0000 Sep 16, 2022
Operating system = Linux 5.4.0-137-generic
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.


Drive Information :
=================

---------------------------------------------------------------------------------
EID:Slt DID State DG       Size Intf Med SED PI SeSz Model               Sp Type 
---------------------------------------------------------------------------------
0:1       7 Onln   1   7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -    
0:2      14 Onln   2   7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -    
0:4      13 Onln   3   7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -    
0:5       4 Onln   4   7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -    
0:7       3 Onln   5   7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -    
0:8       8 Onln   6   7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -    
0:10      9 Onln   7   7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -    
0:11     11 Onln   8   7.276 TB SATA HDD N   N  512B ST8000NM0055-1RM112 U  -    
0:14      1 Onln   0 138.766 GB SATA SSD N   N  512B INTEL SSDSC2BB150G7 U  -    
0:15      2 Onln   0 138.766 GB SATA SSD N   N  512B INTEL SSDSC2BB150G7 U  -    
---------------------------------------------------------------------------------

EID=Enclosure Device ID|Slt=Slot No|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=Protection Info
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild
$ sudo systemctl reset-failed ceph-osd@71
$ sudo systemctl restart ceph-osd@71
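The sequence above can be wrapped in a small function so that it is reviewed before it runs. This is only a sketch: `recover_vd`, `run` and the `DRY_RUN` variable are hypothetical names, the storcli and systemctl commands are the ones used in this post, and the mapping from VD to OSD ID has to be supplied by the operator. With `DRY_RUN` unset or set to 1 the commands are printed instead of executed.

```shell
# Hypothetical wrapper around the recovery steps in this post.
# Arguments: controller index, VD index, enclosure ID, slot number, OSD ID.
recover_vd() {
    ctrl=$1 vd=$2 enc=$3 slot=$4 osd=$5

    # Print the command in dry-run mode (the default), otherwise execute it.
    run() {
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$*"
        else
            "$@"
        fi
    }

    run sudo storcli64 "/c${ctrl}" show preservedcache           || return 1
    # Discards the pinned, unflushed writes of this VD.
    run sudo storcli64 "/c${ctrl}/v${vd}" delete preservedcache  || return 1
    run sudo storcli64 "/c${ctrl}/e${enc}/s${slot}" set online   || return 1
    run sudo storcli64 "/c${ctrl}/eall/sall" show                || return 1
    run sudo systemctl reset-failed "ceph-osd@${osd}"            || return 1
    run sudo systemctl restart "ceph-osd@${osd}"                 || return 1
}

# Review the commands first:
#   recover_vd 0 8 0 11 71
# Then execute them:
#   DRY_RUN=0 recover_vd 0 8 0 11 71
```

Stopping at the first failing step is a deliberate choice: setting the drive online only makes sense once the preserved cache has actually been cleared.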

posted @ 2023-02-14 09:53  ishmaelwanglin