Notes on my first PG repair: abnormal PGs caused by directly pulling nodes

My first time repairing Ceph PGs, written down for the record.

Background: two nodes went bad and the server hardware was pulled out, but the CRUSH map was never updated. So I manually removed the OSDs and the corresponding host entries from the cluster. The arm73 and arm74 nodes each held a single OSD, with ids 1 and 7 respectively.

Removing the OSDs (the same four steps were run for osd.7):

    ceph osd out osd.1
    ceph osd crush rm osd.1
    ceph auth del osd.1
    ceph osd rm osd.1

Removing the arm73 and arm74 host buckets:

    ceph osd crush rm arm73
    ceph osd crush rm arm74

Afterwards the cluster status looked like this:

    root@arm71:/var/log/ceph# ceph -s
        cluster b6e874e6-3ae1-45c4-85cb-55f5e3625b56
         health HEALTH_WARN
                5 pgs degraded
                5 pgs stale
                5 pgs stuck stale
                5 pgs stuck unclean
                5 pgs undersized
                recovery 377/189470 objects degraded (0.199%)
         monmap e1: 1 mons at {arm71=192.168.0.71:6789/0}
                election epoch 1, quorum 0 arm71
         mdsmap e29: 1/1/1 up {0=arm72=up:active}
         osdmap e390: 16 osds: 16 up, 16 in
          pgmap v303321: 1216 pgs, 3 pools, 369 GB data, 94725 objects
                834 GB used, 42067 GB / 45162 GB avail
                377/189470 objects degraded (0.199%)
                    1211 active+clean
                       5 stale+active+undersized+degraded

Listing the abnormal PGs:

    root@arm71:/var/log/ceph# ceph pg dump_stuck unclean
    ok
    pg_stat  state                             up   up_primary  acting  acting_primary
    1.236    stale+active+undersized+degraded  [7]  7           [7]     7
    1.237    stale+active+undersized+degraded  [7]  7           [7]     7
    1.9e     stale+active+undersized+degraded  [7]  7           [7]     7
    0.f      stale+active+undersized+degraded  [7]  7           [7]     7
    1.e      stale+active+undersized+degraded  [7]  7           [7]     7
    root@arm71:/var/log/ceph# ceph pg map 0.f
    osdmap e388 pg 0.f (0.f) -> up [3,6] acting [3,6]

Five PGs were abnormal, all showing osd.7 as the acting OSD, even though osd.7 had already been removed from the cluster. At the same time, ceph pg map for one of these PGs reported OSDs that really do exist. In other words, somewhere the cluster was not picking up the PGs' new OSD mapping, or the PGs had some other problem. Restarting the mon node did not clear the error either.

First I queried a PG and tried to scrub it:

    root@arm71:/var/log/ceph# ceph pg 1.e query
    Error ENOENT: i don't have pgid 1.e
    root@arm71:/var/log/ceph# ceph pg scrub 1.e
    Error EAGAIN: pg 1.e primary osd.7 not up

Then I tried to repair the five PGs. Running ceph pg {pg-id} mark_unfound_lost revert reverts a PG's unfound objects, and ceph pg repair {pg-id} starts the PG repair process:

    root@arm71:/var/log/ceph# ceph pg repair 1.e
    Error EAGAIN: pg 1.e primary osd.7 not up
    root@arm71:/var/log/ceph# ceph pg 0.f mark_unfound_lost revert
    Error ENOENT: i don't have pgid 0.f

Both commands complained that the pgid did not exist, so I created the missing PGs with ceph pg force_create_pg $pg:

    root@arm71:/var/log/ceph# ceph pg force_create_pg 1.236
    pg 1.236 now creating, ok
    root@arm71:/var/log/ceph# ceph pg force_create_pg 1.237
    pg 1.237 now creating, ok
    root@arm71:/var/log/ceph# ceph pg force_create_pg 1.9e
    pg 1.9e now creating, ok
    root@arm71:/var/log/ceph# ceph pg force_create_pg 0.f
    pg 0.f now creating, ok
    root@arm71:/var/log/ceph# ceph pg force_create_pg 1.e
    pg 1.e now creating, ok
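Typing one force_create_pg per PG is fine for five PGs; with more, the same step can be scripted from the dump_stuck output. A minimal sketch, assuming the pgid sits in the first column of ceph pg dump_stuck unclean as in the transcript above:

    # a minimal sketch: force-create every PG that dump_stuck reports,
    # instead of typing one force_create_pg per PG; the pgid pattern in
    # the first column is an assumption based on the output above
    for pg in $(ceph pg dump_stuck unclean 2>/dev/null | awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ {print $1}'); do
        ceph pg force_create_pg "$pg"
    done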
Checking the cluster status afterwards showed the PGs stuck in the creating state:

    root@arm71:/var/log/ceph# ceph -s
        cluster b6e874e6-3ae1-45c4-85cb-55f5e3625b56
         health HEALTH_WARN
                5 pgs stuck inactive
                5 pgs stuck unclean
         monmap e1: 1 mons at {arm71=192.168.0.71:6789/0}
                election epoch 1, quorum 0 arm71
         mdsmap e29: 1/1/1 up {0=arm72=up:active}
         osdmap e392: 16 osds: 16 up, 16 in
          pgmap v303333: 1216 pgs, 3 pools, 368 GB data, 94348 objects
                834 GB used, 42067 GB / 45162 GB avail
                    1211 active+clean
                       5 creating

    root@arm72:~# ceph pg dump_stuck unclean
    ok
    pg_stat  state     up  up_primary  acting  acting_primary
    1.236    creating  []  -1          []      -1
    1.237    creating  []  -1          []      -1
    1.9e     creating  []  -1          []      -1
    0.f      creating  []  -1          []      -1
    1.e      creating  []  -1          []      -1

Searching Baidu for ceph pg force_create_pg turned up this thread: http://www.spinics.net/lists/ceph-users/msg21886.html which says:

    7) At this point, for the PGs to leave the 'creating' status, I had to
    restart all remaining OSDs. Otherwise those PGs were in the creating
    state forever.

So I restarted the Ceph services on the mon node:

    service ceph restart

    root@arm71:~# ceph health detail
    HEALTH_OK

Problem solved.
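On a bigger cluster, restarting the services node by node and watching ceph health by hand gets tedious. A minimal restart-and-wait sketch, assuming sysvinit-style service management as above and passwordless ssh to each host; the host names are placeholders for illustration:

    # a minimal sketch: restart the Ceph services on every remaining node,
    # then poll until the cluster reports HEALTH_OK
    # the node list below is hypothetical; substitute your own hosts
    for host in arm71 arm72 arm75 arm76; do
        ssh "root@${host}" service ceph restart
    done
    until ceph health | grep -q HEALTH_OK; do
        echo "still waiting: $(ceph health)"
        sleep 10
    done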
PS: Commonly used PG operations

List stuck PGs by category:

    # ceph pg dump_stuck stale
    # ceph pg dump_stuck inactive
    # ceph pg dump_stuck unclean
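When several PGs are stuck at once, it can help to capture each one's detailed state in one pass. A minimal sketch built on dump_stuck, again assuming the pgid is the first output column as in the transcripts above:

    # a minimal sketch: save the detailed query output of every
    # stuck-unclean PG into /tmp for inspection
    for pg in $(ceph pg dump_stuck unclean 2>/dev/null | awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ {print $1}'); do
        ceph pg "$pg" query > "/tmp/pg-${pg}-query.json" \
            || echo "pg ${pg}: query failed (primary not up?)"
    done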
Show a PG's OSD mapping:

    [root@osd2 ~]# ceph pg map 5.de
    osdmap e689 pg 5.de (5.de) -> up [0,4] acting [0,4]

Query a PG's detailed state:

    [root@osd2 ~]# ceph pg 5.de query

Scrub a PG to check that the primary and the replicas are consistent:

    [root@osd2 ~]# ceph pg scrub 5.de
    instructing pg 5.de on osd.0 to scrub
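To run the same consistency check across the whole cluster rather than one PG at a time, the scrub command can be looped over the full PG list. A minimal sketch, assuming ceph pg dump pgs_brief lists one PG per line with the pgid in the first column:

    # a minimal sketch: kick off a scrub on every PG in the cluster,
    # paced so the OSDs are not flooded with scrub requests at once
    for pg in $(ceph pg dump pgs_brief 2>/dev/null | awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ {print $1}'); do
        ceph pg scrub "$pg"
        sleep 1
    done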
Revert a PG's unfound objects:

    [root@osd2 ~]# ceph pg 5.de mark_unfound_lost revert
    pg has no unfound objects

Repair a PG:

    [root@osd1 mnt]# ceph pg repair 5.de
    instructing pg 5.de on osd.0 to repair
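Repair is normally pointed only at PGs that a scrub has flagged as inconsistent. A minimal sketch that feeds ceph health detail into repair; the "pg <pgid> is ... inconsistent" line format is an assumption about this Ceph version's output, so check it on your cluster first:

    # a minimal sketch: run repair only on the PGs that health detail
    # flags as inconsistent
    for pg in $(ceph health detail 2>/dev/null | awk '$1 == "pg" && /inconsistent/ {print $2}'); do
        ceph pg repair "$pg"
    done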
Mark an OSD as lost (the OSD must be down first, and the flag is deliberately scary):

    [root@osd2 ~]# ceph osd lost 1
    Error EPERM: are you SURE?  this might mean real, permanent data loss.  pass --yes-i-really-mean-it if you really do.
    [root@osd2 ~]# ceph osd lost 4 --yes-i-really-mean-it
    osd.4 is not down or doesn't exist
    [root@osd2 ~]# service ceph stop osd.4
    === osd.4 ===
    Stopping Ceph osd.4 on osd2...kill 22287...kill 22287...done
    [root@osd2 ~]# ceph osd lost 4 --yes-i-really-mean-it
    marked osd lost in epoch 690
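As the transcript shows, ceph osd lost refuses an OSD that is still up, so the stop and the lost-marking belong together. A minimal helper wrapping the two steps, using the same sysvinit service command as elsewhere in this post:

    # a minimal helper: an OSD must be down before ceph osd lost accepts it
    # WARNING: marking an OSD lost can mean real, permanent data loss
    mark_osd_lost() {
        local id="$1"
        service ceph stop "osd.${id}"
        ceph osd lost "${id}" --yes-i-really-mean-it
    }
    # usage: mark_osd_lost 4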
 

posted on 2017-06-01 13:09 by 歪歪121