Common Ceph issues -- requests are blocked

How to resolve the Ceph "requests are blocked" warning

Background:

The following warning is frequently seen in a Ceph cluster:

[root@xxx ~]# ceph -s
    cluster dc4f91c1-8792-4948-b68f-2fcea75f53b9
     health HEALTH_WARN 1 requests are blocked > 32 sec
     monmap e3: 5 mons at {xxx-ceph-cinder015-128055=240.30.128.55:6789/0,xxx-ceph-cinder017-128057=240.30.128.57:6789/0,xxx-ceph-cinder024-128074=240.30.128.74:6789/0,xxx-ceph-cinder025-128075=240.30.128.75:6789/0,xxx-ceph-cinder026-128076=240.30.128.76:6789/0}, election epoch 216, quorum 0,1,2,3,4 xxx-ceph-cinder015-128055,xxx-ceph-cinder017-128057,xxx-ceph-cinder024-128074,xxx-ceph-cinder025-128075,xxx-ceph-cinder026-128076
     osdmap e97975: 190 osds: 190 up, 190 in
      pgmap v13666786: 20544 pgs, 2 pools, 77479 GB data, 19508 kobjects
            228 TB used, 426 TB / 654 TB avail
               20542 active+clean
                   2 active+clean+scrubbing+deep
  client io 47657 kB/s rd, 164 MB/s wr, 5406 op/s

`1 requests are blocked > 32 sec` can occur during data migration: a client is accessing an object, but before the access completes the data is moved to another OSD. The in-flight request then gets blocked, which is visible to the client as stalled I/O.

Solution:

1. Find the blocked requests

(ceph-mon)[root@control01 /]# ceph health detail
    HEALTH_WARN 2 requests are blocked > 32 sec; 1 osds have slow requests
    2 ops are blocked > 4194.3 sec on osd.5
    1 osds have slow requests

The output shows that osd.5 has 2 blocked operations.
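When several OSDs report slow requests, it helps to pull the blocked OSD ids out of `ceph health detail` programmatically. A minimal sketch in Python; the sample text mirrors the output above, and the regex assumes the "N ops are blocked > T sec on osd.X" line format of this Ceph release:

```python
import re

def blocked_osds(health_detail: str) -> dict:
    """Map OSD id -> number of blocked ops, parsed from `ceph health detail` output."""
    result = {}
    # Lines look like: "2 ops are blocked > 4194.3 sec on osd.5"
    for m in re.finditer(r"(\d+) ops are blocked > [\d.]+ sec on osd\.(\d+)", health_detail):
        result[int(m.group(2))] = int(m.group(1))
    return result

sample = """HEALTH_WARN 2 requests are blocked > 32 sec; 1 osds have slow requests
2 ops are blocked > 4194.3 sec on osd.5
1 osds have slow requests"""

print(blocked_osds(sample))  # -> {5: 2}
```

In practice the input would come from `subprocess.check_output(["ceph", "health", "detail"])` rather than a hard-coded string.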

2. Find the host that owns the OSD

For example:
[root@TX-LNSGF-MANAGE-01 ~]# ceph osd find 5
{
    "osd": 5,
    "ip": "10.64.251.105:6809\/988225",
    "crush_location": {
        "host": "TX-LNSGF-STORAGE-05",
        "root": "default"
    }
}
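Since `ceph osd find` can emit JSON, the host can also be extracted in a script, which is handy when mapping several slow OSDs at once. A sketch assuming output shaped like the example above (`-f json` is the usual format flag):

```python
import json
import subprocess

def osd_host(osd_id: int) -> str:
    """Return the CRUSH host of an OSD via `ceph osd find <id> -f json`."""
    out = subprocess.check_output(["ceph", "osd", "find", str(osd_id), "-f", "json"])
    return json.loads(out)["crush_location"]["host"]

# Offline demonstration using the JSON shown above:
sample = ('{"osd": 5, "ip": "10.64.251.105:6809/988225", '
          '"crush_location": {"host": "TX-LNSGF-STORAGE-05", "root": "default"}}')
print(json.loads(sample)["crush_location"]["host"])  # -> TX-LNSGF-STORAGE-05
```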

3. Restart the OSD service

systemctl restart ceph-osd@5

Ceph will then run recovery on that OSD. During recovery the blocked requests are disconnected; the clients re-query the mon nodes, obtain a fresh PG map with the current data locations, and the blocked requests are resolved.
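After the restart it is convenient to poll the cluster health until the warning clears instead of re-running `ceph -s` by hand. A minimal sketch; the polling interval and timeout are arbitrary choices, and `wait_health_ok` assumes the `ceph` CLI is on the PATH:

```python
import subprocess
import time

def is_health_ok(status: str) -> bool:
    """True when `ceph health` output reports a healthy cluster."""
    return status.strip().startswith("HEALTH_OK")

def wait_health_ok(timeout: int = 600, interval: int = 10) -> bool:
    """Poll `ceph health` until HEALTH_OK or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.check_output(["ceph", "health"]).decode()
        if is_health_ok(out):
            return True
        time.sleep(interval)
    return False

print(is_health_ok("HEALTH_OK"))                                    # -> True
print(is_health_ok("HEALTH_WARN 1 requests are blocked > 32 sec"))  # -> False
```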

4. Check the cluster status

(ceph-mon)[root@control01 /]# ceph -s
   cluster b233a0b7-4e21-4375-bca8-e215c056cc25
   health HEALTH_OK
   monmap e1: 3 mons at {10.254.253.1=10.254.253.1:6789/0,10.254.253.2=10.254.253.2:6789/0,10.254.253.3=10.254.253.3:6789/0}
          election epoch 26, quorum 0,1,2 10.254.253.1,10.254.253.2,10.254.253.3
   osdmap e387: 90 osds: 90 up, 90 in
          flags sortbitwise,require_jewel_osds
   pgmap v1730238: 1008 pgs, 11 pools, 3498 GB data, 886 kobjects
         10453 GB used, 235 TB / 245 TB avail
          1006 active+clean
             2 active+clean+scrubbing+deep
 client io 1090 kB/s rd, 92507 kB/s wr, 778 op/s rd, 904 op/s wr
posted @ 2022-08-30 12:39  XU-NING