Common Ceph Cluster Warning Messages
I. 1 pool(s) do not have an application enabled
Warning message
[root@ceph141 ~]# ceph -s
cluster:
id: de7264fa-0e36-11f0-8f7b-9771d5b41507
health: HEALTH_WARN
2 pool(s) do not have an application enabled
Check the health status
[root@ceph141 ~]# ceph health detail
HEALTH_WARN 2 pool(s) do not have an application enabled
[WRN] POOL_APP_NOT_ENABLED: 2 pool(s) do not have an application enabled
application not enabled on pool 'mysql'
application not enabled on pool 'docker-data'
use 'ceph osd pool application enable <pool-name> <app-name>', where <app-name> is 'cephfs', 'rbd', 'rgw', or freeform for custom applications.
Cause: no application type has been declared for these pools; declaring one clears the warning.
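As the hint in the health detail suggests, the fix is to enable an application on each pool. A minimal sketch, assuming both pools shown above hold RBD images:
ceph osd pool application enable mysql rbd
ceph osd pool application enable docker-data rbd
ceph -s    # the cluster should return to HEALTH_OK shortly afterwards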
II. 1 pools have many more objects per pg than average
Warning message
[root@ceph141 ~]# ceph -s
cluster:
id: 11e66474-0e02-11f0-82d6-4dcae3d59070
health: HEALTH_WARN
1 pools have many more objects per pg than average
Error analysis
- Identify which pool is triggering the warning
[root@ceph141 ~]# ceph health detail
HEALTH_WARN 1 pools have many more objects per pg than average
[WRN] MANY_OBJECTS_PER_PG: 1 pools have many more objects per pg than average
pool dezyan objects per pg (837) is more than 26.1562 times cluster average (32)
- Problem analysis
The message is explicit: the dezyan pool has far more objects per PG than the cluster average. Each of its PGs currently holds 837 objects, 26.1562 times the average of 32 objects per PG.
- Check how the data is distributed across the cluster
[root@ceph141 ~]# ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 5.3 TiB 5.3 TiB 89 GiB 89 GiB 1.62
TOTAL 5.3 TiB 5.3 TiB 89 GiB 89 GiB 1.62
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
.mgr 1 1 449 KiB 2 1.3 MiB 0 1.6 TiB
dezyan 8 8 25 GiB 6.70k 76 GiB 1.51 1.6 TiB
linux96 11 16 1000 B 8 61 KiB 0 1.6 TiB
cephfs_data 12 256 76 MiB 3.31k 252 MiB 0 1.6 TiB
cephfs_metadata 13 32 20 MiB 28 61 MiB 0 1.6 TiB
[root@ceph141 ~]# rbd ls dezyan
child-xixi-001
docker
harbor
mysql80
node-exporter
prometheus
prometheus-server
ubuntu-2204
wordpress-db
[root@ceph141 ~]# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 0.29300 1.00000 300 GiB 4.3 GiB 3.1 GiB 4 KiB 1.2 GiB 296 GiB 1.43 0.88 57 up
1 hdd 0.48830 1.00000 500 GiB 21 GiB 19 GiB 14 KiB 1.8 GiB 479 GiB 4.19 2.58 90 up
2 hdd 1.00000 1.00000 1024 GiB 4.6 GiB 3.2 GiB 4 KiB 1.4 GiB 1019 GiB 0.45 0.28 166 up
3 hdd 0.29300 1.00000 300 GiB 7.3 GiB 6.3 GiB 10 KiB 1.1 GiB 293 GiB 2.44 1.51 53 up
4 hdd 0.48830 1.00000 500 GiB 11 GiB 9.6 GiB 54 KiB 1.1 GiB 489 GiB 2.14 1.32 87 up
5 hdd 1.00000 1.00000 1024 GiB 11 GiB 9.7 GiB 14 KiB 1.3 GiB 1013 GiB 1.07 0.66 173 up
6 hdd 0.29300 1.00000 300 GiB 4.5 GiB 3.2 GiB 4 KiB 1.3 GiB 295 GiB 1.50 0.93 49 up
7 hdd 0.48830 1.00000 500 GiB 4.5 GiB 3.3 GiB 4 KiB 1.2 GiB 495 GiB 0.91 0.56 89 up
8 hdd 1.00000 1.00000 1024 GiB 21 GiB 19 GiB 15 KiB 1.8 GiB 1003 GiB 2.03 1.25 175 up
TOTAL 5.3 TiB 89 GiB 77 GiB 127 KiB 12 GiB 5.3 TiB 1.62
MIN/MAX VAR: 0.28/2.58 STDDEV: 1.05
[root@ceph141 ~]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 5.34389 root default
-3 1.78130 host ceph141
0 hdd 0.29300 osd.0 up 1.00000 1.00000
1 hdd 0.48830 osd.1 up 1.00000 1.00000
2 hdd 1.00000 osd.2 up 1.00000 1.00000
-5 1.78130 host ceph142
3 hdd 0.29300 osd.3 up 1.00000 1.00000
4 hdd 0.48830 osd.4 up 1.00000 1.00000
5 hdd 1.00000 osd.5 up 1.00000 1.00000
-7 1.78130 host ceph143
6 hdd 0.29300 osd.6 up 1.00000 1.00000
7 hdd 0.48830 osd.7 up 1.00000 1.00000
8 hdd 1.00000 osd.8 up 1.00000 1.00000
[root@ceph141 ~]# ceph osd pool ls detail | grep dezyan
pool 8 'dezyan' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode off last_change 358 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 3.38
- Conclusion
The dezyan pool has only 8 PGs, so each PG accumulates far too many objects.
To bring the per-PG object count down to the cluster average of 32, the pool needs roughly 209 PGs in total:
[root@ceph141 ~]# echo "837 * 8"/32 | bc
209
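Ceph recommends keeping pg_num at a power of two for even data distribution, so the 209 figure is rounded up to 256, the value applied in Option 1 below. A quick shell sketch to find the next power of two above 209:
p=1; while [ "$p" -lt 209 ]; do p=$((p * 2)); done; echo "$p"    # prints 256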
Solutions
- Option 1: increase the pool's PG count [recommended]
Side effect: objects are migrated from the existing PGs to the new ones, which generates extra I/O (an autoscaler-based alternative is sketched after the output below).
[root@ceph141 ~]# ceph osd pool set dezyan pg_num 256
set pool 8 pg_num to 256
[root@ceph141 ~]# ceph osd pool set dezyan pgp_num 256
set pool 8 pgp_num to 256
[root@ceph141 ~]# ceph -s
cluster:
id: 11e66474-0e02-11f0-82d6-4dcae3d59070
health: HEALTH_OK
services:
mon: 3 daemons, quorum ceph141,ceph142,ceph143 (age 20h)
mgr: ceph141.mbakds(active, since 2d), standbys: ceph142.qgifwo
mds: 1/1 daemons up, 1 standby
osd: 9 osds: 9 up (since 20h), 9 in (since 2d)
data:
volumes: 1/1 healthy
pools: 5 pools, 441 pgs
objects: 8.88k objects, 21 GiB
usage: 87 GiB used, 5.3 TiB / 5.3 TiB avail
pgs: 21.769% pgs unknown
8.617% pgs not active
307 active+clean
96 unknown
38 peering
[root@ceph141 ~]# ceph osd pool ls detail | grep dezyan
pool 8 'dezyan' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 142 pgp_num 14 pg_num_target 256 pgp_num_target 256 autoscale_mode off last_change 478 lfor 0/0/478 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 3.99
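Since the pool was created with autoscale_mode off (visible in the ls detail output above), an alternative to setting pg_num by hand, not used in this case, would be to hand the decision to the PG autoscaler:
ceph osd pool set dezyan pg_autoscale_mode on
ceph osd pool autoscale-status    # shows the pg_num each pool is being steered towards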
- Option 2: remove unused block-device data to reduce the object count
Side effect: this also generates I/O, but it is cheaper and faster than Option 1 since it only deletes data. It is limited, though: data that is still in use cannot simply be removed.
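For example, if one of the RBD images listed earlier in the dezyan pool were genuinely unused (ubuntu-2204 is picked here purely as an illustration, not a recommendation), it could be removed like this:
rbd ls dezyan                  # double-check which images exist
rbd rm dezyan/ubuntu-2204      # only if this image really is no longer needed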
- Option 3: suppress the warning via mon_pg_warn_max_object_skew [still needs verification; it had not taken effect after waiting about 3 minutes]
1. Check the current values
[root@ceph141 ~]# ceph config get osd osd_pool_default_pg_num # check the default PG count per pool
32
[root@ceph141 ~]#
[root@ceph141 ~]# ceph config get osd mon_pg_warn_max_object_skew
10.000000
2. Change the setting
[root@ceph141 ~]# ceph config set osd mon_pg_warn_max_object_skew 0
[root@ceph141 ~]#
[root@ceph141 ~]# ceph config get osd mon_pg_warn_max_object_skew
0.000000
3. After about 2 minutes the change still had no visible effect (a possible explanation follows the output below).
[root@ceph141 ~]# ceph -s
cluster:
id: 11e66474-0e02-11f0-82d6-4dcae3d59070
health: HEALTH_WARN
1 pools have many more objects per pg than average
services:
mon: 3 daemons, quorum ceph141,ceph142,ceph143 (age 20h)
mgr: ceph141.mbakds(active, since 2d), standbys: ceph142.qgifwo
mds: 1/1 daemons up, 1 standby
osd: 9 osds: 9 up (since 20h), 9 in (since 2d)
data:
volumes: 1/1 healthy
pools: 5 pools, 313 pgs
objects: 10.04k objects, 26 GiB
usage: 87 GiB used, 5.3 TiB / 5.3 TiB avail
pgs: 313 active+clean
io:
client: 117 KiB/s wr, 0 op/s rd, 3 op/s wr
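A plausible reason the setting had no effect is that in recent Ceph releases the MANY_OBJECTS_PER_PG check is evaluated by ceph-mgr rather than the OSDs, so a value written to the osd section may never be read by the component that raises the warning. Two alternatives worth trying (both are assumptions to verify on this cluster, not steps taken in this article):
ceph config set mgr mon_pg_warn_max_object_skew 0    # apply the skew threshold where the mgr can see it
ceph health mute MANY_OBJECTS_PER_PG                 # or simply mute this specific health code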
This article is from cnblogs (author: 丁志岩). Please keep the original link when reposting: https://www.cnblogs.com/dezyan/p/18811461
