Symptom:

Running ETCDCTL_API=3  /opt/etcd/bin/etcdctl --endpoints="http://apisix-etcd-0.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379,http://apisix-etcd-1.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379,http://apisix-etcd-2.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379" endpoint status -w table to check storage usage shows a NOSPACE alarm.

Cause: the etcd storage is full. The space in question is not disk space: etcd enforces a backend space quota, which defaults to 2G.

Fix:

# First, move the leader (applies when all 3 etcd members, including the leader, are down)

kubectl exec -it  sts/apisix-etcd -n ingress-apisix  --  /opt/bitnami/etcd/bin/etcdctl --endpoints="http://apisix-etcd-0.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379,http://apisix-etcd-1.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379,http://apisix-etcd-2.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379" move-leader 62c3c5516bb89b91  # 62c3c5516bb89b91 is the target member ID

# Get the current revision number of the etcd cluster

rev=$(ETCDCTL_API=3  /opt/etcd/bin/etcdctl --endpoints="http://apisix-etcd-0.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379,http://apisix-etcd-1.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379,http://apisix-etcd-2.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379" endpoint status -w json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]*'| awk 'NR==1{print $1}')

# Compact the etcd revision history up to that revision

ETCDCTL_API=3  /opt/etcd/bin/etcdctl --endpoints="http://apisix-etcd-0.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379,http://apisix-etcd-1.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379,http://apisix-etcd-2.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379"  compact $rev
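The revision lookup and compaction above can be wrapped in a small helper so that an empty or malformed result never reaches `compact`. This is only a sketch: `extract_revision`, `ETCDCTL`, and `ENDPOINTS` are hypothetical names introduced here, and the helper assumes the JSON output contains `"revision":<number>` fields, as the egrep pipeline above already does.

```shell
# Hypothetical helper: read `etcdctl endpoint status -w json` output on stdin
# and print the first "revision" value found (one value, digits only).
extract_revision() {
  grep -o '"revision":[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$'
}

# Usage (ETCDCTL and ENDPOINTS stand in for the binary path and endpoint list used above):
#   rev=$("$ETCDCTL" --endpoints="$ENDPOINTS" endpoint status -w json | extract_revision)
#   [ -n "$rev" ] && "$ETCDCTL" --endpoints="$ENDPOINTS" compact "$rev"
```

Checking that `rev` is non-empty before compacting avoids running `compact` with no argument when the status call fails.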

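# Defragment the etcd storage. Note: defragmentation blocks reads and writes on the etcd member being defragmented. With a large data set, defragment one member at a time so that cluster stability is not affected, and run it during off-peak hours.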

ETCDCTL_API=3  /opt/etcd/bin/etcdctl --endpoints="http://apisix-etcd-0.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379,http://apisix-etcd-1.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379,http://apisix-etcd-2.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379" defrag
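To follow the one-member-at-a-time advice explicitly, the defrag can be driven by a loop that targets a single endpoint per call and stops at the first failure. A minimal sketch: `defrag_each` and the `ETCDCTL` override are hypothetical names, and the default binary path is the one used in this post.

```shell
# Hypothetical helper: defragment the given endpoints one at a time,
# stopping at the first failure so a broken member is noticed early.
defrag_each() {
  for ep in "$@"; do
    echo "defragmenting $ep"
    ${ETCDCTL:-/opt/etcd/bin/etcdctl} --endpoints="$ep" defrag || return 1
  done
}

# Usage:
#   defrag_each \
#     http://apisix-etcd-0.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379 \
#     http://apisix-etcd-1.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379 \
#     http://apisix-etcd-2.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379
```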

# Check the status of the etcd cluster members

ETCDCTL_API=3  /opt/etcd/bin/etcdctl --endpoints="http://apisix-etcd-0.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379,http://apisix-etcd-1.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379,http://apisix-etcd-2.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379" endpoint status -w table

# List active alarms

ETCDCTL_API=3  /opt/etcd/bin/etcdctl --endpoints="http://apisix-etcd-0.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379,http://apisix-etcd-1.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379,http://apisix-etcd-2.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379" alarm list

# Disarm the alarms in the etcd cluster

ETCDCTL_API=3  /opt/etcd/bin/etcdctl --endpoints="http://apisix-etcd-0.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379,http://apisix-etcd-1.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379,http://apisix-etcd-2.apisix-etcd-headless.ingress-apisix.svc.cluster.local:2379" alarm disarm
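After disarming, it can help to confirm that the alarm is really gone, since it may be raised again if the database is still over quota. A small sketch, assuming `alarm list` prints lines of the form `memberID:<id> alarm:NOSPACE`; `has_nospace_alarm`, `ETCDCTL`, and `ENDPOINTS` are hypothetical names.

```shell
# Hypothetical helper: succeed if `etcdctl alarm list` output on stdin
# still reports a NOSPACE alarm.
has_nospace_alarm() {
  grep -q 'alarm:NOSPACE'
}

# Usage (ETCDCTL and ENDPOINTS stand in for the values used above):
#   if "$ETCDCTL" --endpoints="$ENDPOINTS" alarm list | has_nospace_alarm; then
#     echo "NOSPACE alarm still active" >&2
#   fi
```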

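Note: when recovering data without a backup, follow the rules below (very important!)

# Prerequisite 1: back up the original data directory first. For etcd deployed as a StatefulSet (sts), make sure the leader is node 0.

# Prerequisite 2: if the cluster went down because the database space was full, set ETCD_QUOTA_BACKEND_BYTES to a value larger than the current data size.

# Check how much space the database file is using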

du -sh /data/member/snap/db 
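ETCD_QUOTA_BACKEND_BYTES takes a value in bytes, so a quota chosen in GiB has to be converted first. A minimal sketch; `gib_to_bytes` is a hypothetical helper and the 8 GiB figure is only an illustration, not a value taken from this cluster.

```shell
# Hypothetical helper: convert a whole number of GiB to bytes.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Example: an 8 GiB quota (the 8 is illustrative; pick a value larger than
# the size reported by the du command above).
# export ETCD_QUOTA_BACKEND_BYTES="$(gib_to_bytes 8)"
```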

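1. Make sure the etcd configuration has ETCD_INITIAL_CLUSTER_STATE=new

2. Follow the steps above and run alarm disarm to clear the alarm, making sure a single etcd node can start normally (at this point the member list should contain only that one node)

3. Change the setting to ETCD_INITIAL_CLUSTER_STATE=existing and confirm that the single node still starts normally

4. Stop the 2 abnormal nodes and clear their data

5. Add the nodes back to the member list one at a time

6. This verifies that the etcd cluster data can be repaired quickly even without a backup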
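Step 5 above can be sketched as a helper that registers one member with `member add` and then waits for it to report healthy before the next one is added. This is a sketch under assumptions, not the exact procedure used here: `add_member_and_wait`, `ETCDCTL`, `MAX_TRIES`, and `RETRY_SECONDS` are hypothetical names, and the member name and URLs are whatever the cleaned node is configured with.

```shell
# Hypothetical helper: register one member, then wait until it reports healthy.
# Arguments: <healthy-endpoint> <member-name> <peer-url> <new-member-client-url>
add_member_and_wait() {
  etcdctl=${ETCDCTL:-/opt/etcd/bin/etcdctl}
  "$etcdctl" --endpoints="$1" member add "$2" --peer-urls="$3" || return 1
  # The new member must be started separately (empty data directory,
  # ETCD_INITIAL_CLUSTER_STATE=existing); this loop only waits for it.
  tries=0
  until "$etcdctl" --endpoints="$4" endpoint health; do
    tries=$((tries + 1))
    [ "$tries" -ge "${MAX_TRIES:-30}" ] && return 1
    sleep "${RETRY_SECONDS:-5}"
  done
}

# Usage: call once per node, and move on only after the call succeeds.
#   add_member_and_wait <healthy-endpoint> <member-name> <peer-url> <client-url>
```

Waiting for each member to become healthy before adding the next keeps the cluster from losing quorum while members are still catching up.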

posted on 2023-11-10 17:37  MhaiM