Kubernetes ETCD Backup and Restore Operations Guide
✅ Environment Confirmation
| Item | Value |
| --- | --- |
| Namespace | apisix |
| ETCD Pod name format | apisix-etcd-<N> (e.g. apisix-etcd-0) |
| Deployment method | Bitnami Helm Chart (built-in StatefulSet) |
| Protocol | HTTP (non-TLS) |
| Client endpoint | http://apisix-etcd.apisix.svc.cluster.local:2379 |
| Data directory | /bitnami/etcd/data |
| ETCDCTL_API | 3 |
| Authentication | TLS/certificate auth not enabled |
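Before wiring up backups, it is worth a quick sanity check of these assumptions (endpoint, protocol, API version). A minimal sketch, assuming etcdctl is available inside the apisix-etcd-0 pod:
kubectl exec -n apisix apisix-etcd-0 -- \
  etcdctl --endpoints="http://apisix-etcd.apisix.svc.cluster.local:2379" endpoint health
kubectl exec -n apisix apisix-etcd-0 -- \
  etcdctl --endpoints="http://apisix-etcd.apisix.svc.cluster.local:2379" endpoint status -w table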
✅ Final Customized Version: Automated Backup Solution
Save the following YAML as:
vim etcd-backup-cronjob.yaml
Then run kubectl apply -f etcd-backup-cronjob.yaml to put it into use.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: etcd-backup-pvc
  namespace: apisix
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: apisix
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          securityContext:
            fsGroup: 0                    # set at the Pod level
          imagePullSecrets:               # private registry credentials
            - name: huaweicloud-registry-secret
          containers:
            - name: etcd-backup
              image: swr.cn-north-4.myhuaweicloud.com/cfhy-common/etcd:3.5
              imagePullPolicy: IfNotPresent
              securityContext:
                runAsUser: 0              # run as root inside the container
                runAsGroup: 0
              command:
                - /bin/bash
                - -c
                - |
                  set -e
                  BACKUP_DIR=/backup
                  TIMESTAMP=$(date +%Y%m%d-%H%M%S)
                  SNAPSHOT_FILE=${BACKUP_DIR}/etcd-snapshot-${TIMESTAMP}.db
                  echo "=== Starting etcd snapshot backup ==="
                  etcdctl --endpoints="http://apisix-etcd.apisix.svc.cluster.local:2379" \
                    snapshot save ${SNAPSHOT_FILE}
                  echo "=== Verifying snapshot ==="
                  etcdctl snapshot status ${SNAPSHOT_FILE}
                  echo "=== Deleting snapshots older than 7 days ==="
                  find ${BACKUP_DIR} -type f -name "etcd-snapshot-*.db" -mtime +7 -delete || true
                  echo "=== Listing current backups ==="
                  ls -lh ${BACKUP_DIR}
              env:
                - name: ETCDCTL_API
                  value: "3"
                - name: TZ
                  value: Asia/Shanghai
              volumeMounts:
                - name: backup
                  mountPath: /backup
                  readOnly: false
          restartPolicy: OnFailure
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: etcd-backup-pvc
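After applying the manifest, a quick check that both resources exist and the PVC is Bound (resource names as defined above):
kubectl get pvc etcd-backup-pvc -n apisix
kubectl get cronjob etcd-backup -n apisix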
🔍 Verifying the Results
For the first run, you can trigger the job manually:
kubectl create job --from=cronjob/etcd-backup -n apisix etcd-backup-manual
Check the job status:
kubectl get pods -n apisix | grep etcd-backup
kubectl logs -n apisix <pod-name>
A successful run produces output like:
=== Starting etcd snapshot backup ===
Snapshot saved at /backup/etcd-snapshot-20251111-020000.db
=== Verifying snapshot ===
=== Listing current backups ===
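To confirm the daily 02:00 schedule is actually firing, check the CronJob's LAST SCHEDULE column and the recent Jobs it created:
kubectl get cronjob etcd-backup -n apisix
kubectl get jobs -n apisix | grep etcd-backup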
📂 Viewing Backup Files
# Load-test environment
ls -al /mnt/test-nfs/cfhy-pet-cce-01/pvc-243394b9-3f15-4094-9d24-7c6c2fcf64e8/
# Green production environment
ls -al /mnt/green-prod-nfs/cfhy-prod-green-cce/pvc-12b4f3cf-9c20-4357-abfd-1796e06f9d08
# Blue production environment
ls -al /mnt/prod-nfs/cfhy-prod-cce/pvc-6ea800d4-9584-4aa4-9919-7b35f610d33a
Example output:
-rw-rw---- 1 root root 2363424 Dec 15 15:40 etcd-snapshot-20251215-074032.db
-rw-rw---- 1 root root 2367520 Dec 15 15:44 etcd-snapshot-20251215-154421.db
-rw------- 1 root root 2363424 Dec 16 02:00 etcd-snapshot-20251216-020011.db
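If etcdctl also happens to be installed on the NFS host, a snapshot can be verified there directly. A sketch using the load-test path and one of the filenames above; adjust both for your environment:
ETCDCTL_API=3 etcdctl snapshot status \
  /mnt/test-nfs/cfhy-pet-cce-01/pvc-243394b9-3f15-4094-9d24-7c6c2fcf64e8/etcd-snapshot-20251216-020011.db -w table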
♻️ Restore Procedure (Disaster Recovery / Migration)
1. Make sure etcd is fully stopped
kubectl scale statefulset apisix-etcd -n apisix --replicas=0
Confirm that all pods have stopped:
kubectl get pod -n apisix | grep etcd
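The StatefulSet itself should also report zero ready replicas before you continue; a quick check:
kubectl get statefulset apisix-etcd -n apisix   # READY should show 0/0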
2. Create the restore Pod
Use a "restore Pod" that mounts the same PVCs.
⚠️ Note:
- This Pod only performs the restore
- It does not start etcd
cat etcd-restore.yaml
apiVersion: v1
kind: Pod
metadata:
  name: etcd-restore
  namespace: apisix
spec:
  restartPolicy: Never
  securityContext:
    fsGroup: 0
  containers:
    - name: restore
      image: swr.cn-north-4.myhuaweicloud.com/cfhy-common/etcd:3.5
      securityContext:
        runAsUser: 0
        runAsGroup: 0
      command: ["/bin/bash","-c"]
      args:
        - |
          set -e
          echo "== Clearing old data-dir =="
          rm -rf /bitnami/etcd/data/*
          echo "== Starting restore =="
          etcdctl snapshot restore /backup/etcd-snapshot-20251222-020012.db \
            --name apisix-etcd-0 \
            --initial-cluster apisix-etcd-0=http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380 \
            --initial-cluster-token apisix-etcd-cluster \
            --initial-advertise-peer-urls http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380 \
            --data-dir /bitnami/etcd/data
          echo "restore done"
      env:
        - name: ETCDCTL_API
          value: "3"
      volumeMounts:
        - name: data
          mountPath: /bitnami/etcd
        - name: backup
          mountPath: /backup
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-apisix-etcd-0
    - name: backup
      persistentVolumeClaim:
        claimName: etcd-backup-pvc
Note: replace the snapshot filename above with the one you want to restore.
kubectl apply -f etcd-restore.yaml
kubectl logs -n apisix etcd-restore
Seeing restore done means the restore succeeded.
3. Delete the restore Pod (optional)
kubectl delete pod etcd-restore -n apisix
4. Start etcd
kubectl scale statefulset apisix-etcd -n apisix --replicas=1
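To follow the startup instead of polling manually, one option (using the StatefulSet and pod names assumed throughout this guide):
kubectl rollout status statefulset/apisix-etcd -n apisix --timeout=300s
kubectl logs -n apisix apisix-etcd-0 --tail=50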
If the pod fails to start with the following error:
{"level":"warn","ts":"2025-12-22T04:02:21.307808Z","caller":"rafthttp/stream.go:653","msg":"request sent was ignored by remote peer due to cluster ID mismatch","remote-peer-id":"90126cc714381e07","remote-peer-cluster-id":"bfbce2358fdf94e6","local-member-id":"2c16fb63879f0d98","local-member-cluster-id":"b0d7015fda1525c8","error":"cluster ID mismatch"}
change the following environment variables and restart. Current values:
- name: ETCD_INITIAL_CLUSTER_STATE
  value: new
- name: ETCD_INITIAL_CLUSTER
  value: apisix-etcd-0=http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380,apisix-etcd-1=http://apisix-etcd-1.apisix-etcd-headless.apisix.svc.cluster.local:2380,apisix-etcd-2=http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380
Change them to:
- name: ETCD_INITIAL_CLUSTER_STATE
  value: "new"
- name: ETCD_INITIAL_CLUSTER
  value: "apisix-etcd-0=http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380"
5. Mandatory post-restore verification (do not skip)
kubectl exec -n apisix apisix-etcd-0 -- \
  etcdctl get /apisix --prefix --keys-only | head
Seeing output like the following means the restore succeeded:
/apisix/consumer_groups/
/apisix/consumers/
/apisix/global_rules/
/apisix/global_rules/1
/apisix/plugin_configs/
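A couple of optional extra checks on the restored node (replicas is still 1 at this point), sketched with etcdctl's default local endpoint inside the pod:
kubectl exec -n apisix apisix-etcd-0 -- \
  etcdctl get /apisix --prefix --keys-only | wc -l   # rough count of restored APISIX keys
kubectl exec -n apisix apisix-etcd-0 -- etcdctl endpoint status -w table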
‼️ If errors appear after the restore, check the following configuration and steps
When this happens, one node (apisix-etcd-0) is typically healthy and the additional replicas fail to start after scaling back up. In that case, proceed as follows. Current values:
- name: ETCD_INITIAL_CLUSTER_STATE
  value: "new"
- name: ETCD_INITIAL_CLUSTER
  value: "apisix-etcd-0=http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380"
Change them back to:
- name: ETCD_INITIAL_CLUSTER_STATE
  value: new
- name: ETCD_INITIAL_CLUSTER
  value: apisix-etcd-0=http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380,apisix-etcd-1=http://apisix-etcd-1.apisix-etcd-headless.apisix.svc.cluster.local:2380,apisix-etcd-2=http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380
Wait for the pods to start. If they all come up, the procedure is complete.
If errors persist, run the following:
# Check the cluster status
kubectl exec -n apisix apisix-etcd-1 -- etcdctl member list
2c16fb63879f0d98, started, apisix-etcd-1, http://apisix-etcd-1.apisix-etcd-headless.apisix.svc.cluster.local:2380, http://apisix-etcd-1.apisix-etcd-headless.apisix.svc.cluster.local:2379,http://apisix-etcd.apisix.svc.cluster.local:2379, false
3ff1b5cd453a87df, started, apisix-etcd-2, http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380, http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2379,http://apisix-etcd.apisix.svc.cluster.local:2379, false
90126cc714381e07, started, apisix-etcd-0, http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380, , false
In this state apisix-etcd-1 and apisix-etcd-2 are healthy, while apisix-etcd-0 is the faulty member (note its empty client URL field above).
Fix it as follows:
# Delete the faulty member's PVC; the cluster rebuilds the member and rejoins it automatically (if the PVC deletion hangs, delete the corresponding pod manually)
kubectl delete pvc data-apisix-etcd-0 -n apisix
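If the PVC deletion hangs because the pod still mounts it, delete the corresponding pod as noted in the comment above:
kubectl delete pod apisix-etcd-0 -n apisix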
Check the cluster status again; the state below is healthy:
kubectl exec -n apisix apisix-etcd-1 -- etcdctl member list
2c16fb63879f0d98, started, apisix-etcd-1, http://apisix-etcd-1.apisix-etcd-headless.apisix.svc.cluster.local:2380, http://apisix-etcd-1.apisix-etcd-headless.apisix.svc.cluster.local:2379,http://apisix-etcd.apisix.svc.cluster.local:2379, false
3ff1b5cd453a87df, started, apisix-etcd-2, http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380, http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2379,http://apisix-etcd.apisix.svc.cluster.local:2379, false
90126cc714381e07, started, apisix-etcd-0, http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2380, http://apisix-etcd-0.apisix-etcd-headless.apisix.svc.cluster.local:2379,http://apisix-etcd.apisix.svc.cluster.local:2379, false
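As a final check, all three members should report healthy; one way, using the member list above to resolve the endpoints:
kubectl exec -n apisix apisix-etcd-1 -- etcdctl endpoint health --cluster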