Simulating a failure
| Node | IP | State |
| --- | --- | --- |
| etcd-1 | 172.21.130.169 | True |
| etcd-2 | 172.21.130.168 | False |
| etcd-3 | 172.28.17.85 | False |
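Two of the three members are about to be stopped, which drops the cluster below its Raft quorum (floor(n/2) + 1, i.e. 2 of 3), so no leader can be elected. The arithmetic, as a quick sanity check:

```shell
# Raft quorum for an n-member cluster is a strict majority: floor(n/2) + 1
n=3
quorum=$(( n / 2 + 1 ))   # 2 for a 3-member cluster
alive=1                   # only etcd-1 will be left running
echo "quorum=$quorum alive=$alive"
if (( alive < quorum )); then
  echo "below quorum: the cluster cannot elect a leader"
fi
```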
Deliberately stop etcd on the other two nodes:
[root@master2 ~]# systemctl stop etcd
[root@master1 ~]# systemctl stop etcd
Check whether the cluster is still available (a single read is enough to verify):
[root@master ~]# etcdctl ${ep} get hello
{"level":"warn","ts":"2021-05-21T15:47:28.624+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-ef52c0b1-6bee-4ea0-851f-e841d615af83/172.21.130.169:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
[root@master ~]# etcdctl ${ep} endpoint status
{"level":"warn","ts":"2021-05-21T15:47:40.977+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://172.21.130.168:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 172.21.130.168:2379: connect: connection refused\""}
Failed to get the status of endpoint https://172.21.130.168:2379 (context deadline exceeded)
{"level":"warn","ts":"2021-05-21T15:47:45.978+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://172.28.17.85:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 172.28.17.85:2379: connect: connection refused\""}
Failed to get the status of endpoint https://172.28.17.85:2379 (context deadline exceeded)
https://172.21.130.169:2379, 4c978cbca553cd70, 3.4.16, 20 kB, false, false, 6, 11, 11, etcdserver: no leader
The check confirms the cluster is paralyzed: leader quorum has been lost, exactly as expected.
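For readability, the `${ep}` used in the etcdctl calls here is assumed to be a shell variable bundling the endpoint list and TLS client flags; the exact certificate paths are a guess based on the config file shown later in this post:

```shell
# Hypothetical definition of the ${ep} shorthand used in the etcdctl calls.
# The certificate file names are assumptions; adjust to your own PKI layout.
ep="--cacert=/var/local/etcd/ssl/ca.pem \
--cert=/var/local/etcd/ssl/server.pem \
--key=/var/local/etcd/ssl/server-key.pem \
--endpoints=https://172.21.130.169:2379,https://172.21.130.168:2379,https://172.28.17.85:2379"

# Used unquoted so the flags split into separate words:
#   etcdctl ${ep} endpoint status -w table
```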
The data matters, so what now?
Snapshots and the other tooling are unusable at this point, so fall back to a cold backup.
Start copying the data:
[root@master ~]# \cp -a /var/local/etcd/data/ ./data.`date +%F-%S`
[root@master ~]# ls
data.2021-05-21 data.2021-05-21-37 etcd_ssl etcd-v3.4.16-linux-amd64 etcd-v3.4.16-linux-amd64.tar.gz
[root@master ~]#
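One caveat on the timestamp: `date +%F-%S` only appends the seconds field, so names can repeat every minute. A fuller stamp avoids collisions; here is a small helper sketch (the function name is made up):

```shell
# Cold-backup helper (hypothetical): copy a data directory to a
# timestamped destination. %F-%H%M%S never repeats within a day,
# unlike the bare %F-%S used above.
backup_dir() {
  local src=$1 prefix=$2
  local dest="${prefix}.$(date +%F-%H%M%S)"
  \cp -a "$src" "$dest" && echo "$dest"
}

# e.g. backup_dir /var/local/etcd/data ./data
```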
Prepare to modify the config file and the unit file
[root@master ~]# cat /var/local/etcd/cfg/etcd.conf
#[Member]
ETCD_NAME="etcd-1"
ETCD_DATA_DIR="/var/local/etcd/data/default.etcd"
ETCD_LISTEN_PEER_URLS="https://172.21.130.169:2380"
ETCD_LISTEN_CLIENT_URLS="https://172.21.130.169:2379"
#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://172.21.130.169:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://172.21.130.169:2379"
ETCD_INITIAL_CLUSTER="etcd-1=https://172.21.130.169:2380,etcd-2=https://172.21.130.168:2380,etcd-3=https://172.28.17.85:2380"
ETCD_INITIAL_CLUSTER_TOKEN="my-etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new" # use "new" for a cluster with no data and "existing" for one that has data; remember this!!
#[security]
ETCD_CERT_FILE="/var/local/etcd/ssl/server.pem"
ETCD_KEY_FILE="/var/local/etcd/ssl/server-key.pem"
ETCD_TRUSTED_CA_FILE="/var/local/etcd/ssl/ca.pem"
ETCD_PEER_CERT_FILE="/var/local/etcd/ssl/member.pem"
ETCD_PEER_KEY_FILE="/var/local/etcd/ssl/member-key.pem"
ETCD_PEER_TRUSTED_CA_FILE="/var/local/etcd/ssl/ca.pem"
ETCD_CLIENT_CERT_AUTH="true"
ETCD_PEER_CLIENT_CERT_AUTH="true"
[root@master ~]#
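Since the state flag annotated above is easy to forget, a one-liner can flip it in place before a restart; a sketch against the config path shown above (GNU sed assumed, helper name made up):

```shell
# Set ETCD_INITIAL_CLUSTER_STATE in an etcd env file to "new" or "existing".
set_cluster_state() {
  local cfg=$1 state=$2
  sed -i "s/^ETCD_INITIAL_CLUSTER_STATE=.*/ETCD_INITIAL_CLUSTER_STATE=\"$state\"/" "$cfg"
}

# e.g. set_cluster_state /var/local/etcd/cfg/etcd.conf existing
```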
Modify the unit file
[root@master ~]# cat /usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
EnvironmentFile=/var/local/etcd/cfg/etcd.conf
ExecStart=/usr/local/bin/etcd --auto-compaction-retention=1 \
--max-request-bytes=31457280 \
--quota-backend-bytes=1073741824 \
--force-new-cluster=true \
--logger=zap
Restart=on-failure
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
[root@master ~]#
A word on this flag: `--force-new-cluster` forcibly rewrites the cluster membership, removing every member except the local one, so the node comes up as a single-member cluster. When combining it with the state setting, remember to use `new` (you are effectively breaking the cluster apart, casting off the other two nodes and running standalone). It works wonders for recovering a paralyzed cluster through a single node.
[root@master ~]# systemctl daemon-reload
[root@master ~]# systemctl restart etcd
[root@master ~]# systemctl status etcd
● etcd.service - Etcd Server
Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
Active: active (running) since 五 2021-05-21 15:48:57 CST; 5s ago
Main PID: 11801 (etcd)
CGroup: /system.slice/etcd.service
└─11801 /usr/local/bin/etcd --auto-compaction-retention=1 --max-request-bytes=31457280 --quota-backend-bytes=1073741824 --force-new-cluster=true --logger=zap
5月 21 15:48:56 master etcd[11801]: {"level":"info","ts":"2021-05-21T15:48:56.210+0800","caller":"rafthttp/peer.go:340","msg":"stopped remote peer","remote-p...43b3d8ce1"}
5月 21 15:48:56 master etcd[11801]: {"level":"info","ts":"2021-05-21T15:48:56.210+0800","caller":"rafthttp/transport.go:369","msg":"removed remote peer","loc...43b3d8ce1"}
5月 21 15:48:57 master etcd[11801]: {"level":"info","ts":"2021-05-21T15:48:57.993+0800","caller":"raft/raft.go:923","msg":"4c978cbca553cd70 is starting a new...t term 77"}
5月 21 15:48:57 master etcd[11801]: {"level":"info","ts":"2021-05-21T15:48:57.993+0800","caller":"raft/raft.go:713","msg":"4c978cbca553cd70 became candidate at term 78"}
5月 21 15:48:57 master etcd[11801]: {"level":"info","ts":"2021-05-21T15:48:57.993+0800","caller":"raft/raft.go:824","msg":"4c978cbca553cd70 received MsgVoteR...t term 78"}
5月 21 15:48:57 master etcd[11801]: {"level":"info","ts":"2021-05-21T15:48:57.993+0800","caller":"raft/raft.go:765","msg":"4c978cbca553cd70 became leader at term 78"}
5月 21 15:48:57 master etcd[11801]: {"level":"info","ts":"2021-05-21T15:48:57.993+0800","caller":"raft/node.go:325","msg":"raft.node: 4c978cbca553cd70 electe...t term 78"}
5月 21 15:48:57 master etcd[11801]: {"level":"info","ts":"2021-05-21T15:48:57.994+0800","caller":"etcdserver/server.go:2037","msg":"published local member to cluster th...
5月 21 15:48:57 master systemd[1]: Started Etcd Server.
5月 21 15:48:57 master etcd[11801]: {"level":"info","ts":"2021-05-21T15:48:57.996+0800","caller":"embed/serve.go:191","msg":"serving client traffic securely"....169:2379"}
Hint: Some lines were ellipsized, use -l to show in full.
[root@master ~]# etcdctl --endpoints=https://172.21.130.169:2379 endpoint status -w fields
"ClusterID" : 2943589120715358745
"MemberID" : 5519034610221305200
"Revision" : 3
"RaftTerm" : 78
"Version" : "3.4.16"
"DBSize" : 20480
"Leader" : 5519034610221305200
"IsLearner" : false
"RaftIndex" : 15
"RaftTerm" : 78
"RaftAppliedIndex" : 15
"Errors" : []
"Endpoint" : "https://172.21.130.169:2379"
[root@master ~]# etcdctl --endpoints=https://172.21.130.169:2379 endpoint status -w table
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://172.21.130.169:2379 | 4c978cbca553cd70 | 3.4.16 | 20 kB | true | false | 78 | 15 | 15 | |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Sure enough, our old friend is back. Now let's look at the data.
[root@master ~]# etcdctl --endpoints=https://172.21.130.169:2379 get hello
hello
world
[root@master ~]# etcdctl --endpoints=https://172.21.130.169:2379 get test
test
123
The data is intact and service is restored. But it is a single node now; what about that?
The data matters, so back it up once more just in case:
[root@master ~]# \cp -a /var/local/etcd/data/ ./data.`date +%F-%S`.single
[root@master ~]# ls
data.2021-05-21 data.2021-05-21-41.single etcd_ssl etcd-v3.4.16-linux-amd64 etcd-v3.4.16-linux-amd64.tar.gz
Begin the cluster recovery
First delete the stale data directory (only the node being used for recovery needs this, since the backups all live there)
rm -rf /var/local/etcd/data
Move the backed-up cluster data directory back into place and rename it to data
[root@master ~]# rm -rf /var/local/etcd/data
[root@master ~]# ls /var/local/etcd
bin cfg ssl
[root@master ~]# cp -a data.2021-05-22 /var/local/etcd/data
[root@master ~]# tree /var/local/etcd/data
/var/local/etcd/data
└── default.etcd
└── member
├── snap
│ └── db
└── wal
├── 0000000000000000-0000000000000000.wal
└── 0.tmp
4 directories, 3 files
[root@master ~]#
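Before restarting anything, it does not hurt to confirm the restored tree matches the backup byte for byte; a small sketch with `diff -r` (the helper name is made up):

```shell
# Verify a restored data directory against its backup copy.
verify_restore() {
  local backup=$1 restored=$2
  if diff -r "$backup" "$restored" >/dev/null; then
    echo "restore matches backup"
  else
    echo "restore differs from backup" >&2
    return 1
  fi
}

# e.g. verify_restore ./data.2021-05-22 /var/local/etcd/data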
First restart the service on the two downed nodes, then restart the machine we just ran as a single node to verify the data (remove `--force-new-cluster=true` from its unit file and run `systemctl daemon-reload` first, otherwise the membership would be reset again on every restart)
[root@master2 ~]# systemctl restart etcd
[root@master1 ~]# systemctl restart etcd
[root@master ~]# systemctl restart etcd
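Rather than checking status once and hoping, a tiny retry loop can poll until the cluster answers; the real probe here would be `etcdctl ${ep} endpoint health`, passed in as the command (the helper itself is a sketch):

```shell
# Poll an arbitrary health-check command until it succeeds or we give up.
wait_healthy() {
  local tries=$1; shift
  local i
  for (( i = 1; i <= tries; i++ )); do
    if "$@" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep 1
  done
  echo "still unhealthy after $tries attempt(s)" >&2
  return 1
}

# e.g. wait_healthy 30 etcdctl ${ep} endpoint health
```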
Verify the cluster status
[root@master ~]# etcdctl ${ep} endpoint status -w fields
"ClusterID" : 2943589120715358745
"MemberID" : 5519034610221305200
"Revision" : 3
"RaftTerm" : 280
"Version" : "3.4.16"
"DBSize" : 20480
"Leader" : 14703050348134501601
"IsLearner" : false
"RaftIndex" : 132
"RaftTerm" : 280
"RaftAppliedIndex" : 132
"Errors" : []
"Endpoint" : "https://172.21.130.169:2379"
"ClusterID" : 2943589120715358745
"MemberID" : 14703050348134501601
"Revision" : 3
"RaftTerm" : 280
"Version" : "3.4.16"
"DBSize" : 20480
"Leader" : 14703050348134501601
"IsLearner" : false
"RaftIndex" : 132
"RaftTerm" : 280
"RaftAppliedIndex" : 132
"Errors" : []
"Endpoint" : "https://172.21.130.168:2379"
"ClusterID" : 2943589120715358745
"MemberID" : 6237433037948641366
"Revision" : 3
"RaftTerm" : 280
"Version" : "3.4.16"
"DBSize" : 20480
"Leader" : 14703050348134501601
"IsLearner" : false
"RaftIndex" : 132
"RaftTerm" : 280
"RaftAppliedIndex" : 132
"Errors" : []
"Endpoint" : "https://172.28.17.85:2379"
[root@master ~]#
Now check whether the data is still there
[root@master ~]# etcdctl ${ep} get hello
hello
world
[root@master ~]# etcdctl ${ep} get test
test
123
With that, the cluster is fully recovered and running normally.
Facing life, I'm not trying to win. I just refuse to lose!