Postgres容灾库恢复
做vacuum full时因归档产生速度过快,备库应用不及时,导致缺失归档,备库损坏
1.主库做pg_rman全备
#!/bin/bash
source ~/.bash_profile
###############################################
echo 'Start time ----------->' `date`
pg_rman backup --backup-mode=full -C -P
pg_rman validate
pg_rman purge
###############################################
echo 'End time ----------->' `date`
2.scp全备文件到容灾库
[postgres@primaryhost ~]$ pg_rman show
=====================================================================
StartTime EndTime Mode Size TLI Status
=====================================================================
2021-05-18 18:30:01 2021-05-18 22:45:47 INCR 57GB 1 OK
2021-05-18 18:00:01 2021-05-18 18:15:23 ARCH 4474MB 1 OK
2021-05-17 18:00:01 2021-05-17 18:13:23 ARCH 3853MB 1 OK
2021-05-16 12:56:44 2021-05-16 19:53:32 FULL 118GB 1 OK
[postgres@primaryhost ~]$ scp -r $BACKUP_PATH/20210516 postgres@standbyip:/Data
3.容灾库安装pg_rman
# 解压安装包
[postgres@standbyhost software]$ tar -zxvf pg_rman-1.3.7-pg10.tar.gz
# 切换到安装路径
[postgres@standbyhost software]$ cd pg_rman-1.3.7-pg10
# 编译安装
[postgres@standbyhost software]$ make install
# 创建pg_rman备份路径
[postgres@standbyhost software]$ mkdir -p /pgdata03/pg_rman_backup
# 添加备份路径到环境变量
[postgres@standbyhost software]$ vi ~/.bash_profile
export BACKUP_PATH=/pgdata03/pg_rman_backup
# 重新加载环境变量
[postgres@standbyhost software]$ source ~/.bash_profile
# 初始化pg_rman,指定的目录不能有内容
[postgres@standbyhost software]$ pg_rman init -B /pgdata03/pg_rman_backup/
# 把主库scp过来的全备移动到备份目录里
[postgres@standbyhost software]$ mv 20210516 /pgdata03/pg_rman_backup/
# 查看全备信息
[postgres@standbyhost software]$ pg_rman show
=====================================================================
StartTime EndTime Mode Size TLI Status
=====================================================================
2021-05-16 12:56:44 2021-05-16 19:53:32 FULL 118GB 1 OK
4.scp主库pg_wal下的wal文件
从容灾库损坏时缺失的wal文件开始scp,少文件会导致恢复失败。
[postgres@primaryhost ~]$ cd $PGDATA/pg_wal
[postgres@primaryhost ~]$ scp 0000000100001C* postgres@standbyip:/archlog/
5.容灾库开始恢复
--recovery-target-time后接全备结束的时间,恢复时会将备库所有数据都清除,可以提前备份下postgres.conf和recovery.conf文件
[postgres@standbyhost ~]$ pg_rman restore --recovery-target-time='2021-05-16 19:53:32'
INFO: the recovery target timeline ID is not given
INFO: use timeline ID of current database cluster as recovery target: 1
INFO: calculating timeline branches to be used to recovery target point
INFO: searching latest full backup which can be used as restore start point
INFO: found the full backup can be used as base in recovery: "2021-05-16 12:56:44"
INFO: copying online WAL files and server log files
INFO: clearing restore destination
INFO: validate: "2021-05-16 12:56:44" backup and archive log files by SIZE
INFO: backup "2021-05-16 12:56:44" is valid
INFO: restoring database files from the full mode backup "2021-05-16 12:56:44"
INFO: searching incremental backup to be restored
INFO: searching backup which contained archived WAL files to be restored
INFO: backup "2021-05-16 12:56:44" is valid
INFO: restoring WAL files from backup "2021-05-16 12:56:44"
INFO: restoring online WAL files and server log files
INFO: generating recovery.conf
INFO: restore complete
HINT: Recovery will start automatically when the PostgreSQL server is started.
6.修改容灾库的postgres.conf
主备库配置(内存)不一样时需修改容灾库的postgres.conf对应的参数
7.修改recovery.conf文件
[postgres@primaryhost ~]$ more $PGDATA/recovery.conf
# recovery.conf generated by pg_rman 1.3.7
# restore_command = 'cp /archlog/tndmp21_arch/%f %p'
# recovery_target_time = '2021-08-30 04:35:07'
# recovery_target_timeline = '2'
standby_mode = 'on'
primary_conninfo = 'user=repuser password=repuser host=primaryip port=primaryport sslmode=prefer sslcompression=1 target_session_attrs=any'
# recovery_min_apply_delay = 240min # 先注释
recovery_target_timeline = 'latest'
8.创建逻辑複制槽目录
??主库存在逻辑複制槽时需手动创建逻辑複制槽目录并scp对应文件??
此步骤有疑问,第二次恢复时发现没有创建复制逻辑复制槽目录也全备恢复成功了,但第一次恢复确实有报错逻辑复制槽不存在的问题,报错信息没有保存下来,待考证
[postgres@standbyhost ~]$ mkdir -p $PGDATA/pg_replslot/slot_name1
[postgres@standbyhost ~]$ mkdir -p $PGDATA/pg_replslot/slot_name2
[postgres@primaryhost ~]$ scp $PGDATA/pg_replslot/slot_name1/state postgres@standbyip:$PGDATA/pg_replslot/slot_name1/
[postgres@primaryhost ~]$ scp $PGDATA/pg_replslot/slot_name2/state postgres@standbyip:$PGDATA/pg_replslot/slot_name2/
9.开启数据库
开启后备库会去应用wal日志,此时psql无法进入数据库
[postgres@standbyhost ~]$ pg_ctl start
waiting for server to start....2021-05-18 12:18:39.310 IST [51822] LOG: listening on IPv4 address "0.0.0.0", port 1922
2021-05-18 12:18:39.310 IST [51822] LOG: listening on IPv6 address "::", port 1922
2021-05-18 12:18:39.311 IST [51822] LOG: listening on Unix socket "/tmp/.s.PGSQL.1922"
2021-05-18 12:18:39.617 IST [51822] LOG: redirecting log output to logging collector process
2021-05-18 12:18:39.617 IST [51822] HINT: Future log output will appear in directory "/Data/pg_log".
........................................................... stopped waiting
pg_ctl: server did not start in time
执行以下命令可以看到数据库应用进度
[postgres@standbyhost ~]$ pg_controldata
pg_control version number: 1002
Catalog version number: 201707211
Database system identifier: 6673658210470541307
Database cluster state: in archive recovery
pg_control last modified: Tue 18 May 2021 12:22:16 PM IST
Latest checkpoint location: 1C23/403FAF98
Prior checkpoint location: 1C23/61E6678
Latest checkpoint's REDO location: 1C23/24E63F38
Latest checkpoint's REDO WAL file: 0000000100001C2300000001
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:2358852717
Latest checkpoint's NextOID: 106761126
Latest checkpoint's NextMultiXactId: 3066
Latest checkpoint's NextMultiOffset: 6131
Latest checkpoint's oldestXID: 2308491367
Latest checkpoint's oldestXID's DB: 65381
Latest checkpoint's oldestActiveXID: 2358790515
Latest checkpoint's oldestMultiXid: 1656
Latest checkpoint's oldestMulti's DB: 13163
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint: Mon 17 May 2021 06:04:51 AM IST
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 1C23/7A8DBCE0
Min recovery ending loc's timeline: 1
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: logical
wal_log_hints setting: off
max_connections setting: 10000
max_worker_processes setting: 28
max_prepared_xacts setting: 100
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 2097152
WAL block size: 8192
Bytes per WAL segment: 536870912
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
Mock authentication nonce: 9f382963477103d6a9ce144b52357c91638fe60a4319c9a89a6f4a852ce7843a
以下命令也可以查看应用进度
[postgres@dr-pg pgdata]$ ps -ef|grep -i postgres
root 27365 2071 0 Sep01 ? 00:00:00 sshd: postgres [priv]
postgres 27372 27365 0 Sep01 ? 00:00:00 sshd: postgres@pts/0
root 27373 2071 0 Sep01 ? 00:00:00 sshd: postgres [priv]
postgres 27376 27373 0 Sep01 ? 00:00:00 sshd: postgres@notty
postgres 27377 27372 0 Sep01 pts/0 00:00:00 -bash
postgres 27379 27376 0 Sep01 ? 00:00:00 /usr/libexec/openssh/sftp-server
postgres 27979 27377 0 Sep01 pts/0 00:00:00 vim postgresql.conf
root 30543 2071 0 12:43 ? 00:00:00 sshd: postgres [priv]
postgres 30623 30543 0 12:43 ? 00:00:00 sshd: postgres@pts/2
root 30624 2071 0 12:43 ? 00:00:00 sshd: postgres [priv]
postgres 30704 30623 0 12:43 pts/2 00:00:00 -bash
postgres 30706 30624 0 12:43 ? 00:00:00 sshd: postgres@notty
postgres 30720 30706 0 12:43 ? 00:00:00 /usr/libexec/openssh/sftp-server
postgres 48297 30704 1 12:50 pts/2 00:00:00 ps -ef
postgres 48298 30704 0 12:50 pts/2 00:00:00 grep -i postgres
postgres 56528 1 0 Aug31 ? 00:00:00 /postgresql/pgsql_10.6/bin/postgres
postgres 56729 56528 0 Aug31 ? 00:00:00 postgres: logger process
postgres 56730 56528 3 Aug31 ? 01:43:21 postgres: startup process recovering 00000001000022B500000005
postgres 56731 56528 0 Aug31 ? 00:22:26 postgres: checkpointer process
postgres 56732 56528 0 Aug31 ? 00:01:42 postgres: writer process
postgres 57045 56528 0 Aug31 ? 00:02:19 postgres: stats collector process
postgres 57046 56528 2 Aug31 ? 01:34:34 postgres: wal receiver process streaming 22B9/C2FEE8A0
10.删除容灾库逻辑複制槽
# 查看逻辑複制槽
[postgres@standbyhost ~]$ psql
psql (10.6)
Type "help" for help.
postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
------------------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+---------------+---------------------
slot_name1 | test_decoding | logical | 16384 | dbname | f | t | 4386 | | 2363531321 | 1C31/7E03C300 | 1C31/7FE6CBA8
slot_name2 | test_decoding | logical | 16384 | dbname | f | t | 5068 | | 2363532305 | 1C31/7FA43EA0 | 1C31/808D8A00
(2 rows)
# 删除逻辑複制槽
postgres=# SELECT * FROM pg_drop_replication_slot('slot_india_kafka');
postgres=# SELECT * FROM pg_drop_replication_slot('slot_zz_kafka');
# 再次查看
postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
(0 rows)
11.查看同步
同步跟上后取消recovery.conf文件里recovery_min_apply_delay = 240min的注释
postgres=# select * from pg_stat_wal_receiver;
补充:
在一次生产库容灾恢复中遇到问题,搜索很多资料并尝试,无果,看资料是主库有产生坏块,于是在无生产时对整库做了vacuum full,重新做了次全备,用vaccum full后的全备进行了恢复。
2021-08-23 13:48:48.014 IST,,,8233,,612359e6.2029,1,,2021-08-23 13:48:46 IST,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destinatio
n ""csvlog"".",,,,,,,""
2021-08-23 13:48:48.057 IST,,,8237,,612359e8.202d,1,,2021-08-23 13:48:48 IST,,0,LOG,00000,"database system was interrupted while in recovery at log time 2021-08-22 0
6:15:03 IST",,"If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.",,,,,,,""
2021-08-23 13:48:50.221 IST,,,8237,,612359e8.202d,2,,2021-08-23 13:48:48 IST,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2021-08-23 13:48:50.260 IST,,,8237,,612359e8.202d,3,,2021-08-23 13:48:48 IST,1/0,0,LOG,00000,"redo starts at 205D/876DF7D8",,,,,,,,,""
2021-08-23 13:50:50.483 IST,,,8237,,612359e8.202d,4,,2021-08-23 13:48:48 IST,1/0,0,WARNING,01000,"page 602111 of relation pg_tblspc/13051978/PG_10_201707211/16384/10
6654802 is uninitialized",,,,,,,,,""
......
2021-08-23 13:50:50.489 IST,,,8237,,612359e8.202d,736,,2021-08-23 13:48:48 IST,1/0,0,WARNING,01000,"page 600316 of relation pg_tblspc/13051978/PG_10_201707211/16384/
106654802 is uninitialized",,,,,,,,,""
2021-08-23 13:50:50.489 IST,,,8237,,612359e8.202d,737,,2021-08-23 13:48:48 IST,1/0,0,PANIC,XX000,"WAL contains references to invalid pages",,,,,,,,,""
2021-08-23 13:50:52.891 IST,,,8233,,612359e6.2029,2,,2021-08-23 13:48:46 IST,,0,LOG,00000,"startup process (PID 8237) was terminated by signal 6: Aborted",,,,,,,,,""
2021-08-23 13:50:52.891 IST,,,8233,,612359e6.2029,3,,2021-08-23 13:48:46 IST,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,""
2021-08-23 13:50:53.208 IST,,,8233,,612359e6.2029,4,,2021-08-23 13:48:46 IST,,0,LOG,00000,"database system is shut down",,,,,,,,,""
浙公网安备 33010602011771号