PostgreSQL 容灾恢复

Posted on 2021-08-25 15:16 Sxcan 阅读(818) 评论(0) 收藏举报

Postgres容灾库恢复

做vacuum full时因归档产生速度过快，备库应用不及时，导致缺失归档，备库损坏

1.主库做pg_rman全备

#!/bin/bash
source ~/.bash_profile
###############################################
echo 'Start time ----------->' `date`
pg_rman backup --backup-mode=full -C -P
pg_rman validate
pg_rman purge
###############################################
echo 'End time ----------->' `date`

2.scp全备文件到容灾库

[postgres@primaryhost ~]$ pg_rman show
=====================================================================
 StartTime           EndTime              Mode    Size   TLI  Status 
=====================================================================
2021-05-18 18:30:01  2021-05-18 22:45:47  INCR    57GB     1  OK
2021-05-18 18:00:01  2021-05-18 18:15:23  ARCH  4474MB     1  OK
2021-05-17 18:00:01  2021-05-17 18:13:23  ARCH  3853MB     1  OK
2021-05-16 12:56:44  2021-05-16 19:53:32  FULL   118GB     1  OK
[postgres@primaryhost ~]$ scp -r $BACKUP_PATH/20210516 postgres@standbyip:/Data

3.容灾库安装pg_rman

# 解压安装包
[postgres@standbyhost software]$ tar -zxvf pg_rman-1.3.7-pg10.tar.gz
# 切换到安装路径
[postgres@standbyhost software]$ cd pg_rman-1.3.7-pg10
# 编译安装
[postgres@standbyhost software]$ make install
# 创建pg_rman备份路径
[postgres@standbyhost software]$ mkdir -p /pgdata03/pg_rman_backup
# 添加备份路径到环境变量
[postgres@standbyhost software]$ vi ~/.bash_profile
export BACKUP_PATH=/pgdata03/pg_rman_backup
# 重新加载环境变量
[postgres@standbyhost software]$ source ~/.bash_profile
# 初始化pg_rman，指定的目录不能有内容
[postgres@standbyhost software]$ pg_rman init -B /pgdata03/pg_rman_backup/
# 把主库scp过来的全备移动到备份目录里
[postgres@standbyhost software]$ mv 20210516 /pgdata03/pg_rman_backup/
# 查看全备信息
[postgres@standbyhost software]$ pg_rman show 
=====================================================================
 StartTime           EndTime              Mode    Size   TLI  Status 
=====================================================================
2021-05-16 12:56:44  2021-05-16 19:53:32  FULL   118GB     1  OK

4.scp主库pg_wal下的wal文件

从容灾库损坏时缺失的wal文件开始scp，少文件会导致恢复失败。

[postgres@primaryhost ~]$ cd $PGDATA/pg_wal
[postgres@primaryhost ~]$ scp 0000000100001C* postgres@standbyip:/archlog/

5.容灾库开始恢复

--recovery-target-time后接全备结束的时间,恢复时会将备库所有数据都清除，可以提前备份下postgres.conf和recovery.conf文件

[postgres@standbyhost ~]$ pg_rman restore --recovery-target-time='2021-05-16 19:53:32'
INFO: the recovery target timeline ID is not given
INFO: use timeline ID of current database cluster as recovery target: 1
INFO: calculating timeline branches to be used to recovery target point
INFO: searching latest full backup which can be used as restore start point
INFO: found the full backup can be used as base in recovery: "2021-05-16 12:56:44"
INFO: copying online WAL files and server log files
INFO: clearing restore destination
INFO: validate: "2021-05-16 12:56:44" backup and archive log files by SIZE
INFO: backup "2021-05-16 12:56:44" is valid
INFO: restoring database files from the full mode backup "2021-05-16 12:56:44"
INFO: searching incremental backup to be restored
INFO: searching backup which contained archived WAL files to be restored
INFO: backup "2021-05-16 12:56:44" is valid
INFO: restoring WAL files from backup "2021-05-16 12:56:44"
INFO: restoring online WAL files and server log files
INFO: generating recovery.conf
INFO: restore complete
HINT: Recovery will start automatically when the PostgreSQL server is started.

6.修改容灾库的postgres.conf

主备库配置(内存)不一样时需修改容灾库的postgres.conf对应的参数

7.修改recovery.conf文件

[postgres@primaryhost ~]$ more $PGDATA/recovery.conf
# recovery.conf generated by pg_rman 1.3.7
# restore_command = 'cp /archlog/tndmp21_arch/%f %p'
# recovery_target_time = '2021-08-30 04:35:07'
# recovery_target_timeline = '2' 
standby_mode = 'on'
primary_conninfo = 'user=repuser password=repuser host=primaryip port=primaryport sslmode=prefer sslcompression=1 target_session_attrs=any'
# recovery_min_apply_delay = 240min  # 先注释
recovery_target_timeline = 'latest'

8.创建逻辑複制槽目录

??主库存在逻辑複制槽时需手动创建逻辑複制槽目录并scp对应文件??
此步骤有疑问，第二次恢复时发现没有创建复制逻辑复制槽目录也全备恢复成功了，但第一次恢复确实有报错逻辑复制槽不存在的问题，报错信息没有保存下来，待考证

[postgres@standbyhost ~]$ mkdir -p $PGDATA/pg_replslot/slot_name1
[postgres@standbyhost ~]$ mkdir -p $PGDATA/pg_replslot/slot_name2
[postgres@primaryhost ~]$ scp $PGDATA/pg_replslot/slot_name1/state postgres@standbyip:$PGDATA/pg_replslot/slot_name1/
[postgres@primaryhost ~]$ scp $PGDATA/pg_replslot/slot_name2/state postgres@standbyip:$PGDATA/pg_replslot/slot_name2/

9.开启数据库

开启后备库会去应用wal日志，此时psql无法进入数据库

[postgres@standbyhost ~]$ pg_ctl start
waiting for server to start....2021-05-18 12:18:39.310 IST [51822] LOG:  listening on IPv4 address "0.0.0.0", port 1922
2021-05-18 12:18:39.310 IST [51822] LOG:  listening on IPv6 address "::", port 1922
2021-05-18 12:18:39.311 IST [51822] LOG:  listening on Unix socket "/tmp/.s.PGSQL.1922"
2021-05-18 12:18:39.617 IST [51822] LOG:  redirecting log output to logging collector process
2021-05-18 12:18:39.617 IST [51822] HINT:  Future log output will appear in directory "/Data/pg_log".
........................................................... stopped waiting
pg_ctl: server did not start in time

执行以下命令可以看到数据库应用进度

[postgres@standbyhost ~]$ pg_controldata         
pg_control version number:            1002
Catalog version number:               201707211
Database system identifier:           6673658210470541307
Database cluster state:               in archive recovery
pg_control last modified:             Tue 18 May 2021 12:22:16 PM IST
Latest checkpoint location:           1C23/403FAF98
Prior checkpoint location:            1C23/61E6678
Latest checkpoint's REDO location:    1C23/24E63F38
Latest checkpoint's REDO WAL file:    0000000100001C2300000001
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:2358852717
Latest checkpoint's NextOID:          106761126
Latest checkpoint's NextMultiXactId:  3066
Latest checkpoint's NextMultiOffset:  6131
Latest checkpoint's oldestXID:        2308491367
Latest checkpoint's oldestXID's DB:   65381
Latest checkpoint's oldestActiveXID:  2358790515
Latest checkpoint's oldestMultiXid:   1656
Latest checkpoint's oldestMulti's DB: 13163
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint:            Mon 17 May 2021 06:04:51 AM IST
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     1C23/7A8DBCE0
Min recovery ending loc's timeline:   1
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    logical
wal_log_hints setting:                off
max_connections setting:              10000
max_worker_processes setting:         28
max_prepared_xacts setting:           100
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 2097152
WAL block size:                       8192
Bytes per WAL segment:                536870912
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0
Mock authentication nonce:            9f382963477103d6a9ce144b52357c91638fe60a4319c9a89a6f4a852ce7843a

以下命令也可以查看应用进度

[postgres@dr-pg pgdata]$ ps -ef|grep -i postgres
root     27365  2071  0 Sep01 ?        00:00:00 sshd: postgres [priv]
postgres 27372 27365  0 Sep01 ?        00:00:00 sshd: postgres@pts/0
root     27373  2071  0 Sep01 ?        00:00:00 sshd: postgres [priv]
postgres 27376 27373  0 Sep01 ?        00:00:00 sshd: postgres@notty
postgres 27377 27372  0 Sep01 pts/0    00:00:00 -bash
postgres 27379 27376  0 Sep01 ?        00:00:00 /usr/libexec/openssh/sftp-server
postgres 27979 27377  0 Sep01 pts/0    00:00:00 vim postgresql.conf
root     30543  2071  0 12:43 ?        00:00:00 sshd: postgres [priv]
postgres 30623 30543  0 12:43 ?        00:00:00 sshd: postgres@pts/2
root     30624  2071  0 12:43 ?        00:00:00 sshd: postgres [priv]
postgres 30704 30623  0 12:43 pts/2    00:00:00 -bash
postgres 30706 30624  0 12:43 ?        00:00:00 sshd: postgres@notty
postgres 30720 30706  0 12:43 ?        00:00:00 /usr/libexec/openssh/sftp-server
postgres 48297 30704  1 12:50 pts/2    00:00:00 ps -ef
postgres 48298 30704  0 12:50 pts/2    00:00:00 grep -i postgres
postgres 56528     1  0 Aug31 ?        00:00:00 /postgresql/pgsql_10.6/bin/postgres
postgres 56729 56528  0 Aug31 ?        00:00:00 postgres: logger process
postgres 56730 56528  3 Aug31 ?        01:43:21 postgres: startup process   recovering 00000001000022B500000005
postgres 56731 56528  0 Aug31 ?        00:22:26 postgres: checkpointer process
postgres 56732 56528  0 Aug31 ?        00:01:42 postgres: writer process
postgres 57045 56528  0 Aug31 ?        00:02:19 postgres: stats collector process
postgres 57046 56528  2 Aug31 ?        01:34:34 postgres: wal receiver process   streaming 22B9/C2FEE8A0

10.删除容灾库逻辑複制槽

# 查看逻辑複制槽
[postgres@standbyhost ~]$ psql
psql (10.6)
Type "help" for help.

postgres=# select * from pg_replication_slots;
    slot_name     |    plugin     | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin |  restart_lsn  | confirmed_flush_lsn 
------------------+---------------+-----------+--------+----------+-----------+--------+------------+------+--------------+---------------+---------------------
 slot_name1 | test_decoding | logical   |  16384 | dbname  | f         | t      |       4386 |      |   2363531321 | 1C31/7E03C300 | 1C31/7FE6CBA8
 slot_name2 | test_decoding | logical   |  16384 | dbname  | f         | t      |       5068 |      |   2363532305 | 1C31/7FA43EA0 | 1C31/808D8A00
(2 rows)
# 删除逻辑複制槽
postgres=# SELECT * FROM pg_drop_replication_slot('slot_india_kafka');
postgres=# SELECT * FROM pg_drop_replication_slot('slot_zz_kafka');
# 再次查看
postgres=# select * from pg_replication_slots;
 slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
(0 rows)

11.查看同步

同步跟上后取消recovery.conf文件里recovery_min_apply_delay = 240min的注释

postgres=# select * from pg_stat_wal_receiver;

补充：
在一次生产库容灾恢复中遇到问题，搜索很多资料并尝试，无果，看资料是主库有产生坏块，于是在无生产时对整库做了vacuum full，重新做了次全备，用vaccum full后的全备进行了恢复。

2021-08-23 13:48:48.014 IST,,,8233,,612359e6.2029,1,,2021-08-23 13:48:46 IST,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destinatio
n ""csvlog"".",,,,,,,""
2021-08-23 13:48:48.057 IST,,,8237,,612359e8.202d,1,,2021-08-23 13:48:48 IST,,0,LOG,00000,"database system was interrupted while in recovery at log time 2021-08-22 0
6:15:03 IST",,"If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.",,,,,,,""
2021-08-23 13:48:50.221 IST,,,8237,,612359e8.202d,2,,2021-08-23 13:48:48 IST,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2021-08-23 13:48:50.260 IST,,,8237,,612359e8.202d,3,,2021-08-23 13:48:48 IST,1/0,0,LOG,00000,"redo starts at 205D/876DF7D8",,,,,,,,,""
2021-08-23 13:50:50.483 IST,,,8237,,612359e8.202d,4,,2021-08-23 13:48:48 IST,1/0,0,WARNING,01000,"page 602111 of relation pg_tblspc/13051978/PG_10_201707211/16384/10
6654802 is uninitialized",,,,,,,,,""
......
2021-08-23 13:50:50.489 IST,,,8237,,612359e8.202d,736,,2021-08-23 13:48:48 IST,1/0,0,WARNING,01000,"page 600316 of relation pg_tblspc/13051978/PG_10_201707211/16384/
106654802 is uninitialized",,,,,,,,,""
2021-08-23 13:50:50.489 IST,,,8237,,612359e8.202d,737,,2021-08-23 13:48:48 IST,1/0,0,PANIC,XX000,"WAL contains references to invalid pages",,,,,,,,,""
2021-08-23 13:50:52.891 IST,,,8233,,612359e6.2029,2,,2021-08-23 13:48:46 IST,,0,LOG,00000,"startup process (PID 8237) was terminated by signal 6: Aborted",,,,,,,,,""
2021-08-23 13:50:52.891 IST,,,8233,,612359e6.2029,3,,2021-08-23 13:48:46 IST,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,""
2021-08-23 13:50:53.208 IST,,,8233,,612359e6.2029,4,,2021-08-23 13:48:46 IST,,0,LOG,00000,"database system is shut down",,,,,,,,,""

刷新页面返回顶部

Sxcan

公告