代码改变世界

案例:归档自动清理脚本失效及连带影响

2020-07-16 08:58  AlfredZhao  阅读(959)  评论(0编辑  收藏  举报

前些天给同事准备一套模拟环境用于测试一个OGG问题:
环境架构:Oracle 11.2.0.4 RAC + 单实例11.2.0.4 ADG(同时作为OGG源端,OGG版本19.1.0.0.4) + 单实例19.3多租户(其中1个PDB作为OGG目标端,OGG版本19.1.0.0.4)
现象概述:发现OGG进程abended,原因是主库归档满,但是实际已配置归档自动清理脚本(归档空间使用大于90%时清理),进一步查看发现根源是归档清理失效,报错RMAN-08137,导致的影响有很多,首先主库无法进行测试数据写入,其次ADG备库产生延迟,然后OGG源端抽取进程因超时报错OGG-02149导致abended..

1.故障现象:归档清理报错RMAN-08137

自动归档清理失效,报错RMAN-08137,手工清理现象一样:

RMAN> DELETE NOPROMPT ARCHIVELOG ALL COMPLETED BEFORE 'SYSDATE-1/24';

RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+FRA/crmdb/archivelog/2020_07_08/thread_1_seq_422.426.1045198149 thread=1 sequence=422
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+FRA/crmdb/archivelog/2020_07_08/thread_1_seq_423.424.1045198157 thread=1 sequence=423
...

使用oerr查看RMAN-08137的描述:

[oracle@jystdrac1 logs]$ oerr rman 8137
8137, 3, "WARNING: archived log not deleted, needed for standby or upstream capture process"
// *Cause: An archived log that should have been deleted was not as it was
//         required by upstream capture process or Data Guard.
//         The next message identifies the archived log.
// *Action: This is an informational message. The archived log can be
//          deleted after it is no longer needed.  See the
//          documentation for Data Guard to alter the set of active
//          Data Guard destinations.  See the documentation for
//          Streams to alter the set of active streams.

查看rman的设置,未发现特殊设置:

RMAN> show all;

using target database control file instead of recovery catalog
RMAN configuration parameters for database with db_unique_name CRMDB are:
CONFIGURE RETENTION POLICY TO REDUNDANCY 1; # default
CONFIGURE BACKUP OPTIMIZATION OFF; # default
CONFIGURE DEFAULT DEVICE TYPE TO DISK; # default
CONFIGURE CONTROLFILE AUTOBACKUP OFF; # default
CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '%F'; # default
CONFIGURE DEVICE TYPE DISK PARALLELISM 1 BACKUP TYPE TO BACKUPSET; # default
CONFIGURE DATAFILE BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default
CONFIGURE ARCHIVELOG BACKUP COPIES FOR DEVICE TYPE DISK TO 1; # default
CONFIGURE MAXSETSIZE TO UNLIMITED; # default
CONFIGURE ENCRYPTION FOR DATABASE OFF; # default
CONFIGURE ENCRYPTION ALGORITHM 'AES128'; # default
CONFIGURE COMPRESSION ALGORITHM 'BASIC' AS OF RELEASE 'DEFAULT' OPTIMIZE FOR LOAD TRUE ; # default
CONFIGURE ARCHIVELOG DELETION POLICY TO NONE; # default
CONFIGURE SNAPSHOT CONTROLFILE NAME TO '+DATA/crmdb/snapcf_crmdb.f';

2.解决方案:设置"_deferred_log_dest_is_valid"参数

进一步搜索查询,匹配到MOS:
  • RMAN-08137 on Primary Database although Archive Destination to Standby is deferred (Doc ID 1380368.1)

给出的原因和解决方案引用如下:

CAUSE
If we defer an Archive Destination to a Standby Database, the Primary Database will still consider the Standby Database as existing but temporary unavailable eg. for Maintenance. This can happen if you stop Log Transport Services from the Data Guard Broker or manually defer the State for the Archive Destination.

SOLUTION
As long as the Archive Destination (log_archive_dest_n) is still set, we consider the Standby Database as still existing and preserve the ArchiveLogs on the Primary Database to perform Gap Resolution when the Archive Destination is valid again.
There are Situations when this is not wanted, eg. the Standby Database was activated or removed but you still keep the Archive Destination because you want to rebuild the Standby Database later again. In this Case you can set the hidden Parameter "_deferred_log_dest_is_valid" to FALSE (default TRUE) which will consider deferred Archive Destinations as completely unavailable and will not preserve ArchiveLogs for those Destinations any more. It is a dynamic Parameter and can be set this Way:

SQL> alter system set "_deferred_log_dest_is_valid" = FALSE scope=both;

NOTE: This Parameter has been introduced with Oracle Database 11.2.0.x. In earlier Versions you have to unset the log_archive_dest_n-Parameter pointing to the remote Standby Database to make the Primary Database considering it as completely unavailable. There also exists a Patch on Top of 11.1.0.7 for some Platforms to include this Parameter in 11.1.0.7, too. This is Patch Number 8468117.

上面的描述很清楚了,实际结合当前环境,发现确实是有其他log_archive_dest_n设置为defer,而这些我们实际不再用了,要么彻底清除,要么按照MOS设置隐藏参数,我们先查下这个隐藏参数的当前默认设置,发现是Ture:

NAME                                DESCRIPTION                                                        VALUE
----------------------------------- ------------------------------------------------------------------ ------------------------------
_deferred_log_dest_is_valid         consider deferred log dest as valid for log deletion (TRUE/FALSE)  TRUE

这是一个动态的隐藏参数,可以直接修改为FALSE:

alter system set "_deferred_log_dest_is_valid" = FALSE;

--再次查询,已经修改成功:
NAME                                DESCRIPTION                                                        VALUE
----------------------------------- ------------------------------------------------------------------ ------------------------------
_deferred_log_dest_is_valid         consider deferred log dest as valid for log deletion (TRUE/FALSE)  FALSE

3.恢复故障:归档清理恢复正常,ADG同步正常,OGG进程启动正常

3.1 归档清理恢复正常
上面设置隐藏参数之后,就可以正常删除归档了:

RMAN> DELETE NOPROMPT ARCHIVELOG ALL COMPLETED BEFORE 'SYSDATE-1/24';
...
archived log file name=+FRA/crmdb/archivelog/2020_07_09/thread_2_seq_399.462.1045343023 RECID=1655 STAMP=1045343047
deleted archived log
archived log file name=+FRA/crmdb/archivelog/2020_07_09/thread_2_seq_400.396.1045343069 RECID=1657 STAMP=1045343099
deleted archived log
archived log file name=+FRA/crmdb/archivelog/2020_07_09/thread_2_seq_401.404.1045343111 RECID=1661 STAMP=1045343151
Deleted 87 objects

3.2 ADG同步恢复正常
主库切下日志,再看ADG同步状态,最终恢复正常:

SQL> @dg

NAME                           VALUE                          UNIT                           TIME_COMPUTED                  DATUM_TIME
------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------
transport lag                  +06 03:04:01                   day(2) to second(0) interval   07/16/2020 00:10:08            07/16/2020 00:10:08
apply lag                      +06 03:04:01                   day(2) to second(0) interval   07/16/2020 00:10:08            07/16/2020 00:10:08
apply finish time              +00 00:00:04.358               day(2) to second(3) interval   07/16/2020 00:10:08
estimated startup time         65                             second                         07/16/2020 00:10:08

SQL> /

NAME                           VALUE                          UNIT                           TIME_COMPUTED                  DATUM_TIME
------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------
transport lag                  +00 00:00:00                   day(2) to second(0) interval   07/16/2020 00:10:16            07/16/2020 00:10:15
apply lag                      +00 00:00:00                   day(2) to second(0) interval   07/16/2020 00:10:16            07/16/2020 00:10:15
apply finish time              +00 00:00:00.000               day(2) to second(3) interval   07/16/2020 00:10:16
estimated startup time         65                             second                         07/16/2020 00:10:16

3.3 OGG进程启动正常
再将OGG的进程手工启动,恢复正常:

GGSCI (test03) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
EXTRACT     RUNNING     DPAUDDG     00:00:00      00:00:10    
EXTRACT     ABENDED     EXTAUDDG    00:00:00      146:59:54 
GGSCI (test03) 2> view report EXTAUDDG
...
2020-07-09 21:11:37  ERROR   OGG-02149  Standby database has made no progress for more than 30,000 seconds.

GGSCI (test03) 3> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
EXTRACT     RUNNING     DPAUDDG     00:00:00      00:00:02    
EXTRACT     ABENDED     EXTAUDDG    00:00:00      147:01:07   


GGSCI (test03) 4> start EXTAUDDG

Sending START request to MANAGER ...
EXTRACT EXTAUDDG starting


GGSCI (test03) 5> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
EXTRACT     RUNNING     DPAUDDG     00:00:00      00:00:00    
EXTRACT     RUNNING     EXTAUDDG    00:00:00      00:00:00    

--target ogg ok!
GGSCI (db19) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING                                           
REPLICAT    RUNNING     REPAUD1A    00:00:00      00:00:05 

至此,整个架构涉及到的所有环境均恢复正常。