记录近期修复Dataguard的过程
admin
2024-02-06 01:34:20

参考文档:
Bug 8656212 - Alert log with message "Warning: recovery process XXXX cannot use async I/O" (Doc ID 8656212.8)
https://docs.oracle.com/en/database/oracle/oracle-database/19/refrn/V-MANAGED_STANDBY.html

RDBMS 11.2.0.4 的主备库。
发生网络或者存储方面的故障后,DG出现一些问题。


问题1 
主库RAC,备库RAC,主备库不同步,主库上不再显示LNS服务进程相关的信息。也就是主库不再发送日志到备库。

解决方法:
重启备库RAC,重启后,主库的alert log中提示'LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2'
随后,继续传输到备库,备库media recover 后正常。这个比较奇怪,什么原因,不是很清楚。

Beginning log switch checkpoint up to RBA [0x21db2.2.10], SCN: 6255905417984
Thread 1 advanced to log sequence 138674 (LGWR switch)Current log# 1 seq# 138674 mem# 0: +DATA_ABC/prod/onlinelog/group_1.296.123456Current log# 1 seq# 138674 mem# 1: +RECO_DEF/prod/onlinelog/group_1.273.123456
Mon Nov 21 09:48:08 2022
******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2
******************************************************************
LNS: Standby redo logfile selected for thread 1 sequence 138674 for destination LOG_ARCHIVE_DEST_2
Mon Nov 21 09:48:20 2022
Archived Log entry 500862 added for thread 1 sequence 138673 ID 0x14184331 dest 1:
Mon Nov 21 09:50:08 2022
FAL[server, ARC0]: FAL archive failed, see trace file.
ARCH: FAL archive failed. Archiver continuing
ORACLE Instance PROD1 - Archival Error. Archiver continuing.
Mon Nov 21 09:51:04 2022

问题2 
主库RAC,备库RAC ,主备库不同步,主库上有部分日志没有发送过去,逐个手工copy到备库注册后,可以media recover。但是发现主库上切换了日志后,几乎备库接收不到。
重启备库的MRP进程后,提示'ORA-27090: 无法为异步磁盘 I/O 保留内核资源 '等告警,如下:

Mon Nov 21 10:24:32 2022
Errors in file /u01/ABCDG/oracle/diag/rdbms/ABCDG/ab/trace/ab_pr0m_26593.trc:
ORA-27090: 无法为异步磁盘 I/O 保留内核资源 
Additional information: 3
Additional information: 128
Additional information: 166549176
Warning: recovery process PR0M cannot use async I/O
Mon Nov 21 10:24:32 2022
Errors in file /u01/ABCDG/oracle/diag/rdbms/ABCDG/ab/trace/ab_pr0n_26595.trc:
ORA-27090: 无法为异步磁盘 I/O 保留内核资源 
Additional information: 3
Additional information: 128
Additional information: 166549176
Warning: recovery process PR0N cannot use async I/O
Mon Nov 21 10:24:32 2022

继续查看对应的trc文件,如下:(可以看到KCBR: Number of read descriptors = 32)

******************* End of process map dump ************
----- Process Resource Limits -----
***************** Dumping Resource Limits(s/h) *****************
core file size                         0 KB/UNLIMITED
data seg size                     UNLIMITED/UNLIMITED
file size                         UNLIMITED/UNLIMITED
pending signals                     2010 KB/2010 KB
max locked memory                    238 GB/238 GB
max memory size                   UNLIMITED/UNLIMITED
open files                            64 KB/64 KB
POSIX message queues                 800 KB/800 KB
stack size                            32 MB/UNLIMITED
cpu time                          UNLIMITED/UNLIMITED
max user processes                    16 KB/16 KB
virtual memory                    UNLIMITED/UNLIMITED
file locks                        UNLIMITED/UNLIMITED
***************** End of Resource Limits Dump ******************
----- Java Stack -----
----- VKTM Time Drifts Circular Buffer -----
Warning: recovery process PR0M cannot use async I/O
KCBR: Number of read descriptors = 32*** 2022-11-21 14:10:50.276
KCBR: Media recovery blocks read (SYNC) = 180651
KCBR: Influx buffers flushed = 10145 times
KCBR: Redo cache copies/changes = 1100946/1100943

查看归档日志情况,发现MRP0的状态是WAIT_FOR_LOG

SYS@ab>select process,status,thread#,sequence# from v$managed_standby order by 3,12  ;PROCESS            STATUS                      THREAD#  SEQUENCE#
------------------ ------------------------ ---------- ----------
RFS                IDLE                              0          0
RFS                IDLE                              0          0
RFS                IDLE                              0          0
RFS                IDLE                              0          0
RFS                IDLE                              0          0
RFS                IDLE                              0          0
ARCH               CLOSING                           1     429441
ARCH               CLOSING                           1     418157
ARCH               CLOSING                           1     429440
ARCH               CLOSING                           1     429437
RFS                IDLE                              1     429442PROCESS            STATUS                      THREAD#  SEQUENCE#
------------------ ------------------------ ---------- ----------
MRP0               WAIT_FOR_LOG                      2     38156612 rows selected.

根据官方文档中对v$managed_standby的解释:
WAIT_FOR_LOG - Process is waiting for the archived redo log to be completed  (进程等待归档redo完成)
所以,就发生了归档日志一直传不过来,或者归档日志一直不应用。

解决方法:
根据MOS文档Bug 8656212 - Alert log with message "Warning: recovery process XXXX cannot use async I/O" (Doc ID 8656212.8)
设置alter system set "_media_recovery_read_batch" = 64 sid ='*' scope = spfile;

以下为MOS中的描述,从MOS中的描述,可以看到在某些情况下同步和异步IO会转换 
Description
Alert log can show the next message during recovery as the recovery processes
may not use ASYNC IO:
 
"Warning: recovery process XXXX cannot use async I/O"
 
Note that to confirm that async I/O is being used and in the case of 
recovery in a standby database, look for the following trace line in 
the MRP0 trace file after managed recovery has been started and stopped 
(the trace is printed upon completion of a recovery session):
 
Example:
 
Read rate (ASYNC): 11070353Kb in 123.79s => 87.33 Mb/sec
 
If sync I/O then SYNC will be displayed instead of ASYNC.
 
Workaround
 
Set _media_recovery_read_batch to 64 or 100    


设置完毕后,重启数据库,打开MRP进程,归档日志的传输和应用正常,查询v$managed_standby结果为APPLYING_LOG


问题3 
主库RAC,备库RAC,主库的节点1可以传输归档日志到备库,节点2开始报以下错误。节点2无法传送日志到备库。同时在主库上ping备库,是没有问题的。

Error 12514 received logging on to the standby
PING[ARC1]: Heartbeat failed to connect to standby 'mno'. Error is 12514.
Mon Nov 21 10:20:53 2022
Error 1017 received logging on to the standby
------------------------------------------------------------
Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE, 
and that the SYS password is same in the password files.returning error ORA-16191
------------------------------------------------------------
PING[ARC1]: Heartbeat failed to connect to standby 'mno'. Error is 16191.
Mon Nov 21 10:21:54 2022
Error 1017 received logging on to the standby
------------------------------------------------------------

重启MRP进程后,也报异步IO方面的错误。

Errors in file /u01/AAADG/db/diag/rdbms/mno/mno1/trace/mno1_pr0t_5157.trc:
ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
Additional information: 3
Additional information: 128
Additional information: 166549176
Warning: recovery process PR0T cannot use async I/O
Mon Nov 21 11:27:07 2022
Errors in file /u01/AAADG/db/diag/rdbms/mno/mno1/trace/mno1_pr0u_5159.trc:
ORA-27090: Unable to reserve kernel resources for asynchronous disk I/O
Additional information: 3
Additional information: 128
Additional information: 166549176
Warning: recovery process PR0U cannot use async I/O

通过set "_media_recovery_read_batch" = 64 后解决异步IO的问题,随后启动数据库和MRP进程。alert 中提示如下:

通过以下alert log中查看,貌似密码文件问题(比较奇怪的是,数据库运行了很久了,重启后出现这个问题)

------------------------------------------------------------
Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE, 
and that the SYS password is same in the password files.returning error ORA-16191
------------------------------------------------------------
Suppressing further error logging of LOG_ARCHIVE_DEST_2.
Mon Nov 21 14:31:49 2022
Thread 2 advanced to log sequence 1420 (LGWR switch)Current log# 4 seq# 1420 mem# 0: +DATA/prod/redo04.log
Mon Nov 21 14:31:49 2022
Archived Log entry 4772 added for thread 2 sequence 1419 ID 0x1da8a7e5 dest 1:
Mon Nov 21 14:31:50 2022
Error 1017 received logging on to the standby
------------------------------------------------------------
Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE, 
and that the SYS password is same in the password files.returning error ORA-16191
------------------------------------------------------------
Suppressing further error logging of LOG_ARCHIVE_DEST_2.
FAL[server, ARC3]: FAL archive failed, see trace file.
ARCH: FAL archive failed. Archiver continuing
ORACLE Instance prod2 - Archival Error. Archiver continuing.
^C

从arc3进程中看,貌似节点2登录备库拒绝 

[oracle@AAAdb2 trace]$ ls -lrt *arc3*Initial buffer sizes: read 1024K, overflow 832K, change 805K
OCISessionBegin failed -1*** 2022-11-21 00:21:56.100
.. Detailed OCI error val is 1017 and errmsg is 'ORA-01017: invalid username/password; logon denied
'
OCISessionBegin failed. Error -1
.. Detailed OCI error val is 1017 and errmsg is 'ORA-01017: invalid username/password; logon denied
'*** 2022-11-21 00:21:57.136
OCISessionBegin failed. Error -1
.. Detailed OCI error val is 1017 and errmsg is 'ORA-01017: invalid username/password; logon denied
'
OCISessionBegin failed. Error -1
.. Detailed OCI error val is 1017 and errmsg is 'ORA-01017: invalid username/password; logon denied
'
*** 2022-11-21 00:21:57.158 4329 krsh.c
Error 1017 received logging on to the standby
------------------------------------------------------------
Check that the primary and standby are using a password fi

查看节点1和节点2上的密码文件,发现节点2上只有一个密码文件,随后,将节点1上的密码文件copy到节点1上。

[oracle@AAAdb2 dbs]$ ls -lrt orapw*
-rw-r----- 1 oracle oinstall 1536 Mar  2  2022 orapwprod2   --- 随后修改的这个文件 
[oracle@AAAdb2 dbs]$ [oracle@AAAdb1 dbs]$ ls -lrt orapw*
-rw-r----- 1 oracle oinstall 1536 Mar  2  2022 orapwprod1
-rw-r----- 1 oracle oinstall 1536 Mar  3  2022 orapwprod2
[oracle@AAAdb1 dbs]$ 

密码文件拷贝后,alert log中立刻显示可以发送日志了

Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE, 
and that the SYS password is same in the password files.returning error ORA-16191
------------------------------------------------------------
Mon Nov 21 15:05:01 2022
Error 1017 received logging on to the standby
------------------------------------------------------------
Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE, 
and that the SYS password is same in the password files.returning error ORA-16191
------------------------------------------------------------
Mon Nov 21 15:06:02 2022
Error 1017 received logging on to the standby
------------------------------------------------------------
Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE, 
and that the SYS password is same in the password files.returning error ORA-16191
------------------------------------------------------------
Mon Nov 21 15:07:05 2022
Thread 2 advanced to log sequence 1421 (LGWR switch)Current log# 3 seq# 1421 mem# 0: +DATA/prod/redo03.log
Mon Nov 21 15:07:05 2022
LNS: Standby redo logfile selected for thread 2 sequence 1421 for destination LOG_ARCHIVE_DEST_2
Mon Nov 21 15:07:05 2022
Archived Log entry 4774 added for thread 2 sequence 1420 ID 0x1da8a7e5 dest 1:
Mon Nov 21 15:07:05 2022
ARC3: Standby redo logfile selected for thread 2 sequence 1420 for destination LOG_ARCHIVE_DEST_2

总结:
前2个问题比较奇怪,通过业务了解,确实是底层的存储出了一些问题。
最后一个密码问题,也比较奇怪,密码不一致,运行了很久才暴露。

END 

相关内容

热门资讯

南京商旅取消收购关联资产 终止...   在经历了长达一年半的筹划后,南京商旅于 12 月 19 日晚间突然宣布,决定终止以发行股份及支付...
智改数转 向新而进!福州新区(...   12月26日,福州市召开“绘就‘十五五’发展新蓝图 奋勇争先推进中国式现代化福州实践”系列主题新...
喜欢一个人,就像喜欢富士山 如... 1、别再隐身了,我不会打扰你。 2、明知道天要下雨就该带把伞,明知道不会有结果就请别开始。 ...
优酷CEO古永锵经典语录 古永... 1、一家公司,两个品牌,两个网站,两个销售体系,在合作,开放,尊重,包容的团队氛围下达到双赢,一起实...
《悬崖》经典语录 饱食终日会让... 1、能留下来的才是最好的。 2、轻视对手的人,一定会付出代价。要知道在这场博弈里,没有弱者。 ...