Dameng (DM) Data Watch cluster: standby crashed unexpectedly and the primary was briefly SUSPENDed

Symptoms:
2026-05-07 02:37:56  Primary: instance status changed to SUSPEND; it recovered automatically at 02:39:11.
2026-05-07 02:40:04  Primary: remote synchronous archive status changed to INVALID.
2026-05-07 06:06:38  Standby: SHUTDOWN ABORT, followed by an automatic restart.
2026-05-07 06:10:30  Primary: remote synchronous archive status returned to VALID automatically.

Impact:
2026-05-07 02:37:56 to 02:39:11: the primary instance was in SUSPEND status, which blocks normal data writes.
2026-05-07 02:40:04 to 06:10: data synchronization between the primary and the standby was interrupted.

Analysis:
The following were reviewed:
Primary and standby nodes: instance logs, watcher (dmwatcher) logs, dmap job logs, core files, and so on.
Monitor node: monitor logs.

Primary instance log: dm_CJCDB01_202605.log

Key entries:
2026-05-07 02:37:55.987 [WARNING] check rarch_mpp_real_time timeout overtime 900s
2026-05-07 02:37:59.779 [INFO] utsk_cmd_exec-utsk_set_arch_fail_invalid, dseq[1775732768], code[0] sys_status:SUSPEND!
2026-05-07 02:37:59.780 [INFO] utsk_cmd_add, received sql exec cmd:1, dseq:1775732769, sql:ALTER DATABASE OPEN FORCE
2026-05-07 02:37:59.780 [INFO] pha_altdb_open_for_dw alter database open success!

Full log:
2026-05-07 02:37:55.987 [WARNING] database P0000028986 T0000000000000029046 check rarch_mpp_real_time timeout overtime 900s
2026-05-07 02:37:55.987 [WARNING] database P0000028986 T0000000000000029046 mal_site_port_close_by_id mal_seqno[1]
2026-05-07 02:37:55.987 [WARNING] database P0000028986 T0000000000000029046 MAL receive site(0) lost connect to site(1), ctl_handle(12), data_handle(13), dsc_handle(0)
2026-05-07 02:37:55.987 [WARNING] database P0000028986 T0000000000000029028 mal_site_tsk_check site(0) connect lost to site(1), socket handle 0, mal sys status 0, try get port again
2026-05-07 02:37:55.987 [WARNING] database P0000028986 T0000000000000029029 mal_site_letter_recv code-6007, errno0, site(0) recv from site(1) failed, socket handle 0
2026-05-07 02:37:55.987 [INFO] database P0000028986 T0000000000000029028 send CMD_MAL_LINK_CHECK(350): (mal_id:1833027003, stmt_id:5029994, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:0, dest_site:1, build_time:-1)
2026-05-07 02:37:55.987 [WARNING] database P0000028986 T0000000000000029027 mal_site_tsk_check site(0) connect lost to site(1), socket handle 0, mal sys status 0, try get port again
2026-05-07 02:37:55.987 [WARNING] database P0000028986 T0000000000000029029 MAL receive site(0) lost connect to site(1), ctl_handle(0), data_handle(0), dsc_handle(0)
2026-05-07 02:37:55.987 [WARNING] database P0000028986 T0000000000000029030 mal_site_letter_recv code-6007, errno0, site(0) recv from site(1) failed, socket handle 0
2026-05-07 02:37:55.987 [WARNING] database P0000028986 T0000000000000029030 MAL receive site(0) lost connect to site(1), ctl_handle(0), data_handle(0), dsc_handle(0)
2026-05-07 02:37:55.987 [ERROR] database P0000028986 T0000000000000029046 self_site(0) to dest_site(1) port_closed, return EC_CONNECT_LOST
2026-05-07 02:37:55.987 [ERROR] database P0000028986 T0000000000000029046 [mal recv for arch] mal receive from site(CJCDB02) failed, begin lsn:353845701, end lsn:353845705, code:-6021
2026-05-07 02:37:55.987 [WARNING] database P0000028986 T0000000000000029046 check rarch_mpp_real_time timeout overtime 900s
2026-05-07 02:37:55.987 [ERROR] database P0000028986 T0000000000000029046 send realtime archive to instance[CJCDB02] failed, code -6021, begin_lsn 353845701, end_lsn 353845705!
2026-05-07 02:37:55.988 [INFO] database P0000028986 T0000000000000029046 rlog4_process_arch_failed, need_suspend:1
2026-05-07 02:37:55.988 [WARNING] database P0000028986 T0000000000000029029 site(0) ctl_link mal_site_letter_recv from site(1) failed, socket handle 0, mal sys status is 0, try to get mal_port again
2026-05-07 02:37:55.988 [WARNING] database P0000028986 T0000000000000029030 site(0) data_link mal_site_letter_recv from site(1) failed, socket handle 0, mal sys status is 0, try to get mal_port again
2026-05-07 02:37:55.997 [INFO] database P0000028986 T0000000000000029046 Send archive log to remote instance failed, switch all ep to SUSPEND status success!
2026-05-07 02:37:56.262 [INFO] database P0000028986 T0000000000001473688 trx_view_mode: 1, trxid: 842404417, cmtarr_cmtseq: 4683743612465315840, next_id: 140737488355327, min_active_id:842037400, snap_cmtseq: 59532536, table(interface): am_node_node, nrec_trxid:842037522, code: -7120, file: /home/dmops/build/svns/1703094217743/op/cscn2.c, line: 13582
2026-05-07 02:37:57.755 [INFO] database P0000028986 T0000000000001482402 trx_view_mode: 1, trxid: 842404427, cmtarr_cmtseq: 5404319552844595200, next_id: 140737488355327, min_active_id:842037400, snap_cmtseq: 59532546, table(interface): tbl_devbaseinfo, nrec_trxid:842038283, code: -7120, file: /home/dmops/build/svns/1703094217743/op/cscn2.c, line: 13582
2026-05-07 02:37:57.866 [INFO] database P0000028986 T0000000000001473688 trx_view_mode: 1, trxid: 842404429, cmtarr_cmtseq: 5548434740920451072, next_id: 140737488355327, min_active_id:842037400, snap_cmtseq: 59532548, table(interface): am_node_node, nrec_trxid:842037522, code: -7120, file: /home/dmops/build/svns/1703094217743/op/cscn2.c, line: 13582
2026-05-07 02:37:58.055 [INFO] database P0000028986 T0000000000001466850 trx_view_mode: 1, trxid: 842404431, cmtarr_cmtseq: 5692549928996306944, next_id: 140737488355327, min_active_id:842037400, snap_cmtseq: 59532550, table(interface): am_node_node, nrec_trxid:842037522, code: -7120, file: /home/dmops/build/svns/1703094217743/op/cscn2.c, line: 13582
2026-05-07 02:37:58.058 [INFO] database P0000028986 T0000000000001473683 trx_view_mode: 1, trxid: 842404433, cmtarr_cmtseq: 5836665117072162816, next_id: 140737488355327, min_active_id:842037400, snap_cmtseq: 59532552, table(interface): am_node_node, nrec_trxid:842037522, code: -7120, file: /home/dmops/build/svns/1703094217743/op/cscn2.c, line: 13582
2026-05-07 02:37:58.618 [INFO] database P0000028986 T0000000000001482588 trx_view_mode: 1, trxid: 842404438, cmtarr_cmtseq: 6196953087261802496, next_id: 140737488355327, min_active_id:842037400, snap_cmtseq: 59532557, table(interface): tbl_devbaseinfo, nrec_trxid:842038283, code: -7120, file: /home/dmops/build/svns/1703094217743/op/cscn2.c, line: 13582
2026-05-07 02:37:58.636 [INFO] database P0000028986 T0000000000001482590 trx_view_mode: 1, trxid: 842404440, cmtarr_cmtseq: 6341068275337658368, next_id: 140737488355327, min_active_id:842037400, snap_cmtseq: 59532559, table(interface): tbl_devbaseinfo, nrec_trxid:842038283, code: -7120, file: /home/dmops/build/svns/1703094217743/op/cscn2.c, line: 13582
2026-05-07 02:37:58.778 [INFO] database P0000028986 T0000000000000029092 utsk_cmd_add, cmd info: cmd217, dseq1775732767, name_in, begin_lsn-1!
2026-05-07 02:37:58.778 [INFO] database P0000028986 T0000000000000029092 utsk_set_global_dw_stat, begin, msg_dseq:1775732767
2026-05-07 02:37:58.778 [INFO] database P0000028986 T0000000000000029092 set g_dw_stat from NONE to DW_FAILOVER success, g_dw_recover_stop is 0
2026-05-07 02:37:58.778 [INFO] database P0000028986 T0000000000000029092 utsk_set_global_dw_stat, finished, msg_dseq:1775732767, set code:0
2026-05-07 02:37:59.779 [INFO] database P0000028986 T0000000000000029092 utsk_cmd_add, cmd info: cmd214, dseq1775732768, name_in, begin_lsn-1!
2026-05-07 02:37:59.779 [INFO] database P0000028986 T0000000000000029092 Change CJCDB02 arch status from VALID to INVALID, arch_type[REALTIME]
2026-05-07 02:37:59.779 [INFO] database P0000028986 T0000000000000029092 utsk_cmd_exec-utsk_set_arch_fail_invalid, dseq[1775732768], code[0] sys_status:SUSPEND!
2026-05-07 02:37:59.780 [INFO] database P0000028986 T0000000000000029092 utsk_cmd_add, received sql exec cmd:1, dseq:1775732769, sql:ALTER DATABASE OPEN FORCE
2026-05-07 02:37:59.780 [INFO] database P0000028986 T0000000000000029092 utsk_cmd_add, cmd info: cmd1, dseq1775732769, name_in, begin_lsn-1!
2026-05-07 02:37:59.780 [INFO] database P0000028986 T0000000000000029092 pha_altdb_open_for_dw alter database open start...
2026-05-07 02:37:59.780 [INFO] database P0000028986 T0000000000000029092 pha_altdb_open_for_dw, altdb set changing end!
2026-05-07 02:37:59.780 [INFO] database P0000028986 T0000000000000029092 altdb open rlog_flush_notify_ex start!
2026-05-07 02:37:59.780 [INFO] database P0000028986 T0000000000000029092 altdb open rlog_flush_notify_ex end!
2026-05-07 02:37:59.780 [INFO] database P0000028986 T0000000000000029092 pha_altdb_open_for_dw alter database open success!

Standby instance log: dmsql_CJCDB02_20260507_061031.log

Key entries:
2026-05-07 02:22:55.904 [INFO] database P0001051914 T0000000000001052493 rapply_redos_wait, redos_buf_used 17305600, redos_buf_num 4097, total_wait_cnt 0, need wait the existing tasks to be applied firstly...
2026-05-07 06:06:38.916 [FATAL] database P0001051914 T0000000000002002352 [for dem]SYSTEM SHUTDOWN ABORT.
2026-05-07 06:10:22.003 [INFO] database P0002728524 T0000000000002728524 DM Database Server 64 V8 03134284XXX-20XX1220-212XX1-200XX startup...

Full log:
2026-05-07 02:22:28.670 [INFO] database P0001051914 T0000000000001052493 utsk_set_rapply_stat_flag failed, wait_flag:0
2026-05-07 02:22:34.675 [INFO] database P0001051914 T0000000000001052491 utsk_set_rapply_stat_flag failed, wait_flag:0
2026-05-07 02:22:55.904 [INFO] database P0001051914 T0000000000001052493 rapply_redos_wait, redos_buf_used 17305600, redos_buf_num 4097, total_wait_cnt 0, need wait the existing tasks to be applied firstly...
2026-05-07 02:23:28.715 [INFO] database P0001051914 T0000000000001052491 utsk_set_rapply_stat_flag failed, wait_flag:0
2026-05-07 02:23:40.724 [INFO] database P0001051914 T0000000000001052492 utsk_set_rapply_stat_flag failed, wait_flag:0
2026-05-07 02:23:46.728 [INFO] database P0001051914 T0000000000001052491 utsk_set_rapply_stat_flag failed, wait_flag:0
......
2026-05-07 02:37:55.987 [WARNING] database P0001051914 T0000000000001052231 mal_site_letter_recv code-6007, errno107, site(1) recv from site(0) failed, socket handle 299
2026-05-07 02:37:55.987 [WARNING] database P0001051914 T0000000000001052232 mal_site_letter_recv code-6007, errno0, site(1) recv from site(0) failed, socket handle 319
2026-05-07 02:37:55.987 [WARNING] database P0001051914 T0000000000001052231 MAL receive site(1) lost connect to site(0), ctl_handle(299), data_handle(319), dsc_handle(0)
2026-05-07 02:37:55.987 [WARNING] database P0001051914 T0000000000001052230 mal_site_tsk_check site(1) connect lost to site(0), socket handle 0, mal sys status 0, try get port again
2026-05-07 02:37:55.987 [WARNING] database P0001051914 T0000000000001052232 MAL receive site(1) lost connect to site(0), ctl_handle(0), data_handle(0), dsc_handle(0)
2026-05-07 02:37:55.987 [WARNING] database P0001051914 T0000000000001052229 mal_site_tsk_check site(1) connect lost to site(0), socket handle 0, mal sys status 0, try get port again
2026-05-07 02:37:55.987 [INFO] database P0001051914 T0000000000001052230 send CMD_MAL_LINK_CHECK(350): (mal_id:1833027003, stmt_id:5029997, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:0, dest_site:0, build_time:-1)
2026-05-07 02:37:55.987 [WARNING] database P0001051914 T0000000000001052231 site(1) ctl_link mal_site_letter_recv from site(0) failed, socket handle 0, mal sys status is 0, try to get mal_port again
2026-05-07 02:37:55.987 [WARNING] database P0001051914 T0000000000001052232 site(1) data_link mal_site_letter_recv from site(0) failed, socket handle 0, mal sys status is 0, try to get mal_port again
2026-05-07 02:37:55.987 [INFO] database P0001051914 T0000000000001052229 send CMD_MAL_LINK_CHECK(350): (mal_id:1833027003, stmt_id:5029998, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:0, dest_site:0, build_time:26822192)
2026-05-07 02:38:26.821 [INFO] database P0001051914 T0000000000001052237 site[1] mal_site_ctl_port_set from site[0, IP: 192.168.0.101, port_num: 15339], socket handle 57, site_magic 44902, link_seq 7
2026-05-07 02:38:26.821 [INFO] database P0001051914 T0000000000001052230 mal_site_port_get site_magic:44902, src_site:1, dst_site:0
2026-05-07 02:38:26.821 [INFO] database P0001051914 T0000000000001052231 mal_site_port_get site_magic:44902, src_site:1, dst_site:0
2026-05-07 02:38:26.822 [INFO] database P0001051914 T0000000000001052237 site[1] mal_site_data_port_set to site[0, IP: 192.168.0.101, port_num: 15339], socket handle 58, site_magic 44902, link_seq 7
2026-05-07 02:38:26.822 [INFO] database P0001051914 T0000000000001052229 mal_site_port_get site_magic:44902, src_site:1, dst_site:0
2026-05-07 02:38:26.822 [INFO] database P0001051914 T0000000000001052232 mal_site_port_get site_magic:44902, src_site:1, dst_site:0
......
2026-05-07 06:06:38.916 [FATAL] database P0001051914 T0000000000002002352 [for dem]SYSTEM SHUTDOWN ABORT.
2026-05-07 06:06:38.917 [FATAL] database P0001051914 T0000000000002002352 standby instance rrec_redos_lock_xxx failed, halt now
2026-05-07 06:06:38.917 [FATAL] database P0001051914 T0000000000002002352 code -6403, dm_sys_halt now!!!
......
2026-05-07 06:10:22.003 [INFO] database P0002728524 T0000000000002728524 DM Database Server 64 V8 03134284XXX-20XX1220-212XX1-200XX startup...

Log analysis:
At 2026-05-07 02:22:55.904 the standby logged the following message:
rapply_redos_wait, redos_buf_used 17305600, redos_buf_num 4097, total_wait_cnt 0, need wait the existing tasks to be applied firstly...

REDOS_BUF_SIZE limits the memory that redo logs waiting to be replayed may occupy on the standby. It is configured as 1024 MB; actual usage was 17305600 bytes (about 16.5 MB), far below the limit.
REDOS_BUF_NUM limits the number of redo log buffers allowed to queue up for replay on the standby. It is configured as 4096; the actual count reached 4097, which is what triggered the message above.

Parameter reference: https://eco.dameng.com/document/dm/zh-cn/pm/configuration-description

Relevant database parameters:
SQL> show parameter redos_buf_size
LINEID     para_name      para_value
---------- -------------- ----------
1          REDOS_BUF_SIZE 1024

SQL> show parameter redos_buf_num
LINEID     para_name     para_value
---------- ------------- ----------
1          REDOS_BUF_NUM 4096

SQL> SHOW PARAMETER REDOS_PARALLEL_NUM
LINEID     para_name          para_value
---------- ------------------ ----------
1          REDOS_PARALLEL_NUM 1

SQL> show parameter buffer_pools
LINEID     para_name         para_value
---------- ----------------- ----------
1          HUGE_BUFFER_POOLS 4
2          BUFFER_POOLS      17

Standby core file:
root@CJC-DB-02:/dm8/core#ls -lrth
......
-rw------- 1 dmdba dinstall 36G May 7 06:10 core-dm_tskwrk_thd-1051914-5

Analyzing the core file:
gdb /dm8/dbms/bin_debug/dmserver_s.debug core-dm_tskwrk_thd-1051914-5
set logging file /dm8/core/20260507dm.log
set logging on
thread apply all bt
set logging off

Further analysis with dmrdc:
dmrdc sfile=core-dm_tskwrk_thd-1051914-5 dfile=result_20260507.txt

View the result:
more result_20260507.txt

Its contents:
dmdba@CJC-DB-02:/dm8/core$cat result_20260507.txt
!#%*^$[1927253]:backup database full backupset CJCDB02_BAK_2026_5_6_17_02_23 device type tape parms 1778058143883xxx
!#%*^$[-1]:ALTER DATABASE OPEN FORCE

The core analysis shows that the standby's automatic restart was related to the backup database operation, so the backup logs were examined next.

Third-party backup tool log:
The backup command shows that this was a remote backup initiated by a third-party backup tool. Check the backup log:
vi /var/log/XXXBackup/XXClientService/AggregateApp/XXX_dmdba/xxxproc.log
05-06 17:02:23.916606 xxxproc[1926672][1927252](info)(dmlogmoduleschedule ncDMThread.cpp:68): begin - ncDMBackupThread::run ()
05-06 17:02:23.916663 xxxproc[1926672][1927252](info)(dmlogmoduleschedule ncDMBackupExec.cpp:370): ncDMBackupExec::OnMessage (msg : 开始备份数据库: CJCDB02。)
05-06 17:02:23.916872 xxxproc[1926672][1927252](info)(dmlogmoduleschedule ncDMThread.cpp:89): backupset:CJCDB02_BAK_2026_5_6_17_02_23
05-06 17:02:23.916879 xxxproc[1926672][1927252](info)(dmlogmoduleschedule ncDMThread.cpp:135): begin - dmdbManager-ExecuteCommand (cmd: backup database full backupset CJCDB02_BAK_2026_5_6_17_02_23 device type tape parms 1778058143883xxx)
05-06 17:02:23.916884 xxxproc[1926672][1927252](info)(dmlogmoduleschedule ncDMThread.cpp:136): server:CJCDB02,user:sysdba,port:15239
05-06 17:02:23.916891 xxxproc[1926672][1927252](info)(dmlogmoduleschedule ncDMCore.cpp:134): ncDMCore::Init (server: CJCDB02, username: sysdba)
05-06 17:02:23.916895 xxxproc[1926672][1927252](info)(dmlogmoduleschedule ncDMLoadDll.cpp:62): begin - ncLoadDpiModule ()
05-06 17:02:23.916899 xxxproc[1926672][1927252](info)(dmlogmoduleschedule ncDMCore.cpp:154): ncDMCore::Login ()
05-06 17:02:23.916902 xxxproc[1926672][1927252](info)(dmlogmoduleschedule ncDMCore.cpp:76): ncInitDMEnv::initEnvironment ()
05-06 17:02:23.920248 xxxproc[1926672][1927252](info)(dmlogmoduleschedule ncDMDbManager.cpp:591): ncDMDbManager::ExecuteCommand begin ....
05-06 17:02:37.792926 xxxproc[1926672][1926672](info)(dmlogmoduleschedule ncDMBackupExec.cpp:1336): ncDMBackupExec::sendDataBlockObject (_offset :1185054208)
05-06 17:02:37.971540 xxxproc[1926672][1926672](info)(dmlogmoduleschedule ncDMBackupExec.cpp:1361): sendDataBlockObject ----------MSGID:4, msgLength: 8552960
05-06 17:02:42.949607 xxxproc[1926672][1927163](info)(eossclient ncMsgClientHandler.cpp:154): this 0x0xfffce0298e80, error 接收数据时发生错误连接将断开。错误提供者POSIX System错误值104错误位置ncNetIOHandler.cpp:411
05-06 17:02:42.949794 xxxproc[1926672][1927163](info)(eossclient ncMsgClientHandler.cpp:154): this 0x0xfffce03ebcc0, error 接收数据时发生错误连接将断开。错误提供者POSIX System错误值104错误位置ncNetIOHandler.cpp:411
05-06 17:02:47.950486 xxxproc[1926672][1929350](erro)(netclient ncTransportClient.cpp:365connect): connect failed, server ip: 192.0.0.101 port: 9660, error id: 111, client id: 3
05-06 17:02:52.951297 xxxproc[1926672][1929350](erro)(netclient ncTransportClient.cpp:365connect): connect failed, server ip: 192.0.0.101 port: 9660, error id: 111, client id: 3
05-06 17:02:57.951818 xxxproc[1926672][1929350](erro)(netclient ncTransportClient.cpp:365connect): connect failed, server ip: 192.0.0.101 port: 9660, error id: 111, client id: 6
......

(The Chinese messages in this log read "Starting database backup: CJCDB02." and "An error occurred while receiving data; the connection will be closed. Error provider: POSIX System, error value: 104, error location: ncNetIOHandler.cpp:411".)

root@CJC-DB-02:/var/log/XXXBackup/XXClientService/AggregateApp/XXX_dmdba#cat xxxproc.log |grep "backup database"|more

The output shows that only the backup on May 6 was started at 17:00; every other backup was started at 04:30:
......
03-01 04:30:07.791004 xxxproc[856324][856661](info)(dmlogmoduleschedule ncDMThread.cpp:135): begin - dmdbManager-ExecuteCommand (cmd: backup database full backupset CJCDB02_BAK_2026_3_1_04_30_07 device type tape parms 1772310607754844)
03-01 04:30:07.794611 xxxproc[856324][856661](info)(dmlogmoduleschedule ncDMCore.cpp:246): ncDMCore::ExecuteSqlCmd (cmd: backup database full backupset CJCDB02_BAK_2026_3_1_04_30_07 device type tape parms 1772310607754844)
......
04-30 04:30:27.792102 xxxproc[1295746][1296122](info)(dmlogmoduleschedule ncDMThread.cpp:141): end - dmdbManager-ExecuteCommand (cmd: backup database full backupset CJCDB02_BAK_2026_4_30_04_30_08 device type tape parms 1777494608091309)
05-06 17:02:23.916879 xxxproc[1926672][1927252](info)(dmlogmoduleschedule ncDMThread.cpp:135): begin - dmdbManager-ExecuteCommand (cmd: backup database full backupset CJCDB02_BAK_2026_5_6_17_02_23 device type tape parms 1778058143883xxx)
05-06 17:02:23.920265 xxxproc[1926672][1927252](info)(dmlogmoduleschedule ncDMCore.cpp:246): ncDMCore::ExecuteSqlCmd (cmd: backup database full backupset CJCDB02_BAK_2026_5_6_17_02_23 device type tape parms 1778058143883xxx)
05-07 04:30:08.295221 xxxproc[2072719][2073130](info)(dmlogmoduleschedule ncDMThread.cpp:135): begin - dmdbManager-ExecuteCommand (cmd: backup database full backupset CJCDB02_BAK_2026_5_7_04_30_08 device type tape parms 1778099408266038)
05-07 04:30:08.298617 xxxproc[2072719][2073130](info)(dmlogmoduleschedule ncDMCore.cpp:246): ncDMCore::ExecuteSqlCmd (cmd: backup database full backupset CJCDB02_BAK_2026_5_7_04_30_08 device type tape parms 1778099408266038)

DM standby backup job log:
The backup failed, but the backup process kept retrying and did not end until 2026-05-07 06:08:21.
vi dm_BAKRES_202605.log
2026-05-06 17:02:23 [CMD] database P0001927253 PPID4294967295 backup database full backupset CJCDB02_BAK_2026_5_6_17_02_23 device type tape parms 1778058143883xxx
2026-05-06 17:02:23 [CMD] database P0001927253 PPID4294967295 BACKUP DATABASE [CJCDB]
2026-05-06 17:02:23 [INFO] database P0001927253 PPID4294967295 CMD START....
2026-05-06 17:02:23 [INFO] database P0001927253 PPID4294967295 BACKUP DATABASE [CJCDB],execute......
2026-05-06 17:02:23 [INFO] database P0001927253 PPID4294967295 check limits of huge data
2026-05-06 17:02:23 [INFO] database P0001927253 PPID4294967295 CHECK LSN BEGIN ......
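2026-05-07 06:06:38 [INFO] dmap_br P0001927270 PPID4294967295 baker_cmd_process receive bakres cmd [CONTINUE]
2026-05-07 06:08:21 [ERROR] dmap_br P0001927270 PPID4294967295 baker_cmd_process failed to recive message by handle 5
2026-05-07 06:08:21 [INFO] dmap_br P0001927270 PPID4294967295 baker_cmd_process enqueue task BRTSK_TSK_COMPLETE
2026-05-07 06:08:21 [INFO] dmap_br P0001927270 PPID4294967295 baker_end bakres CMD [END] begin

Primary load check:
During the incident window the primary was running a batch job: every day at 2:30 the domain controller (域控) synchronizes data from the data port (数据港) into the DM database, which involves a large volume of UPDATE and similar operations.

Root cause:
1. At 2026-05-06 17:02:23 the standby DM database started a backup database operation. The third-party backup tool reported that the backup had failed and ended, but the database job log shows that the backup process had not ended and kept running until 2026-05-07 06:06. When the third-party tool ran the remote DM backup, compatibility or similar issues made writes to the tape library extremely slow, so the backup ran far too long. S-lock waits gradually accumulated on the standby and eventually hit a DM bug that crashed the instance, which then restarted automatically. After the restart the backup process was gone and the standby returned to normal.
2. Shortly after 02:00 on 2026-05-07 the primary ran its batch job, which generated a large number of transactions along with some lock-blocking alerts. While this new data was being synchronized to the standby, the backup database operation was degrading the standby's performance, so log replay on the standby became abnormally slow and a large backlog of unapplied transactions built up until the REDOS_BUF_NUM threshold was reached. To avoid dragging down the primary, the standby automatically stopped receiving logs from the primary, which caused the primary's SUSPEND status and turned the remote synchronous archive status INVALID.
3. After the standby restarted automatically at 2026-05-07 06:10:22 it returned to normal and quickly worked through the backlog of transactions. Primary/standby synchronization then recovered and the remote synchronous archive status returned to VALID.

No.  Time                 Event
1    2026-05-06 17:02:23  Standby: backup started.
2    2026-05-07 02:00     Primary: batch job started.
3    2026-05-07 02:22     Standby: the ongoing backup degraded performance and, combined with the primary's batch job, the transaction backlog on the standby grew until it reached the REDOS_BUF_NUM threshold; the standby automatically stopped synchronizing logs from the primary.
4    2026-05-07 02:37     Primary: detected the standby anomaly and switched to SUSPEND to protect the data; after checking the standby's status it returned to OPEN almost immediately. Because the standby was no longer receiving the primary's logs, the primary's remote archive status became INVALID.
5    2026-05-07 06:06     Standby: the backup could not complete for a long time, and an instance restart was triggered automatically.
6    2026-05-07 06:10     Standby: after the restart the backup process was gone, the backlog of transactions was synchronized quickly, and primary/standby synchronization returned to normal.

Solution:
1. Stop using the third-party backup tool's interface-level backup of the DM database and switch to file-level backup.
2. Reduce the concurrency of the batch job.
3. Consider increasing REDOS_PARALLEL_NUM, which sets the number of threads used for parallel log replay, to speed up log replay on the standby.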
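
The buffer comparison made in the log analysis can be reproduced with a short script. The following is a minimal sketch in Python, not a DM tool: the function names are hypothetical, the limits are hard-coded from the show parameter output quoted in this article, and it assumes that redos_buf_used is reported in bytes and that a limit counts as hit once the observed value exceeds it (the server's exact comparison is not documented here).

```python
import re

# Standby log entry quoted in the log analysis section.
LOG_LINE = (
    "2026-05-07 02:22:55.904 [INFO] database P0001051914 T0000000000001052493 "
    "rapply_redos_wait, redos_buf_used 17305600, redos_buf_num 4097, "
    "total_wait_cnt 0, need wait the existing tasks to be applied firstly..."
)

# Limits as reported by "show parameter" in this article.
REDOS_BUF_SIZE_MB = 1024
REDOS_BUF_NUM = 4096


def parse_rapply_wait(line):
    """Return (bytes used, buffer count) from a rapply_redos_wait entry, or None."""
    m = re.search(r"redos_buf_used (\d+), redos_buf_num (\d+)", line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def limits_exceeded(used_bytes, buf_num):
    """Name the limits that the observed values exceed (assumed semantics)."""
    hit = []
    if used_bytes > REDOS_BUF_SIZE_MB * 1024 * 1024:
        hit.append("REDOS_BUF_SIZE")
    if buf_num > REDOS_BUF_NUM:
        hit.append("REDOS_BUF_NUM")
    return hit


used_bytes, buf_num = parse_rapply_wait(LOG_LINE)
used_mb = used_bytes / (1024 * 1024)
limits = limits_exceeded(used_bytes, buf_num)

print(f"memory: {used_mb:.1f} MB used of {REDOS_BUF_SIZE_MB} MB")
print(f"buffers: {buf_num} queued, limit {REDOS_BUF_NUM}")
print("limits exceeded:", limits)
```

For the quoted entry only the buffer count is over its limit, which matches the conclusion that REDOS_BUF_NUM, not REDOS_BUF_SIZE, was the constraint that was hit.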
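
The intervals implied by the timestamps in this article can be checked with simple arithmetic. The sketch below uses only timestamps quoted above; the variable names are illustrative. One observation it supports: the standby's rapply_redos_wait message and the primary's "timeout overtime 900s" warning are almost exactly 900 seconds apart. Reading that as the primary waiting out a 900-second timeout after the standby stalled is an interpretation, not something the logs state outright.

```python
from datetime import datetime


def ts(text):
    """Parse a timestamp with or without fractional seconds."""
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in text else "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(text, fmt)


def seconds_between(start, end):
    """Elapsed seconds from start to end, both given as strings."""
    return (ts(end) - ts(start)).total_seconds()


# Standby rapply_redos_wait message -> primary archive timeout warning.
replay_wait_to_timeout = seconds_between(
    "2026-05-07 02:22:55.904", "2026-05-07 02:37:55.987"
)

# Primary SUSPEND window, from the symptoms list.
suspend_window = seconds_between("2026-05-07 02:37:56", "2026-05-07 02:39:11")

# Remote archive INVALID window, from the symptoms list.
archive_invalid_window = seconds_between(
    "2026-05-07 02:40:04", "2026-05-07 06:10:30"
)

print(f"replay wait to timeout warning: {replay_wait_to_timeout:.3f} s")
print(f"primary SUSPEND window: {suspend_window:.0f} s")
print(f"archive INVALID window: {archive_invalid_window / 3600:.2f} h")
```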
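
The manual grep used to spot the off-schedule backup can also be scripted. This is a sketch under stated assumptions: it relies on the "begin - dmdbManager-ExecuteCommand" line format seen in the xxxproc.log excerpts in this article, treats 04:30 as the expected start time because that is what the article reports for every other backup, and uses a hypothetical function name. The sample lines are copied from the excerpts; in practice they would be read from the log file.

```python
import re

# Matches the "begin" line of a backup command and captures date, hour, minute.
BEGIN_RE = re.compile(
    r"^(\d{2}-\d{2}) (\d{2}):(\d{2}):\d{2}\.\d+ .*"
    r"begin - dmdbManager-ExecuteCommand \(cmd: backup database"
)

SAMPLE_LINES = [
    "03-01 04:30:07.791004 xxxproc[856324][856661](info)(dmlogmoduleschedule "
    "ncDMThread.cpp:135): begin - dmdbManager-ExecuteCommand (cmd: backup database "
    "full backupset CJCDB02_BAK_2026_3_1_04_30_07 device type tape parms 1772310607754844)",
    "05-06 17:02:23.916879 xxxproc[1926672][1927252](info)(dmlogmoduleschedule "
    "ncDMThread.cpp:135): begin - dmdbManager-ExecuteCommand (cmd: backup database "
    "full backupset CJCDB02_BAK_2026_5_6_17_02_23 device type tape parms 1778058143883xxx)",
    "05-07 04:30:08.295221 xxxproc[2072719][2073130](info)(dmlogmoduleschedule "
    "ncDMThread.cpp:135): begin - dmdbManager-ExecuteCommand (cmd: backup database "
    "full backupset CJCDB02_BAK_2026_5_7_04_30_08 device type tape parms 1778099408266038)",
]


def off_schedule_backups(lines, hour=4, minute=30):
    """Return 'MM-DD HH:MM' for every backup start outside the expected minute."""
    found = []
    for line in lines:
        m = BEGIN_RE.match(line)
        if m and (int(m.group(2)), int(m.group(3))) != (hour, minute):
            found.append(f"{m.group(1)} {m.group(2)}:{m.group(3)}")
    return found


print(off_schedule_backups(SAMPLE_LINES))
```

Matching on the exact minute is deliberately strict; a real check would probably allow a small tolerance around the scheduled time.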