Replication: The replication agent has not logged a progress message in 10 minutes.

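When I opened Replication Monitor, the Subscription Watch List tab showed a large number of yellow warnings with status = "Performance critical", and latency was very high. My sixth sense told me something had gone wrong. With no one to ask for help, I could only force myself to calm down and deal with it.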

1. Open the details of a subscription with status = "Performance critical" and look at the action messages on the Distributor to Subscriber History tab.

The earliest action message reads as follows:

The replication agent has not logged a progress message in 10 minutes. This might indicate an unresponsive agent or high system activity. Verify that records are being replicated to the destination and that connections to the Subscriber, Publisher, and Distributor are still active.

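The stated cause is that the replication agent did not log a single progress message in 10 minutes, which I did not fully understand at first.

On MSDN, I found that error MSSQL_ENG020554 matches the description in the action message above: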

Message Details 

The replication agent has not logged a progress message in %ld minutes. This might indicate an unresponsive agent or high system activity. Verify that records are being replicated to the destination and that connections to the Subscriber, Publisher, and Distributor are still active.

Explanation   

The Replication agents checkup job runs at a specified interval (10 minutes by default) to check on the status of each replication agent. If an agent has not logged any progress messages since the last time the agent checkup job ran, error MSSQL_ENG020554 can be raised. The agent is expected at least to log history messages even if no other replication activity is occurring. Although the replication agent is not responding as expected, it has not necessarily stopped or failed (if an agent has failed, error MSSQL_ENG020536 should be raised).

The following issues can cause error MSSQL_ENG020554 to be raised:

  • The agent is busy.

    If the agent is too busy to respond when polled by the agent checkup job, the agent checkup job cannot report whether the replication agent is functioning properly. There are a number of reasons why the replication agent could be busy: there might be a lot of data being replicated, or there might be application design or configuration issues that result in processes that run for a long time.

  • The agent cannot log in to one of the computers in the topology.

    All agents have a parameter -LoginTimeOut (set to 15 seconds by default), which governs how long an agent attempts to log in to a replication node, such as a Merge Agent logging in to the Publisher. If the -LoginTimeOut value is set higher than the interval at which the replication agent checkup job runs, a login problem could be the root cause of the error: error MSSQL_ENG020554 is raised before the agent is able to raise a more specific error.
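One way to verify whether an agent is still logging progress is to read its most recent history rows directly. The following is a minimal sketch, not part of the MSDN text. It assumes the distribution database has the default name `distribution` and that the agent in question is a Distribution Agent; other agent types write to different history tables.

```sql
-- Assumption: the distribution database uses the default name "distribution".
USE distribution;
GO

-- Most recent history messages logged by Distribution Agents.
-- A stale logged_at value suggests the agent has not reported progress lately.
SELECT TOP (20)
       a.name      AS agent_name,
       h.runstatus,               -- 3 = in progress, 4 = idle, 6 = failed
       h.[time]    AS logged_at,
       h.delivered_commands,
       h.delivery_latency,
       h.comments
FROM dbo.MSdistribution_history AS h
JOIN dbo.MSdistribution_agents  AS a
  ON a.id = h.agent_id
ORDER BY h.[time] DESC;
```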

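Then I remembered: the day before, I had updated a large table, dbo.dt_Test, touching roughly 160 million rows. The volume of data that replication had to push was probably too large, which made the agent too busy.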
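A backlog like this can be estimated with `sp_replmonitorsubscriptionpendingcmds`, which reports the number of undelivered commands for one subscription. This is only a sketch: every server, database, and publication name below is a hypothetical placeholder, and the default distribution database name `distribution` is assumed.

```sql
-- Run at the Distributor, in the distribution database
-- (default name "distribution" assumed).
USE distribution;
GO

-- All names below are hypothetical placeholders; replace them with the
-- values from your own replication topology.
EXEC sp_replmonitorsubscriptionpendingcmds
     @publisher         = N'PublisherServer',
     @publisher_db      = N'PublisherDb',
     @publication       = N'PublicationName',
     @subscriber        = N'SubscriberServer',
     @subscriber_db     = N'SubscriberDb',
     @subscription_type = 0;   -- 0 = push, 1 = pull
```

The result set contains the pending command count and an estimated processing time in seconds, which gives a rough idea of how long the agent will stay busy.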

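Looking at the statements currently running on the DW server, I found SQL Server executing a replication stored procedure, sp_MSupd_dbodt_test, which applies updates to the table dbo.dt_Test.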
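The post does not say how the running statements were inspected. One common approach, shown here as a sketch to run on the Subscriber, is to query the `sys.dm_exec_requests` dynamic management view:

```sql
-- List currently executing requests, longest-running first.
SELECT r.session_id,
       r.status,
       r.command,
       r.start_time,
       r.total_elapsed_time / 1000      AS elapsed_seconds,
       DB_NAME(r.database_id)           AS database_name,
       OBJECT_NAME(t.objectid, t.dbid)  AS object_name,  -- NULL for ad hoc batches
       t.text                           AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID             -- exclude this query itself
ORDER BY r.total_elapsed_time DESC;
```

A long-running row whose object name is a replication procedure such as sp_MSupd_dbodt_test would be consistent with the Distribution Agent still applying the large update.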

With the root cause identified, I searched Google and found a solution:

This error message is generated because of the Distribution heartbeat interval property. This property governs how long an agent can run without logging a progress message. If your replication agents are not reporting an error message and you are seeing the above message, then you could change your heartbeat interval to a higher value. Another possible cause is that you changed the history logging option for your replication agent so that it doesn’t log any messages.

exec sp_changedistributor_property 
        @property = 'heartbeat_interval', 
        @value = <value in minutes>;

USE master
GO

exec sp_changedistributor_property 
        @property = 'heartbeat_interval', 
        @value = 30;

Set it to 30 minutes and check. If you still face the issue, change it to 60 minutes.

 


Posted 2015-12-25 15:04 by 悦光阴