April 27, 2015 at 5:17 pm #4038
I’m trying to understand in what kind of situation a Directory Server will time out while waiting for monitor data from a Replication Server, as shown below.
[xx/xxx/2015:11:14:28 +0800] category=SYNC severity=SEVERE_WARNING msgID=14811242 msg=Timed out waiting for monitor data for the domain “cn=admin data” from replication server RS(30506)
[xx/xxx/2015:11:14:28 +0800] category=SYNC severity=SEVERE_WARNING msgID=14811242 msg=Timed out waiting for monitor data for the domain “cn=schema” from replication server RS(30506)
[xx/xxx/2015:11:14:28 +0800] category=SYNC severity=SEVERE_WARNING msgID=14811242 msg=Timed out waiting for monitor data for the domain “dc=openam-cts,o=xxxx” from replication server RS(30506)
[xx/xxx/2015:11:14:28 +0800] category=SYNC severity=SEVERE_WARNING msgID=14811233 msg=Directory server DS(25173) is closing its connection to replication server RS(29306) at xxxx/127.0.0.1:9999 for domain “cn=schema” because it could not detect a heart beat
[xx/xxx/2015:11:16:43 +0800] category=SYNC severity=NOTICE msgID=15139019 msg=Monitor data for the domain “cn=admin data” has been received from replication server RS(30506)
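To see how often these timeouts occur and which replication server they point at, the warnings can be tallied with a short script. This is only a minimal sketch: the pattern is derived from the log lines quoted above, the function name `count_monitor_timeouts` is made up for illustration, and the real log format may differ between OpenDJ versions.

```python
import re
from collections import Counter

# Pattern derived from the log lines quoted in this thread (an assumption;
# other OpenDJ versions may word the message differently). Both straight
# and curly quotes around the domain name are accepted.
TIMEOUT_RE = re.compile(
    r'msg=Timed out waiting for monitor data for the domain '
    r'["“](?P<domain>[^"”]+)["”] from replication server RS\((?P<rs>\d+)\)'
)

def count_monitor_timeouts(lines):
    """Count timeout warnings per (replication server id, domain)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("rs"), match.group("domain"))] += 1
    return counts
```

It could be fed the replication log line by line, for example `count_monitor_timeouts(open(path, encoding="utf-8"))`, where `path` is wherever your instance writes its replication log.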
I do observe that the connection subsequently resumes automatically. But are there any parameters that we can tune to prevent this situation? What I am experiencing now is that while the heartbeat is lost, the Directory Server becomes unresponsive until it attaches to a new Replication Server in the configured topology.
Any suggestions?
April 27, 2015 at 6:10 pm #4041
Which version of OpenDJ?
The replication service has a timeout of 5 seconds for getting monitor data, but there might be some contention when computing the data for several replicas. I believe we’ve fixed and improved this in OpenDJ 2.6.
Ludo
April 28, 2015 at 5:45 am #4045
Chee Chong
April 29, 2015 at 9:37 pm #4066
Hi Chee Chong,
I cannot think of any specific tuning that would avoid this. However, this is not a normal situation, so I would suggest that you open a support ticket to have one of our engineers take a deeper look at the logs and configuration, and possibly find a resolution.
Ludo
April 30, 2015 at 2:26 am #4067
Will do. Thank you!
Chee Chong
June 25, 2015 at 3:52 pm #4539 RB Participant
Hi Chee Chong,
Did you find an answer for this?
I’m seeing the same msgID=14811233.
Rob.
June 29, 2015 at 3:16 am #4568
We moved the pair of CTS servers away from the pair of Config + User stores and the issue went away.
Chee Chong
June 29, 2015 at 4:31 pm #4586
One of the issues with monitoring data is that it’s rebuilt for every read. When an instance is quite busy, rebuilding the monitoring data may take longer than replication expects (5 seconds).
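The effect described here, a rebuild that takes longer than the caller is willing to wait, can be illustrated with a small sketch. This is not OpenDJ code: `build_monitor_data` and `request_monitor_data` are hypothetical stand-ins, and the sleep simply simulates a busy instance. Only the 5-second figure comes from this thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def build_monitor_data(work_seconds):
    """Hypothetical stand-in for rebuilding monitor data on every read."""
    time.sleep(work_seconds)  # simulates the cost of the rebuild on a busy server
    return {"status": "ok"}

def request_monitor_data(work_seconds, timeout_seconds=5.0):
    """Return the monitor data, or None if the rebuild outlasts the timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(build_monitor_data, work_seconds)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # This is the point where a caller would log a timeout warning.
        return None
    finally:
        # Do not block on the still-running rebuild.
        pool.shutdown(wait=False)
```

With a fast rebuild, `request_monitor_data(0.01, 0.5)` returns the data; with a slow one, `request_monitor_data(0.3, 0.05)` returns `None`, even though the rebuild itself eventually completes in the background. That matches the pattern in the logs above, where a timeout warning is later followed by a notice that the monitor data was received.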
There might be one option to reduce the time needed to build the monitoring data, but I’m not sure it would be enough.
June 30, 2015 at 3:43 am #4601
I think this scenario only happens in highly loaded environments. The same customer has another deployment with a pair of CTS servers installed on the same boxes as a pair of Config + User stores, i.e. (1 x CTS) + (1 x Config + User Store) on Box A and (1 x CTS) + (1 x Config + User Store) on Box B.
As the load there is not that high, there has been no issue in the month since we cut over.
I suspect the hardware was under-provisioned for the previous deployment, which had a much higher load. *OK, I’m only guessing, since the customer does not allow me to test/debug further :> *
Chee Chong
November 7, 2016 at 12:05 pm #14092 Sarris Overbosch Participant
I know it is an old post, but I’m seeing the same messages in my replication log. Unlike the cases described above, this system is not yet in use (it is still being installed), so I wonder how this is possible or, better, what I can do to solve the problem. Alongside this message I also see messages about the SSL handshake failing: Remote host closed connection during handshake.
So my first thought was that there was something wrong with the certificates etc., but then I cannot explain the messages “Late monitor data received from domain….” and “Monitor data for the domain … has been received from replication server RS(….)”
Can this behavior be caused by slow network connectivity?
November 10, 2016 at 1:57 pm #14192
Yes, slow network connectivity (like that typically found when testing with basic Amazon VMs) can cause a lot of timeout messages regarding monitoring data, as well as a few “Handshake errors”.
November 9, 2018 at 3:31 pm #23830 mvandenberg Participant
I know this is a very old thread, but we’re running a very old version (2.6.2) and we have this issue.
We have a two node replication cluster which right now is generating one or two of these monitor data timeouts per day. It is always coming from node2 with the error referring to the RS on node1, and this seems a bit odd to me. In this case node1 is the primary node that all the updates go into, there isn’t any load balancing, just failover. These are bare metal CentOS 6 machines running one instance of OpenDJ on each. They are connected to the same switch so there’s no significant network overhead.
We’ve always seen these errors, and we have another 4 node configuration which is geographically diverse (2 nodes in one city and 2 nodes in another) and we see the errors on them too, but much more rarely than we’re seeing it now in the two node config. The rate of the errors increased about 3 weeks ago in the two node system for no apparent reason. All of these systems are very lightly loaded.
Anyone have any ideas on what the cause/fix might be?
November 9, 2018 at 5:50 pm #23833
Sorry but OpenDJ 2.6.2 is very old (we’re about to release 6.5.0).
The warning is about a response message arriving later than expected. There are many possible causes, some legit, some probably due to issues with the code.
The fix is probably to upgrade to a much newer version. Replication is much faster, more reliable, and more efficient in the latest releases.