Replication Server: Timed out waiting for monitor data

This topic has 12 replies, 5 voices, and was last updated 3 years, 10 months ago by Ludo.

  • #4038
     cheechong
    Participant

    I’m trying to understand in what kind of situation a Directory Server would time out while waiting for monitor data from a Replication Server, as shown below.

    [xx/xxx/2015:11:14:28 +0800] category=SYNC severity=SEVERE_WARNING msgID=14811242 msg=Timed out waiting for monitor data for the domain “cn=admin data” from replication server RS(30506)
    [xx/xxx/2015:11:14:28 +0800] category=SYNC severity=SEVERE_WARNING msgID=14811242 msg=Timed out waiting for monitor data for the domain “cn=schema” from replication server RS(30506)
    [xx/xxx/2015:11:14:28 +0800] category=SYNC severity=SEVERE_WARNING msgID=14811242 msg=Timed out waiting for monitor data for the domain “dc=openam-cts,o=xxxx” from replication server RS(30506)
    [xx/xxx/2015:11:14:28 +0800] category=SYNC severity=SEVERE_WARNING msgID=14811233 msg=Directory server DS(25173) is closing its connection to replication server RS(29306) at xxxx/127.0.0.1:9999 for domain “cn=schema” because it could not detect a heart beat
    :
    :
    [xx/xxx/2015:11:16:43 +0800] category=SYNC severity=NOTICE msgID=15139019 msg=Monitor data for the domain “cn=admin data” has been received from replication server RS(30506)

    I do observe that the connection subsequently resumes automatically. But are there any parameters that we can tune to prevent this from happening? What I am experiencing now is that, while the heartbeat is lost, the Directory Server becomes unresponsive until it attaches to another Replication Server in the configured topology.

    Any suggestions?
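
For readers triaging similar excerpts, here is a minimal Python sketch that tallies replication log lines by msgID and by the replication server ID mentioned in the message. It assumes the line layout quoted above (`[timestamp] category=… severity=… msgID=… msg=…`); the function names `parse_line` and `count_by_msg_and_rs` are hypothetical helpers for illustration, not part of OpenDJ.

```python
import re
from collections import Counter

# Assumed layout, taken from the log lines quoted in this thread:
# [timestamp] category=X severity=Y msgID=N msg=free text
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+category=(?P<category>\S+)\s+"
    r"severity=(?P<severity>\S+)\s+msgID=(?P<msg_id>\d+)\s+msg=(?P<msg>.*)$"
)
# Replication server IDs appear in the message text as RS(<number>).
RS_RE = re.compile(r"RS\((\d+)\)")


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    rs = RS_RE.search(record["msg"])
    record["rs_id"] = rs.group(1) if rs else None
    return record


def count_by_msg_and_rs(lines):
    """Count matching lines per (msgID, replication server ID) pair."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[(record["msg_id"], record["rs_id"])] += 1
    return counts
```

Calling `count_by_msg_and_rs(open(path))` on a replication log (the path is whatever your deployment uses) gives a quick view of which replication server the timeouts refer to and how often they occur.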

    #4041
     Ludo
    Moderator

    Hi Cheechong,

    Which version of OpenDJ?
    The replication service has a timeout of 5 seconds for getting monitor data, but there might be some contention when computing the data for several replicas. I believe we’ve fixed and improved this in OpenDJ 2.6.

    Ludo

    #4045
     cheechong
    Participant

    Hi Ludo,

    2.6.2


    Chee Chong

    #4066
     Ludo
    Moderator

    Hi Chee Chong,

    I cannot think of a specific tuning that can avoid this. However, this is not a normal situation, so I would suggest that you open a support ticket to have one of our engineers take a deeper look at the logs and configuration, and possibly find a resolution.

    Ludo

    #4067
     cheechong
    Participant

    Hi Ludo,

    Will do. Thank you!


    Chee Chong

    #4539
     RB
    Participant

    Hi Chee Chong,

    Did you find an answer for this?
    I’m seeing the same msgID=14811233.

    Thanks,
    Rob.

    #4568
     cheechong
    Participant

    Hi Rob,

    We moved the pair of CTS servers away from the pair of Config + User stores and the issue went away.


    Chee Chong

    #4586
     Ludo
    Moderator

    One of the issues with monitoring data is that it’s rebuilt for every read. When an instance is quite busy, rebuilding the monitoring data may take longer than replication expects (it allows 5 seconds).
    There might be an option to reduce the time needed to build the monitoring data, but I’m not sure it would be enough.
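
To illustrate the pattern described in this post, here is a minimal Python sketch of a requester that waits for a reply up to a fixed deadline and then reports a late arrival. It is not OpenDJ's implementation: the function `request_monitor_data`, the event strings, and the use of a thread with a sleep to stand in for rebuilding the monitoring data are all assumptions made for illustration.

```python
import queue
import threading
import time


def request_monitor_data(build_seconds, timeout_seconds):
    """Return the events a requester would log for one monitor data request.

    build_seconds   -- how long the (simulated) responder takes to build data
    timeout_seconds -- how long the requester is willing to wait
    """
    replies = queue.Queue()

    def responder():
        # Stands in for a busy server rebuilding its monitoring data.
        time.sleep(build_seconds)
        replies.put("monitor-data")

    worker = threading.Thread(target=responder)
    worker.start()

    events = []
    try:
        replies.get(timeout=timeout_seconds)
        events.append("received")
    except queue.Empty:
        # The deadline passed first; the reply is slow, not lost.
        events.append("timed out")
        replies.get()  # block until the late reply finally arrives
        events.append("late data received")
    worker.join()
    return events
```

With a build time longer than the deadline, the sketch reports a timeout followed by a late reply, which matches the sequence of warning and notice messages quoted in this thread; with a short build time it reports a single normal receipt.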

    #4601
     cheechong
    Participant

    Hi Ludo,

    I think this scenario only happens in a highly loaded environment. The same customer has another deployment with a pair of CTS servers installed on the same boxes as a pair of Config + User stores, i.e. (1 x CTS) + (1 x Config + User Store) on Box A and (1 x CTS) + (1 x Config + User Store) on Box B.

    As the load is not that high, there has been no issue in the month since we cut over.

    I suspect the hardware was under-provisioned for the previous deployment, which had a much higher load. *OK, only guessing, since the customer does not allow me to test/debug further :> *


    Chee Chong

    #14092
     Sarris Overbosch
    Participant

    I know it is an old post, but I’m seeing the same messages in my replication log. Unlike the cases described above, this system is not yet in use (we are still installing it), so I wonder how this is possible or, better, what I can do to solve the problem. Alongside this message I also see messages about the SSL handshake failing: Remote host closed connection during handshake.

    So my first thought was that there was something wrong with the certificates, but then I cannot explain the messages “Late monitor data received from domain….” and “Monitor data for the domain … has been received from replication server RS(….)”

    Can this behavior be caused by slow network connectivity?

    #14192
     Ludo
    Moderator

    Yes, slow network connectivity (like that typically found when testing with basic Amazon VMs) can cause a lot of timeout messages regarding monitoring data, as well as a few “Handshake errors”.

    #23830
     mvandenberg
    Participant

    I know this is a very old thread, but we’re running a very old version (2.6.2) and we have this issue.
    We have a two node replication cluster which right now is generating one or two of these monitor data timeouts per day. It is always coming from node2 with the error referring to the RS on node1, and this seems a bit odd to me. In this case node1 is the primary node that all the updates go into, there isn’t any load balancing, just failover. These are bare metal CentOS 6 machines running one instance of OpenDJ on each. They are connected to the same switch so there’s no significant network overhead.

    We’ve always seen these errors, and we have another 4 node configuration which is geographically diverse (2 nodes in one city and 2 nodes in another) and we see the errors on them too, but much more rarely than we’re seeing it now in the two node config. The rate of the errors increased about 3 weeks ago in the two node system for no apparent reason. All of these systems are very lightly loaded.

    Anyone have any ideas on what the cause/fix might be?

    #23833
     Ludo
    Moderator

    Sorry but OpenDJ 2.6.2 is very old (we’re about to release 6.5.0).
    The warning is about a response message arriving later than expected. There are many possible causes, some legit, some probably due to issues with the code.
    The fix is probably to upgrade to a much newer version. Replication is much faster, more reliable, and more efficient in the latest releases.
