OpenDJ cleaner running continuously

This topic has 4 replies, 2 voices, and was last updated 3 years, 2 months ago by tcooper.

  • #17492
     tcooper
    Participant

    I am working with a 4-node replication mesh. All of the nodes have gotten into a state where the checkpoint/cleaner process runs continuously, generating new files. On a node where nothing has been done to correct the issue, it has been running for 9 days and is causing a load average of around 25 on the system, with a lot of I/O.
    On a node where the LDAP service was restarted to try to correct the issue, it has still been running at high load for 6 days.
    On a node where the system was rebooted, the high load came back for about 4 hours, after which the load returned to normal and the cleaner operated normally.
    I will be rebooting the remaining nodes shortly to see if that will clear this, but I’m looking for a root cause which isn’t clear from the evidence so far.

    Is this something anyone has seen?

    Thanks,

    Terry

    • This topic was modified 3 years, 2 months ago by tcooper.
    #17495
     JnRouvignac
    Participant

    Which OpenDJ version are you running?

    This is a known issue with JE.
    See https://bugster.forgerock.org/jira/browse/OPENDJ-3283 for example

    #17515
     tcooper
    Participant

    We’re running 2.6.2.
    The problem above doesn’t seem to line up with our symptoms. In our case there is no disk space growth, except when we try to run a backup: the backup doesn’t stop until it runs out of space, because the cleaner is creating roughly 200 files per hour.
    The cleaner appears to be doing exactly what it is supposed to do, but it never stops. It never seems to catch up with what it needs to clean.
    We have high I/O wait on this system, so I’m speculating that the cause may simply be slow I/O. Having said that, the systems ran normally for a long time, and then one by one, over a period of a couple of weeks, they all went into this state of high load average due to continuous cleaner operation.

    Thanks,

    Terry

    #17549
     tcooper
    Participant

    Further data on this:

    The one system that is currently running normally and is the primary active system is generating 5-15 clean operations per hour. The highest in recent days was 29.
    The systems which have the problem are doing 300-400 clean operations per hour. At this time all of these systems are seeing only replication traffic from the primary system.
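
For anyone wanting to compare nodes the same way, here is a rough sketch of how file churn per hour could be measured. It only approximates the figures above: it assumes the backend database files end in `.jdb` (the usual JE log file suffix) and it buckets files by modification time, which is not the same as counting cleaner runs. The function name and directory are hypothetical.

```python
import os
from collections import Counter
from datetime import datetime, timezone


def files_per_hour(directory, suffix=".jdb"):
    """Count files ending in `suffix`, grouped by the UTC hour of their mtime.

    Assumption: modification time is a reasonable stand-in for when the
    file was last written. It is not a direct count of cleaner operations.
    """
    counts = Counter()
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith(suffix):
                mtime = datetime.fromtimestamp(entry.stat().st_mtime, tz=timezone.utc)
                counts[mtime.strftime("%Y-%m-%d %H:00")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python files_per_hour.py /path/to/backend/db/directory
    for hour, count in sorted(files_per_hour(sys.argv[1]).items()):
        print(hour, count)
```

Running this on a healthy node and on an affected node should make a difference like 5-15 versus 300-400 per hour easy to spot, though files the cleaner has already deleted will not be counted.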

    Terry

    #17598
     tcooper
    Participant

    So the problem has been found.

    The systems which had this issue were originally installed with an older version of OpenDJ and subsequently upgraded to 2.6.2. As a result of some other load issues we decided to increase the database file size from the older default of 10MB to 100MB and also adjusted some internal cache parameters. This resolved the load issues we were having at the time. That was roughly a month ago, and now we have the problem described above.
    It appears that the issue is that the checkpoint value was still at the old default of 20MB, which seems not to work well with 100MB files. Using the 2.6.2 default of 500MB seems to resolve the issue.
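
To make the mismatch concrete, here is a small arithmetic sketch using only the sizes quoted in this post. It assumes the checkpoint value is a byte interval (a checkpoint is triggered after that many bytes have been written), and that the two settings are the backend's `db-log-file-max` and `db-checkpointer-bytes-interval` properties. Both are assumptions on my part, so please check them against your own configuration. The ratio is just an illustration, not a documented tuning rule.

```python
def parse_size_mb(text):
    """Parse a size such as '10M', '20MB' or '500mb' into whole megabytes."""
    value = text.strip().lower()
    for suffix in ("mb", "m"):
        if value.endswith(suffix):
            return int(value[: -len(suffix)])
    raise ValueError(f"unsupported size: {text!r}")


def checkpoints_per_file(file_size, checkpoint_interval):
    """Return how many checkpoint intervals fit into one database file."""
    return parse_size_mb(file_size) / parse_size_mb(checkpoint_interval)


# (file size, checkpoint interval) pairs taken from the post above.
SETTINGS = {
    "old defaults": ("10M", "20MB"),
    "after file size change": ("100M", "20MB"),
    "with 2.6.2 checkpoint default": ("100M", "500MB"),
}

if __name__ == "__main__":
    for label, (file_size, interval) in SETTINGS.items():
        ratio = checkpoints_per_file(file_size, interval)
        print(f"{label}: {ratio:g} checkpoint intervals per file")
```

With the old defaults a checkpoint interval spans two files, and with the 2.6.2 checkpoint default it spans five, but the combination we had been running packed five checkpoint intervals into every single file.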

    Terry
