May 29, 2017 at 4:00 pm #17492
I am working with a 4-node replication mesh. All of the nodes have gotten into a state where the checkpoint/cleaner process runs continuously, generating new files. On a node where nothing has been done to correct the issue, it has been running for 9 days and is causing a load average of around 25 on the system, with a lot of I/O.
On a node where a restart of the ldap service was done to try to correct the issue, it is still running at high load for 6 days.
On a node where the system was rebooted, the high load came back for about 4 hours, after which the load returned to normal and the cleaner operated normally.
I will be rebooting the remaining nodes shortly to see if that will clear this, but I’m looking for a root cause which isn’t clear from the evidence so far.
Is this something anyone has seen?
May 30, 2017 at 12:09 pm #17495 JnRouvignac (Participant)
- This topic was modified 3 years, 2 months ago by tcooper.
Which OpenDJ version are you running?
This is a known issue with JE.
See https://bugster.forgerock.org/jira/browse/OPENDJ-3283 for example.
May 30, 2017 at 3:13 pm #17515
We’re running 2.6.2.
The problem above doesn’t seem to line up with our symptoms. In our case there is no disk space growth, except when we try to run a backup, which never finishes and eventually runs out of space because the cleaner creates roughly 200 files per hour.
The cleaner appears to be doing exactly what it is supposed to, but it just doesn’t ever stop doing it. It never seems to catch up with what it needs to clean.
We have a high I/O wait on this system, so I’m speculating that the cause may simply be slow I/O. Having said that, the systems ran normally for a long time, and then one by one, over a period of a couple of weeks, they all went into this state of high load average due to continuous cleaner operation.
Terry
June 1, 2017 at 2:18 pm #17549
Further data on this:
The one system that is currently running normally and is the primary active system is generating 5-15 clean operations per hour. The highest in recent days was 29.
The systems which have the problem are doing 300-400 clean operations per hour. At this time all of these systems are seeing only replication traffic from the primary system.
Terry
June 6, 2017 at 1:23 pm #17598
So the problem has been found.
The systems which had this issue were originally installed with an older version of OpenDJ and subsequently upgraded to 2.6.2. As a result of some other load issues, we decided to increase the database log file size from the older default of 10 MB to 100 MB and also adjusted some internal cache parameters. This resolved the load issues we were having at the time. That was roughly a month ago, and now we have the problem described above.
It appears that the issue is that the checkpoint value was still at the old default of 20 MB, which seems not to work well with 100 MB files. Using the 2.6.2 default of 500 MB seems to resolve the issue.
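To make the mismatch concrete, here is a small back-of-the-envelope sketch in Python. It assumes the checkpoint value is a bytes-written interval (a checkpoint is triggered after roughly that much data is written to the log), and it only compares the ratio of file size to checkpoint interval for the three settings mentioned above. The function name and labels are made up for illustration. This is plain arithmetic, not a model of how the JE cleaner actually behaves.

```python
MB = 1024 * 1024


def checkpoints_per_log_file(log_file_max_bytes, checkpoint_interval_bytes):
    """Rough number of checkpoints triggered while one log file's worth of data is written.

    Assumes the checkpoint interval is measured in bytes written to the log.
    """
    return log_file_max_bytes / checkpoint_interval_bytes


# (log file size, checkpoint interval) for the three states described in this thread
settings = {
    "old defaults (10 MB file, 20 MB checkpoint)": (10 * MB, 20 * MB),
    "problem state (100 MB file, 20 MB checkpoint)": (100 * MB, 20 * MB),
    "after the fix (100 MB file, 500 MB checkpoint)": (100 * MB, 500 * MB),
}

for label, (file_size, interval) in settings.items():
    ratio = checkpoints_per_log_file(file_size, interval)
    print(f"{label}: {ratio:.1f} checkpoints per log file")
```

Under that assumption, the problem state is the only one of the three where several checkpoints fall inside a single log file (about 5), whereas both the old defaults and the corrected setting keep the ratio below 1. Whether that ratio is the actual mechanism behind the continuous cleaning is not established here; it is only consistent with what we observed.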