isMemberOf performance not scaling

This topic has 4 replies, 3 voices, and was last updated 5 months, 1 week ago by kcibul.


    We are using DS 6.0 to host users, resources and policy groups (which group a collection of users into a policy on a resource). We have:

    60k users (people)
    ~500k policy groups (membership varies widely, typically < 100, but a few groups have 10,000s)
    ~100k resources (typically 4-5 policies per resource)

    We run a query to get all groups for a member. What we observe is that not only is each query slow, but OpenDJ isn’t able to keep up with 20 connections running the same query:

    searchrate -p 389 -D "cn=Directory Manager" -w <SNIP> -F -c 20 -t 20 -b 'uid=203948203480239840,ou=people,dc=app-perf,dc=myorg,dc=org' -s base '(objectClass=*)' isMemberOf
    |     Throughput    |                 Response Time                |       Additional      |
    |    (ops/second)   |                (milliseconds)                |       Statistics      |
    |   recent  average |   recent  average    99.9%   99.99%  99.999% |  err/sec Entries/Srch |
    |      8.0      8.0 | 2691.553 2691.553  4278.19  4278.19  4278.19 |      0.0          1.0 |
    |      9.8      8.9 | 7387.723 5277.085  9797.89  9797.89  9797.89 |      0.0          1.0 |
    |      9.8      9.2 | 12573.557 7867.861 14696.84 14696.84 14696.84 |      0.0          1.0 |
    |     10.0      9.4 | 17817.248 10513.975 19998.44 19998.44 19998.44 |      0.0          1.0 |
    |      8.0      9.1 | 22560.430 12627.388 24159.19 24159.19 24159.19 |      0.0          1.0 |
    |     10.0      9.3 | 27347.099 15274.818 29527.90 29527.90 29527.90 |      0.0          1.0 |

    In practice we only need to assert membership in a small number of known groups, so we tried using a filter instead, but we see the same scaling problem:

    searchrate -p 389 -D "cn=Directory Manager" -w <SNIP> -F -c 20 -t 20 -b 'uid=203948203480239840,ou=people,dc=app-perf,dc=myorg,dc=org' -s base '(isMemberOf=policy=owner,resourceId=0f65252d-e79e-4ce7-8348-183eabbb6a42,resourceType=space,ou=resources,dc=app-perf,dc=myorg,dc=org)' dn
    |     Throughput    |                 Response Time                |       Additional      |
    |    (ops/second)   |                (milliseconds)                |       Statistics      |
    |   recent  average |   recent  average    99.9%   99.99%  99.999% |  err/sec Entries/Srch |
    |      6.0      6.0 | 3867.690 3867.690  4966.06  4966.06  4966.06 |      0.0          1.0 |
    |     10.0      8.0 | 7640.095 6225.443  9529.46  9529.46  9529.46 |      0.0          1.0 |
    |     10.0      8.7 | 12349.473 8580.839 14898.17 14898.17 14898.17 |      0.0          1.0 |
    |     11.6      9.4 | 17457.774 11319.468 19864.22 19864.22 19864.22 |      0.0          1.0 |

    We’re running on a 16-core server with 24 GB of RAM. We see ~60-80% CPU load during this test, no disk reads, and just a few disk writes (probably logs etc.).

    It’s not clear what’s happening in the evaluation of the virtual attribute, but it seems like it’s scanning a ton of data.

    In a JVM thread dump, a number of the worker threads are somewhere in this call stack (rooted at DistinguishedNameEqualityMatchingRuleImpl):

    at org.forgerock.opendj.ldap.Ava.delimitAndEvaluateEscape
    at org.forgerock.opendj.ldap.Ava.readAttributeValue
    at org.forgerock.opendj.ldap.Ava.readAttributeValue
    at org.forgerock.opendj.ldap.Ava.decode
    at org.forgerock.opendj.ldap.Rdn.decode
    at org.forgerock.opendj.ldap.Dn.decode
    at org.forgerock.opendj.ldap.Dn.valueOf
    at org.forgerock.opendj.ldap.schema.DistinguishedNameEqualityMatchingRuleImpl.normalizeAttributeValue
    at org.forgerock.opendj.ldap.schema.MatchingRule.normalizeAttributeValue
    at org.forgerock.opendj.ldap.AbstractAttribute.normalizeValue
    at org.forgerock.opendj.ldap.LinkedAttribute$MultiValueImpl.add
    at org.forgerock.opendj.ldap.LinkedAttribute.add
    at org.opends.server.backends.pluggable.ID2Entry$EntryCodec$2.decodeAttribute
    at org.opends.server.backends.pluggable.ID2Entry$EntryCodec$2.decodeAttributes
    at org.opends.server.backends.pluggable.ID2Entry$EntryCodec$2.decodeContent
    at org.opends.server.backends.pluggable.ID2Entry$EntryCodec.decode
    at org.opends.server.backends.pluggable.ID2Entry$EntryWrapperCodec.decodeV2
    at org.opends.server.backends.pluggable.ID2Entry$EntryWrapperCodec.decode
    at org.opends.server.backends.pluggable.ID2Entry.entryFromDatabase
    at org.opends.server.backends.pluggable.ID2Entry.get0
    at org.opends.server.backends.pluggable.ID2Entry.get
    at org.opends.server.backends.pluggable.EntryContainer.getEntry0
    at org.opends.server.backends.pluggable.EntryContainer.lambda$getEntry$10
    at org.opends.server.backends.pluggable.EntryContainer$$Lambda$328 (Unknown Source)
    at org.opends.server.backends.pluggable.EntryContainer.getEntry
    at org.opends.server.backends.pluggable.BackendImpl.getEntry
    at org.opends.server.core.DirectoryServer.getEntry
    at org.opends.server.core.CompareOperation.processCompare
    at org.opends.server.core.CompareOperation.processLocalCompare
    at org.opends.server.extensions.TraditionalWorkQueue$

    First of all, the searchrate parameters you are using are flooding the directory with requests, which inflates the response time: -c 20 -t 20 creates 20 asynchronous requests per connection across 20 connections, i.e. 400 outstanding requests. Since the machine only has 16 CPUs, that significantly overloads it.
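    As a sanity check, it may be worth comparing against a run whose total concurrency matches the CPU count, for example 16 connections with one outstanding request each (a sketch reusing the placeholder credentials and base DN from above):

    ```shell
    # Hypothetical comparison run: 16 connections x 1 async request each = 16 in flight,
    # matching the 16 CPUs instead of the 400 in-flight requests of -c 20 -t 20.
    searchrate -p 389 -D "cn=Directory Manager" -w <SNIP> -F -c 16 -t 1 \
      -b 'uid=203948203480239840,ou=people,dc=app-perf,dc=myorg,dc=org' \
      -s base '(objectClass=*)' isMemberOf
    ```

    If per-operation latency drops back to something reasonable at this concurrency, the earlier numbers were dominated by queueing rather than by the cost of a single search.
    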

    But I think that having more groups (500K) than users (60K) is not what Directory Services was designed for, and it is therefore not specifically optimized for this shape of data. There are surely a few ways to improve on what you are experiencing, but I do not know whether that will be enough to meet your expectations.


    First of all – thank you so much for your response and time, it’s greatly appreciated.

    That’s helpful to know about the connection rate. I’m curious how the async requests work when multiplexed over a single connection, especially when the query time is long; I’d imagine they queue up on the client side waiting for a turn. Looking at the searchrate output, in addition to the slow response times we’re only getting ~10 ops/second, so how much load (in terms of requests) do you think we’re really pushing?
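    Little’s law gives a rough answer to that question: the number of requests actually in flight equals throughput times response time. Plugging in the last row of the first searchrate table (~9.3 ops/s at an average of ~15 275 ms):

    ```shell
    # Little's law: in-flight requests ~= throughput (ops/s) * avg response time (s).
    # Numbers taken from the last row of the first searchrate table above.
    awk 'BEGIN { printf "%.0f\n", 9.3 * 15.274818 }'
    # prints 142
    ```

    So roughly 140 requests are genuinely in flight, well below the 400 the client is configured to keep outstanding; the rest of the latency is queueing on one side or the other.
    
    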

    It’s good to hear your conclusion about the appropriateness of DS for this problem, and we’ve begun prototyping different data stores to that end. I’m basically trying to bridge the performance gap we see in production today (where this query can take 30+ seconds) until we can get there… so any possible hints/tweaks would be greatly appreciated.

    Knowing what’s happening in the code, would simply throwing more CPUs at the problem help as a temporary measure?

    Finally, when we reformulate the above query to be something like
    searchrate -p 389 -D "cn=Directory Manager" -w <SNIP> -F -c 20 -t 20 -b 'ou=people,dc=app-perf,dc=myorg,dc=org' -s base '(&(uid=203948203480239840)(isMemberOf=policy=owner,resourceId=0f65252d-e79e-4ce7-8348-183eabbb6a42,resourceType=space,ou=resources,dc=app-perf,dc=myorg,dc=org))' dn

    basically moving the uid into the filter, we mostly see the same results in searchrate. BUT we do have a few specific cases (where the user is a member of that group, so not a miss) where the performance is constant at ~3 seconds. The fast and slow groups both have a single member. What other factors could be at play?
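    One way to compare the fast and slow cases is OpenDJ’s debugsearchindex diagnostic attribute: requesting it makes the backend return a description of how it evaluated the filter against its indexes instead of returning the entry. A sketch, reusing the placeholder credentials from above (note the subtree scope, so the uid term of the filter actually applies):

    ```shell
    # Ask the backend how it evaluated the filter rather than returning the entry.
    # Comparing this output for a "fast" vs a "slow" group may show whether one
    # path falls back to an unindexed scan.
    ldapsearch -p 389 -D "cn=Directory Manager" -w <SNIP> \
      -b 'ou=people,dc=app-perf,dc=myorg,dc=org' -s sub \
      '(&(uid=203948203480239840)(isMemberOf=policy=owner,resourceId=0f65252d-e79e-4ce7-8348-183eabbb6a42,resourceType=space,ou=resources,dc=app-perf,dc=myorg,dc=org))' \
      debugsearchindex
    ```
    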

    Thanks again!

     Chris Ridd

    The stack traces suggest that you might benefit from adding the largest (but only the largest) static groups into an entry cache. There are examples of creating FIFO or Soft Reference entry caches in the docs.

    That will avoid DS repeatedly reading a group out of the database (or database cache) and decoding it into an Entry. Decoding attributes such as member with large numbers of values is relatively expensive and at some point dominates the search time.

    This will naturally cost some memory, so you may need to adjust your heap if you have lots of groups in your entry cache.
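    A minimal sketch of that configuration (the cache name, include filter, and sizing are hypothetical, and the admin connection parameters are placeholders — check the DS entry cache documentation for the exact properties on your version):

    ```shell
    # Hypothetical FIFO entry cache; the include filter should be narrowed so it
    # matches only the largest groups, not every group entry.
    dsconfig create-entry-cache \
      --hostname localhost --port 4444 \
      --bindDN "cn=Directory Manager" --bindPassword <SNIP> \
      --cache-name "Large Groups Cache" \
      --type fifo \
      --set enabled:true \
      --set cache-level:1 \
      --set include-filter:"(objectClass=groupOfNames)" \
      --set max-memory-percent:25 \
      -X -n
    ```

    A Soft Reference entry cache is the alternative shown in the docs; it trades the hard FIFO bound for letting the garbage collector reclaim cached entries under memory pressure.
    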


    Thanks – I’ll check that out. Is there a downside (other than $$) to running with a huge entry cache just to get us through this transition period? We’re running on Google Cloud, and the entire contents of the database are only about 2 GB on disk.
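    If the whole ~2 GB database (plus decoded entries, which are larger in memory than on disk) is to sit in an entry cache on top of the database cache, the JVM heap likely needs a bump. On DS 6 the server heap is set in config/java.properties; the values below are illustrative, not a recommendation:

    ```shell
    # Illustrative only: give start-ds a 12 GB heap on the 24 GB machine,
    # leaving room for the OS page cache. Restart the server after editing.
    # (Older OpenDJ releases also required running bin/dsjavaproperties.)
    sed -i 's/^start-ds.java-args=.*/start-ds.java-args=-server -Xms12g -Xmx12g/' \
      config/java.properties
    ```
    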
