CSV Reconciliation Performance

This topic has 12 replies, 3 voices, and was last updated 5 years, 11 months ago by Jake Feasel.

  • #6164
     Matt Mencel
    Participant

    Currently my source data comes from CSV files with thousands of records. It takes over an hour to run a reconciliation on one file with only 14000 records in it. I have much bigger files still to process.

    I’ve tried doubling the Java Xmx and Xms settings to 2048m. That hasn’t helped, as far as I can see.

    Anyone have any additional performance suggestions? I’ve got OpenIDM running on a 4 vCPU VM with 4GB of RAM. Storage is very fast.

    Thanks,
    Matt

    #6166
     Jake Feasel
    Moderator

    What version of OpenIDM are you working with?

    #6167
     Matt Mencel
    Participant

    Forgot to mention that… 3.1.0

    #6168
     Jake Feasel
    Moderator

    Ok, and which repo are you using? Have you altered your user model at all, or the policies for managed/user? It is worth noting the “unique” policy applied by default to the userName field, which triggers a query against managed/user before each insertion. You will probably find that removing that “unique” policy dramatically improves performance, if you don’t need it during your bulk data loading.
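
    For reference, in a default OpenIDM 3.x managed.json the policy in question sits on the userName property. A rough sketch (your property names and policy list may differ); removing the “unique” entry is what eliminates the per-insert lookup:

    ```json
    "userName" : {
        "policies" : [
            {
                "policyId" : "unique"
            }
        ]
    }
    ```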

    #6170
     Matt Mencel
    Participant

    It’s actually a new managed object called MICROS. The CSV file I’m testing with is inventory data. The __NAME__ field is ‘tagnum’. I found that this is set in the file provisioner.openicf-MICROS.json:

    “uniqueAttribute” : “tagnum”

    So I should remove that? Or is there a way to ignore that just during full reconciliation?

    Thanks,
    Matt

    #6171
     Jake Feasel
    Moderator

    No, that is fine – it is different from the “unique” policy. Which repo are you using? How have you configured the table in the repo (generic vs. explicit table, which columns are indexed, etc.)?

    Also you may see improvements via increasing the “numThreads” option for your mapping.

    #6172
     Matt Mencel
    Participant

    OK, sorry about the confusion – I’m pretty new to this.

    I’m using MySQL. I looked through the repo.jdbc.json file and I’ve not created anything in there for my “MICROS” managed object. I see some stuff for managed/user and some defaults.

          "genericMapping" : {
                "managed/*" : {
                    "mainTable" : "managedobjects",
                    "propertiesTable" : "managedobjectproperties",
                    "searchableDefault" : true
                },
                "managed/user" : {
                    "mainTable" : "managedobjects",
                    "propertiesTable" : "managedobjectproperties",
                    "searchableDefault" : false,
                    "properties" : {
                        "/userName" : {
                            "searchable" : true
                        },
                        "/givenName" : {
                            "searchable" : true
                        },
                        "/sn" : {
                            "searchable" : true
                        },
                        "/mail" : {
                            "searchable" : true
                        },
                        "/accountStatus" : {
                            "searchable" : true
                        },
                        "/roles" : {
                            "searchable" : true
                        },
                        "/sunset" : {
                            "searchable" : true
                        }
                    }
                }
            },

    So if anything, my managed/micros object is just using whatever the defaults are. There’s apparently more to creating a new managed object than just adding it through the UI. I have to set up the MySQL table info in repo.jdbc.json too?

    EDIT: I’m reading about explicit mappings in the integrator’s guide now. That’s probably what I need to do to speed things up, instead of depending on the generic mapping, where all the data is serialized into a single column.

    Matt

    #6174
     Matt Mencel
    Participant

    So I can add an explicit mapping to repo.jdbc.json, something like this…

    "explicitMapping" : {
                "managed/micros" : {
                    "table" : "micros",
                    "objectToColumn" : {
                        "_id" : "objectid",
                        "_rev" : "rev",
                        "tagnum" : "tagnum",
                        "serial" : "serial"
                    }
                }
    },

    Do I need to specify every attribute from the CSV file? Or just the ones I might want to search/index on?

    Also, what then generates this table in the MySQL database? Do I have to do that too, or does this explicit mapping tell OpenIDM to go do that for me?

    Do I have to customize any of the SQL queries that are up in the explicit tables section?

    "explicitTables" : {
                "get-users-of-direct-role" : "select objectid from ${_dbSchema}.${_table} where find_in_set(${role},replace(substring(roles,2,(length(roles) - 2)),'\"',''))",
                "query-all-ids" : "SELECT objectid FROM ${_dbSchema}.${_table}",
                "query-all-ids-count" : "SELECT COUNT(objectid) AS total FROM ${_dbSchema}.${_mainTable} obj INNER JOIN ${_dbSchema}.objecttypes objtype ON obj.objecttypes_id = objtype.id WHERE objtype.objecttype = ${_resource}",
            "for-internalcredentials" : "select * FROM ${_dbSchema}.${_table} WHERE objectid = ${uid}",
            ...
    },

    #6176
     Jake Feasel
    Moderator

    Whether moving away from the generic mapping will help is hard to say. If you have a large number of properties in the CSV, serializing them all to the managedobjectproperties table could be the cause of some slowness. If there aren’t very many fields, this is less likely to be the issue. Explicit tables will probably be at least marginally faster, in any case.

    For explicit tables, you have to create the table in your repo yourself. You also need to be sure that every field you want to populate is specified in your objectToColumn map – indexing is separate, and also up to you to define as appropriate for your table structure and query usage. You only have to customize the explicit queries if you need something other than what the defaults provide (though you probably could just use queryFilter queries in most cases).
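
    As a sketch of what “create the table in your repo yourself” could look like for the micros example (column names follow the objectToColumn map discussed in this thread; the types, lengths, and index choices are assumptions to adapt to your data):

    ```sql
    -- Hypothetical DDL; adjust types and lengths to the real CSV data.
    CREATE TABLE micros (
        objectid VARCHAR(38) NOT NULL,
        rev      VARCHAR(38) NOT NULL,
        tagnum   VARCHAR(255),
        serial   VARCHAR(255),
        PRIMARY KEY (objectid)
    );

    -- Indexing is separate and up to you: index the columns you query on.
    CREATE INDEX idx_micros_tagnum ON micros (tagnum);
    ```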

    Do you have any other mappings which use this managed object as the source? If so, you are likely triggering a sync event after each insert, which (depending on the speed of the downstream resource being targeted) could slow the whole process down. You might find it useful to disable sync for these mappings while you are doing your bulk data-loading from CSV, and then do your recons to those systems after your managed objects are populated.
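
    As an illustration, turning off automatic sync for a downstream mapping is a small change in conf/sync.json (the mapping name and target here are hypothetical); with "enableSync" set to false, changes are only pushed when you launch a recon explicitly:

    ```json
    {
        "name" : "managedMicros_systemLdapAccount",
        "source" : "managed/micros",
        "target" : "system/ldap/account",
        "enableSync" : false
    }
    ```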

    #6178
     F3bio
    Participant

    In your conf/sync.json configuration, what value have you set for taskThreads? A value of 0 specifies that reconciliation is run on the main reconciliation thread, that is, in a serial manner.

    You can try increasing this value and see if it improves your reconciliation performance.
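
    For example, a mapping in conf/sync.json with the thread count raised above the default of 10 (the mapping name and source system here are hypothetical):

    ```json
    {
        "name" : "systemMicrosCsv_managedMicros",
        "source" : "system/MICROS/account",
        "target" : "managed/micros",
        "taskThreads" : 20
    }
    ```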

    #6185
     Matt Mencel
    Participant

    I didn’t have taskThreads specified in the config for the managed object, so I believe it used the default of 10, if I read the docs correctly. I decided to specify it and set it to 20.

    I created the table and set up the explicitMapping and all the objectToColumn maps. When I reconcile I can see data filling the new table in MySQL, so that works.

    It’s still pretty slow. It seems like the further along it gets in the reconciliation, the slower the inserts get. It may be what you said initially, Jake: maybe it’s verifying each insert to make sure it’s unique, and the further into the recon it gets, the slower it gets because there’s more data to search through.

    I don’t think I have any “unique” policy set on this managed object, is it on by default? Where would I see if that is enabled?

    Does creating an INDEX on the MySQL attributes help (or hurt)?

    Thanks,
    Matt

    #6188
     Matt Mencel
    Participant

    …and maybe solved…

    https://backstage.forgerock.com/#!/docs/openidm/3.1.0/integrators-guide/chap-synchronization#recon-query-optimization

    Reconciliation of 13500 records takes 1 minute and 20 seconds. That’s MUCH better!

    I tried this earlier but didn’t get it to work, but then I realized that I was using the CSV source attribute “tagnum” instead of “__NAME__”, which OpenIDM uses for the primary naming attribute. So now the query optimization works.

    From what I understand, it loads everything on the source and target into memory and then does the reconciliation.

    In my mapping I’ve added these….

    "targetQuery" : {
        "_queryFilter" : "(tagnum sw \"\")"
    },

    "sourceQuery" : {
        "_queryFilter" : "(__NAME__ sw \"\")"
    }
    #6189
     Jake Feasel
    Moderator

    Right, good idea. That is definitely going to improve things considerably. Glad you found a nice solution.
