Most data deduplication tools these days operate on the data while it's in your database. This is typically less intrusive than a full export/clean/import cycle (although I'm sure I'll get several angry emails from vendors that use that approach). Even so, deduping is hardly zero impact. In fact, it's a major pain if done incorrectly, because there's no "undo." Extra care is mandatory.
CRM Data Deduplication Requires Methodical Approach
Basically, the data deduplication cycle looks like this:
- Do a full system backup. (Yes, every time.)
- Do your work in the system sandbox first, if you can. Once you've validated the approach, the tools and the results, you'll have to do the entire cycle again in the production instance.
- Keep a thorough log of every step (including when you did it) so that you can precisely repeat it (and troubleshoot in case things go wrong).
- Normalize all the records of the table in question, particularly for things such as state and country codes, picklist values and other items where clean field data will improve the quality of duplicate detection and winner/loser determination.
- Identify fields that you'll want to preserve both the "winner" and "loser" values, such as phone numbers, email, stage/status, owner and record type. Both records might be correct, even though they are different values. Once you've identified these vulnerable and valuable fields, concatenate them together and put them in a new text field on for each record in the table. The concatenation can be done with an ETL tool or with code inside the system. Just make sure this step is complete and correct before you move on.
- Set up your data deduplication tool with the fields and matching criteria that best identify potential duplicate records. In the first iteration, you want the criteria to be very tight. Run this part of the tool and see how complete are the duplicate candidates.
- Set up the merge rules and processing scenario that will identify the "winners" within each set of duplicates. Make sure the text field you created (with those concatenations) will be merged with "append mode" so that the loser's data is preserved. Review the merge results with users to identify any gotchas.
- Perform a test merge with these tight criteria. See if there are any unintended consequences. You may need to wait a day to make sure that daily batch updates don't have problems with the merged records or re-create the duplicates in the course of external-system synchronization.
- When you're satisfied with this pass in the sandbox, run through the process in production. Do it once. (Yes, once.) Again, look for side effects after one day.
- Assuming all goes well in production, back up the entire system again.
- Repeat this process with looser criteria. Typically, each deduplication cycle will process about half the total duplicates. You may need to run through the entire process four times to get duplication down to an acceptable level.
Sign up for Computerworld eNewsletters.