

How to solve CRM data deduplication dilemmas

David Taber | Oct. 2, 2013
Unlike other enterprise software systems, CRM systems face duplicates as their biggest data quality issue. Dedupe incorrectly, though, and bad data poltergeists and doppelgangers will come back to haunt you.

Unless you have an explicit data deduplication strategy in place, your CRM system is almost certain to have some level of duplicate records. There are a few reasons why:

  • Humans may not search for records before adding new contacts, leads or accounts. Even if your CRM system has some sort of duplicate-alert system, not everyone will pay attention to it - and it may not work on mobile devices anyway.
  • Data import tools may not successfully identify duplicate records or may import them anyway, though some use a "potential duplicate" flag.
  • Integrations with outside sources such as website registration forms, partner portals, message brokers or other applications may not query the CRM data before inserting new records. In some situations, even if the external source detects a duplicate, it can't update the existing record and must create a new one.
  • Several types of administrative errors and software bugs - both inside the CRM system and in associated applications - can rapidly produce thousands of duplicate records.

There's nothing wrong with having 2 percent duplicate records, as long as they are short-lived and your tools and processes detect and correct them. Get beyond 5 percent or so, though, and users will start to complain. Reports will become misleading. Data updates will get lost, since users won't be able to find the changes they made just the other day. System credibility will start to plummet. Systems with 25 percent duplicate records, moreover, can threaten careers.
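To know whether you're at the harmless 2 percent or the career-threatening 25 percent, you first need a way to measure the duplicate rate. Here's a minimal sketch that counts records sharing a normalized matching key; the `email` field and the sample data are illustrative, not from any particular CRM:

```python
# A sketch of measuring a duplicate rate: the fraction of records whose
# normalized key value appears more than once. The "email" key is a
# hypothetical example of a matching field.
from collections import Counter

def duplicate_rate(records, key="email"):
    """Fraction of records whose key value appears more than once."""
    counts = Counter((r.get(key) or "").strip().lower() for r in records)
    dupes = sum(n for value, n in counts.items() if value and n > 1)
    return dupes / len(records) if records else 0.0

contacts = [
    {"email": "ann@example.com"},
    {"email": "Ann@Example.com "},  # same person, different casing/spacing
    {"email": "bob@example.com"},
]
print(duplicate_rate(contacts))  # 2 of 3 records share a key
```

Real systems match on several fields at once, but even a single-key measurement like this gives you a baseline to track as the cleanup progresses.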

Detecting, Correcting Data Duplicates Takes Time

As with any type of data pollution, the correction cycle will take a while. You need to develop a methodical get-well plan with expectations set correctly for budget and schedule.

Start by using the best available duplicate detection tools that warn users while they enter data. Before you deploy any tool, test that it doesn't throw such severe errors that your integrations with external systems are blocked.
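The core of such an entry-time warning is a search-before-insert check. Here's a hedged sketch: before adding a new contact, it flags existing records with the same normalized email or a very similar name. The field names, threshold, and matching logic are illustrative assumptions, not any specific CRM vendor's API:

```python
# A sketch of an entry-time duplicate warning: before inserting a new
# contact, flag existing records that share a normalized email or have a
# very similar name. Field names and the 0.9 threshold are hypothetical.
import difflib

def normalize(s):
    return " ".join((s or "").lower().split())

def potential_duplicates(new_rec, existing, name_threshold=0.9):
    hits = []
    for rec in existing:
        # Exact match on normalized email (ignoring empty emails).
        if normalize(rec.get("email")) == normalize(new_rec.get("email")) != "":
            hits.append(rec)
        # Fuzzy match on normalized name.
        elif difflib.SequenceMatcher(
            None, normalize(rec.get("name")), normalize(new_rec.get("name"))
        ).ratio() >= name_threshold:
            hits.append(rec)
    return hits
```

A warning built on a check like this should surface the candidates to the user rather than hard-block the insert; as noted above, a check that throws severe errors can break your integrations with external systems.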

Next, analyze the systemic sources of duplicate records. Narrow them down to one or two external systems, and get part of the team working to cut down the inflow of new duplicate records. Rinse, lather and repeat as necessary.

While that's going on, analyze the heuristics of the duplication. Hopefully only one or two tables are involved, but remember that each table will have its own (set of) patterns. Since data deduplication requires an iterative approach, you'll want to get the team members familiar with the tools and processes on the lowest-risk tables. "Lowest risk" varies a lot, but leads or activities are typically the best candidates, as they tend to have the fewest number of other records pointing to them.

In analyzing the heuristics of duplicates, you need to understand four things:

  • The most reliable way to detect a potential duplicate "pair";
  • The best way to identify the "winner" in the merge cycle;
  • Which parts of the "loser" record need to be preserved; and
  • How to deal with pointers to and from the loser record.
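The last three decisions come together in the merge step itself. This sketch illustrates one common set of answers, under stated assumptions: the pair has already been detected, the "winner" is the most recently updated record, non-empty "loser" fields fill the winner's gaps, and child records are repointed to the winner. All field names are hypothetical:

```python
# A sketch of merging a detected duplicate pair. Assumptions: records are
# dicts with "id" and "updated" fields; the most recently updated record
# wins; children reference contacts via a hypothetical "contact_id" field.
def merge_pair(a, b, children):
    winner, loser = (a, b) if a["updated"] >= b["updated"] else (b, a)
    # Preserve loser data: copy fields the winner is missing or left blank.
    for field, value in loser.items():
        if value and not winner.get(field):
            winner[field] = value
    # Repoint child records (activities, opportunities) to the winner.
    for child in children:
        if child["contact_id"] == loser["id"]:
            child["contact_id"] = winner["id"]
    return winner, loser["id"]  # loser id can now be retired
```

Your own answers to the four questions will differ by table, which is exactly why the heuristics need to be analyzed per table before any bulk merge is run.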


