
A Detailed Guide to Using Entity Resolution Tools for Enterprise Projects

dataladder.com

Dirty, unstructured data, a dozen or more variations of the same name, and inconsistent field definitions across disparate sources: this can of worms is a staple occupational hazard for any data analyst working on a project involving thousands of records. And the implications are anything but ordinary.

What is Entity Resolution?

The book Entity Resolution and Information Quality describes entity resolution (ER) as ‘determining when references to real-world entities are equivalent (refer to the same entity) or not equivalent (refer to different entities)’.

In other words, it is the process of identifying and linking multiple records that refer to the same entity even when they are described differently, and of keeping apart records that look similar but refer to different entities.

For example, it asks the question: are data entries ‘Jon Snow’ and ‘John Snowden’ the same person or are they two different people entirely?
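
As a rough illustration of how that question can be put to a program, the sketch below scores the two names with Python's standard-library difflib. The function name name_similarity and the 0.9 threshold are assumptions made for this example, not part of any particular ER tool.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


score = name_similarity("Jon Snow", "John Snowden")

# The 0.9 cut-off is an arbitrary value chosen for illustration only.
is_probable_match = score >= 0.9

print(f"similarity={score:.2f}, probable match={is_probable_match}")
```

A name score on its own rarely settles the question, which is why other fields and identifiers are usually brought into the comparison as well.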

This also applies to addresses, postal and zip codes, social security numbers, etc.

ER works by comparing records for similarity and checking them against unique identifiers. These are fields that are least likely to change over time (such as social security numbers, dates of birth, postal codes, etc.). Deciding whether records refer to the same entity involves matching them against such an identifier.

For example, records for John Oneil, Johnathan O, and Johny O’neal can all be matched through a shared unique identifier, the national ID number.
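
The identifier-based matching described here could be sketched as follows. The records, the field names, and the ID values are made up for illustration, and normalize_id is a hypothetical helper that assumes the identifier is purely numeric.

```python
from collections import defaultdict

# Hypothetical records; field names and ID values are invented for this sketch.
records = [
    {"name": "John Oneil", "national_id": "123-45-6789"},
    {"name": "Johnathan O", "national_id": "123-45-6789"},
    {"name": "Johny O'neal", "national_id": "123456789"},
    {"name": "Jane Doe", "national_id": "987-65-4321"},
]


def normalize_id(raw: str) -> str:
    """Keep only digits so formatting differences do not hide a match."""
    return "".join(ch for ch in raw if ch.isdigit())


def group_by_identifier(rows):
    """Map each normalized identifier to the list of names that carry it."""
    groups = defaultdict(list)
    for row in rows:
        groups[normalize_id(row["national_id"])].append(row["name"])
    return dict(groups)


entities = group_by_identifier(records)
for identifier, names in entities.items():
    print(identifier, names)
```

All three spellings of the first person end up under one identifier, while the unrelated record stays in its own group.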

ER usually consists of linking and matching data across multiple records to find possible duplicates and then removing the matched duplicates, which is why the term is often used interchangeably with record linkage and deduplication.

How Entity Resolution Works in Practice

There are several steps involved in an ER activity. Let’s look at these in more detail.

Ingestion

This involves bringing all data from multiple sources into one centralized view. An enterprise often has data scattered across disparate databases, CRMs, Excel files, and PDFs, and stored in different formats, including strings, dates, or a mix of both.
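
A minimal sketch of this step, assuming two hypothetical exports (a CSV file from a CRM and a JSON feed from a web application) whose column names differ. The source names, columns, and the FIELD_MAPS table are all assumptions for illustration.

```python
import csv
import io
import json

# Hypothetical exports from two systems; the column names are assumptions.
crm_csv = "Full Name,Email\nJohn Oneil,john@example.com\n"
web_json = '[{"name": "Johny O\'neal", "mail": "johny@example.com"}]'

# Map each source's column names onto one shared schema.
FIELD_MAPS = {
    "crm": {"Full Name": "name", "Email": "email"},
    "web": {"name": "name", "mail": "email"},
}


def to_common_schema(rows, source):
    """Rename known columns and tag every row with the source it came from."""
    mapping = FIELD_MAPS[source]
    return [
        {"source": source, **{mapping[k]: v for k, v in row.items() if k in mapping}}
        for row in rows
    ]


unified = (
    to_common_schema(csv.DictReader(io.StringIO(crm_csv)), "crm")
    + to_common_schema(json.loads(web_json), "web")
)

for row in unified:
    print(row)
```

Keeping the source tag on every row makes it possible to trace a merged entity back to the systems it came from.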

Profiling

After the data sources are imported, the next step is to check their health and identify statistical anomalies such as missing or inaccurate data and casing issues (i.e., inconsistent lowercase and uppercase). Ideally, a data analyst will try to find potential problem areas that need to be fixed before doing any data cleansing or entity resolution.

Here a user may want to check whether the fields conform to regular expressions (RegEx) that define the expected string pattern for each data field. Based on this, the user can determine how many records are unclean or don’t conform to a set encoding.

Doing so can help reveal crucial data statistics including but not limited to:

  • Presence of null values, e.g., missing email addresses in lead gen forms
  • Number of records with leading and trailing spaces, e.g., “ David Matthews ”
  • Punctuation issues, e.g., hotmail,com instead of hotmail.com
  • Casing issues, e.g., nEW yORK, dAVID mATTHEWS, MICROSOFT
  • Presence of letters in numbers and vice versa, e.g., TEL-516 570-9251 for a contact number and NJ43 for a state
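
The checks listed above can be sketched with Python's re module. The sample rows, the patterns in PATTERNS, and the rule that names should be title-cased are assumptions for illustration; real rules depend on the data set.

```python
import re

# Hypothetical sample rows; None stands for a missing value.
rows = [
    {"name": " David Matthews ", "email": "david@hotmail,com",
     "phone": "TEL-516 570-9251", "state": "NJ43"},
    {"name": "dAVID mATTHEWS", "email": None,
     "phone": "516-570-9251", "state": "NJ"},
]

# Illustrative patterns only, not a complete validation of any field.
PATTERNS = {
    "email": re.compile(r"[^@\s,]+@[^@\s,]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\d{3}-\d{3}-\d{4}"),
    "state": re.compile(r"[A-Z]{2}"),
}
TITLE_CASE_FIELDS = {"name"}  # assumption: names are expected in title case


def profile(rows):
    """Count nulls, padded values, pattern violations, and odd casing per field."""
    report = {}
    for row in rows:
        for field, value in row.items():
            stats = report.setdefault(
                field, {"null": 0, "padded": 0, "invalid": 0, "odd_case": 0}
            )
            if value is None or value == "":
                stats["null"] += 1
                continue
            trimmed = value.strip()
            if value != trimmed:
                stats["padded"] += 1
            pattern = PATTERNS.get(field)
            if pattern and not pattern.fullmatch(trimmed):
                stats["invalid"] += 1
            if field in TITLE_CASE_FIELDS and trimmed != trimmed.title():
                stats["odd_case"] += 1
    return report


report = profile(rows)
for field, stats in report.items():
    print(field, stats)
```

The per-field counts give a quick picture of which columns need cleansing before any matching is attempted.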

Deduplication and Record Linking

Through matching, multiple records that potentially relate to the same entity are linked and then deduplicated, using unique identifiers where they are available. The matching technique (exact, fuzzy, or phonetic) can vary depending on the type of field.
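
To make the three matching styles concrete, here is a small sketch using only the standard library. The fuzzy matcher relies on difflib with an arbitrary 0.8 threshold, and the phonetic matcher is a compact American Soundex written for this example (it assumes ASCII letters). Dedicated tools typically use more refined algorithms.

```python
from difflib import SequenceMatcher

# Soundex digit for each coded consonant; vowels, y, h, and w are not coded.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}


def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (encoded + "000")[:4]


def exact_match(a: str, b: str) -> bool:
    return a.strip().lower() == b.strip().lower()


def fuzzy_match(a: str, b: str, threshold: float = 0.8) -> bool:
    # The threshold is an illustrative assumption, not a recommended value.
    ratio = SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
    return ratio >= threshold


def phonetic_match(a: str, b: str) -> bool:
    return soundex(a) != "" and soundex(a) == soundex(b)


for left, right in [("MICROSOFT", "microsoft "), ("Johnathan", "Jonathan"), ("Oneil", "O'neal")]:
    print(left, right, exact_match(left, right), fuzzy_match(left, right), phonetic_match(left, right))
```

Exact matching suits identifiers and codes, while fuzzy and phonetic matching are usually reserved for free-text fields such as names, where spelling varies.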

Canonicalization

Canonicalization is another key step in ER, in which entities that have multiple representations are converted into a standard form. It involves taking the most complete information as the final record and leaving out outliers or noisy values that could distort the data.
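
One possible way to build such a final record is sketched below: start from the most complete record in a cluster of matched duplicates and fill its gaps with the most common value seen elsewhere. The cluster data and this particular survivorship rule are assumptions for illustration; other rules (most recent value, most trusted source) are equally plausible.

```python
from collections import Counter

# Hypothetical records already judged to refer to the same entity.
cluster = [
    {"name": "John Oneil", "email": None, "city": "New York", "phone": "516-570-9251"},
    {"name": "Johnathan O", "email": "john@example.com", "city": None, "phone": None},
    {"name": "Johny O'neal", "email": None, "city": "New York", "phone": None},
]

EMPTY = (None, "")


def completeness(record):
    """Number of fields that actually hold a value."""
    return sum(1 for value in record.values() if value not in EMPTY)


def canonicalize(records):
    """Take the most complete record and fill its gaps from the others."""
    base = dict(max(records, key=completeness))
    for field, value in list(base.items()):
        if value in EMPTY:
            candidates = [r[field] for r in records if r.get(field) not in EMPTY]
            if candidates:
                # Most frequent non-empty value wins; rare values are left out.
                base[field] = Counter(candidates).most_common(1)[0][0]
    return base


golden = canonicalize(cluster)
print(golden)
```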

Blocking

When finding matches for an entity across thousands of records, the number of potential pairings can quickly run into the millions: comparing every record with every other record takes n(n-1)/2 comparisons for n records, which is already about 50 million for 10,000 records. To avoid this problem, blocking is used to limit the potential pairings using specific business rules.
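
A minimal sketch of blocking, assuming a hypothetical business rule that only records sharing a zip code and the first letter of the name are worth comparing. The records and the blocking_key rule are invented for this example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; ids, names, and zip codes are made up.
records = [
    {"id": 1, "name": "John Oneil", "zip": "10001"},
    {"id": 2, "name": "Johny O'neal", "zip": "10001"},
    {"id": 3, "name": "Jane Doe", "zip": "94105"},
    {"id": 4, "name": "Jon Snow", "zip": "94105"},
    {"id": 5, "name": "John Snowden", "zip": "60601"},
]


def blocking_key(record):
    """Assumed rule: same zip code and same first letter of the name."""
    return (record["zip"], record["name"].strip()[:1].upper())


def candidate_pairs(rows):
    """Pair up records only within the same block."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[blocking_key(row)].append(row["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


all_pairs = list(combinations([r["id"] for r in records], 2))
blocked_pairs = candidate_pairs(records)
print(len(all_pairs), "pairs without blocking;", len(blocked_pairs), "with blocking")
```

The pairs that survive blocking are only candidates and still have to go through matching. The trade-off is that an overly strict key can place two records for the same entity in different blocks, so they are never compared.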

4 Reasons Why Entity Resolution Tools Are Better

Entity resolution tools can provide many benefits that traditional ER can’t. These include:

1. Greater Match Accuracy

Dedicated entity resolution tools that have sophisticated fuzzy matching algorithms and entity resolving capabilities in place can give far better record linking and deduplication results than common ER algorithms.

2. Lower Time-to-First-Result

In most cases, time is critical for ER projects, especially in master data management (MDM) initiatives that require a single source of truth. The information relating to an entity can change within weeks or months, which can pose serious data quality risks.

3. Better Scalability

Entity resolution tools are far more adept at ingesting data from multiple points and running record linkage, deduplication, and cleansing tasks at a much larger scale.

4. Cost Savings

Entity resolution tools, particularly for enterprise-level applications, can require a sizable investment. Data professionals tasked with ER may be reluctant to consider them for this reason alone.

How to Choose the Right Entity Resolution Software

Choosing the right entity resolution software is equally important. Entity resolution tools differ in their features, scope, and value. Enterprises can have data stored in a wide variety of formats and sources such as Excel, delimited files, web applications, databases, and CRMs. Entity resolution software must be capable of importing data from the disparate sources relevant to the specific use case.


Originally posted at https://datafloq.com/read/detailed-guide-using-entity-resolution-tools-enterprise-projects/

