Motivating Entity Resolution for Data Science

By Prasad Kulkarni, Oct 09, 2021, 02:59 am

Why Entity Resolution?

Data is often called the new oil. By that analogy, analytical models are the new combustion engines. An engine runs efficiently only on good fuel, and a model produces sensible results only from quality data. Hence, data needs to be refined before it is used. This idea drives Data-Centric AI, a movement led by Prof. Andrew Ng.

Enterprise data science, however, is still driven largely by relational databases. As organizations grow, their information systems become siloed, which leads to duplicate records. In addition, data quality issues, schema variations, and disparate data collection practices add to the ambiguity.

In a DBMS, an entity is a real-world object, such as a customer. Over time, this customer entity can accumulate multiple versions. Within the same database, the entity may have multiple records with different hospital types, addresses, etc.

Further, across databases, the entity may vary in structure and semantics. So, how do we reconcile these variations? The answer is Entity Resolution.

What is Entity Resolution?

Entity Resolution is the disambiguation of records that correspond to real-world entities, within and across databases. The three primary tasks involved in entity resolution are deduplication, record linkage, and canonicalization:

Deduplication: To identify duplicate data within the same source.
Record linkage: To identify records that reference the same entity across sources.
Canonicalization: To convert data with multiple representations into a standard form.

To make this concrete, let’s look at customer records in two databases, D1 and D2.

Database D1:

Let’s take an example of customer address records:

Customer Name	Address1	City	State	Zip
Aarogya Health	3rd cross, MG Road	Bengaluru	Karnataka	560093
aarogya Health	3rd Cross, MG Rd	Bangalore	KN	560093

We can tell that both rows refer to the same customer. However, there are minor variations in every column except Zip.

Database D2:

Let’s take an example of customer Contact records:

Customer Name	Website	Email	Contact
Aarogya Health	aarogyahealth.com	[email protected]	Dr ABC
aarogya Health	aarogyahealth.com	[email protected]	Dr AB

The two records point to the same Customer.

In these examples, identifying and marking similar records within the same database, either D1 or D2, is deduplication. Identifying and marking similar records across databases D1 and D2 is record linkage.
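
The distinction between the two tasks can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the helper names (`match_key`, `group_by_key`) are hypothetical, the matching key is just the case-folded customer name, and the Email column is left out. Real data would need a far more robust key.

```python
from collections import defaultdict

# Rows copied from the D1 and D2 tables above (Email column omitted).
d1 = [
    {"name": "Aarogya Health", "address1": "3rd cross, MG Road",
     "city": "Bengaluru", "state": "Karnataka", "zip": "560093"},
    {"name": "aarogya Health", "address1": "3rd Cross, MG Rd",
     "city": "Bangalore", "state": "KN", "zip": "560093"},
]
d2 = [
    {"name": "Aarogya Health", "website": "aarogyahealth.com", "contact": "Dr ABC"},
    {"name": "aarogya Health", "website": "aarogyahealth.com", "contact": "Dr AB"},
]


def match_key(record):
    """Hypothetical matching key: the customer name, trimmed and case-folded."""
    return record["name"].strip().casefold()


def group_by_key(records):
    """Map each matching key to the row indexes that share it."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[match_key(record)].append(index)
    return dict(groups)


# Deduplication: rows of the SAME database that share a key.
d1_duplicates = {key: rows for key, rows in group_by_key(d1).items() if len(rows) > 1}

# Record linkage: pairs of rows ACROSS D1 and D2 that share a key.
d2_groups = group_by_key(d2)
links = [(i, j) for i, rec in enumerate(d1) for j in d2_groups.get(match_key(rec), [])]

print(d1_duplicates)
print(links)
```

Here `d1_duplicates` collects row indexes within D1 that look like the same customer, while `links` pairs each D1 row index with the matching D2 row indexes.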

Lastly, we can see that one customer name starts with a lowercase letter, while the other is capitalized. Moreover, Address1 varies in letter case and in the use of short forms (Road vs. Rd). Bringing all the records to one standard form (e.g. lowercase, expanded abbreviations) is called canonicalization.
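
A minimal canonicalization sketch for the two D1 rows might look like the following. The lookup tables and function names are assumptions made for this example, and they cover only the variations visible in the table above. A production system would rely on much larger reference data.

```python
import re

# Hypothetical lookup tables, just large enough for the D1 example.
STREET_ABBREVIATIONS = {"rd": "road"}
CITY_ALIASES = {"bangalore": "bengaluru"}
STATE_ALIASES = {"kn": "karnataka"}


def normalize_text(value):
    """Lowercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", value.strip().lower())


def canonicalize_address(address):
    """Expand known street abbreviations word by word, keeping punctuation."""
    expanded = []
    for word in normalize_text(address).split(" "):
        bare = word.rstrip(".,")      # word without trailing punctuation
        suffix = word[len(bare):]     # the punctuation that was stripped
        expanded.append(STREET_ABBREVIATIONS.get(bare, bare) + suffix)
    return " ".join(expanded)


def canonicalize_record(record):
    """Return a copy of an address record in one standard form."""
    city = normalize_text(record["city"])
    state = normalize_text(record["state"])
    return {
        "name": normalize_text(record["name"]),
        "address1": canonicalize_address(record["address1"]),
        "city": CITY_ALIASES.get(city, city),
        "state": STATE_ALIASES.get(state, state),
        "zip": record["zip"].strip(),
    }


# The two rows from the D1 table.
rows = [
    {"name": "Aarogya Health", "address1": "3rd cross, MG Road",
     "city": "Bengaluru", "state": "Karnataka", "zip": "560093"},
    {"name": "aarogya Health", "address1": "3rd Cross, MG Rd",
     "city": "Bangalore", "state": "KN", "zip": "560093"},
]

canonical_rows = [canonicalize_record(row) for row in rows]
print(canonical_rows[0] == canonical_rows[1])
```

Once both rows are in the same standard form, they compare as equal, which makes the duplicate easy to detect with an exact comparison.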

How to perform Entity Resolution?

With such minor variations in the data, it is difficult to find duplicates within or across databases. Moreover, the problem worsens as the scale of data grows, because every new kind of variation needs its own rule. Hence, rule-based engines become impractical to build and maintain.

Fortunately, Machine Learning makes probabilistic entity matching possible. We strongly recommend reading our article on using Machine Learning to De-Duplicate Data. It is a hands-on tutorial on deduplication with Active Machine Learning, using the pandas-dedupe library. Additionally, to read more about Record Linkage using the same library, refer to this link. Notably, this implementation does not scale well.
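
To give a feel for what scoring a pair of records means, here is a small sketch that uses only Python's standard library. It is not how pandas-dedupe works: the field weights and the threshold below are hand-picked assumptions, whereas a learned matcher would estimate them from labeled pairs. The third record is invented purely for contrast.

```python
from difflib import SequenceMatcher

# Hypothetical, hand-picked field weights (they sum to 1).
FIELD_WEIGHTS = {"name": 0.4, "address1": 0.3, "city": 0.2, "zip": 0.1}
MATCH_THRESHOLD = 0.8  # hypothetical cut-off, not a tuned value


def field_similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a, rec_b):
    """Weighted average of per-field similarities."""
    return sum(
        weight * field_similarity(rec_a[field], rec_b[field])
        for field, weight in FIELD_WEIGHTS.items()
    )


def is_match(rec_a, rec_b):
    """Treat a pair as the same entity when the score reaches the threshold."""
    return match_score(rec_a, rec_b) >= MATCH_THRESHOLD


# The two D1 rows from the table above.
rec_a = {"name": "Aarogya Health", "address1": "3rd cross, MG Road",
         "city": "Bengaluru", "zip": "560093"}
rec_b = {"name": "aarogya Health", "address1": "3rd Cross, MG Rd",
         "city": "Bangalore", "zip": "560093"}
# An invented, unrelated record, included only for contrast.
rec_c = {"name": "Sunrise Clinic", "address1": "12 Lake View Street",
         "city": "Mysuru", "zip": "570001"}

print(round(match_score(rec_a, rec_b), 3), is_match(rec_a, rec_b))
print(round(match_score(rec_a, rec_c), 3), is_match(rec_a, rec_c))
```

The pair of D1 rows scores well above the unrelated pair even though no field other than Zip matches exactly, which is the behavior that exact-match rules cannot provide.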

Conclusion

Finally, this is not a comprehensive guide to Entity Resolution, since it is a big subject. Hence, we will expand upon this topic in the future. Also, please note that this article is for information only. We do not make any guarantees regarding its accuracy or completeness. Note that any names that occur here are purely imaginary. Any resemblance to real names and places is purely coincidental.

Prasad Kulkarni

I am a Data Scientist with 6+ years of experience.
