Python Dedupe Library: Machine Learning to De-Duplicate Data


In information systems, one of the biggest challenges organizations face is data quality. Unclean, messy, and missing data is a common headache across the board. Large organizations also draw on multiple data sources, which adds the problem of duplicate records. And this is rarely simple duplication, i.e. an exact match between records. For instance, consider two rows from different data stores:

Name Surname City Zip
Anand Kulkarni Bangalore 560093
Anand kulkarni Bengaluru 560093

Intuitively, we know that both records probably refer to the same person. However, we cannot dedupe them with an exact match. This brings us to fuzzy matching using distance measures. For instance, the edit distance between the surname fields of the two rows is 1 unit, i.e. the case of the letter K. The lower the distance, the better the match. There are Python libraries that can compute this. Read this article to know more.
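To make the distance idea concrete, here is a minimal sketch of the Levenshtein (edit) distance in plain Python, applied to the two rows above. The function name `levenshtein` is our own choice for illustration; in practice you would more likely use an existing library than hand-roll this.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Kulkarni", "kulkarni"))    # 1
print(levenshtein("Bangalore", "Bengaluru"))  # 3
```

Lower-casing both strings before comparing would bring the surname distance down to 0, which is one example of the many small decisions a rule-based approach has to make.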

However, this method has a drawback. As the number of fields grows, the number of cases to handle by hand quickly becomes unmanageable. So what’s the recourse when hand-made rules become expensive? The answer lies in Machine Learning.
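To see why hand-made rules get expensive, consider a rough sketch of a rule-based matcher for the two-row example above. It uses `difflib` from the Python standard library. The per-field thresholds and the helper names `field_similarity` and `is_duplicate` are made up for illustration; every new field would need its own hand-tuned threshold.

```python
from difflib import SequenceMatcher

# Hand-tuned, made-up thresholds: one per field, each chosen by trial and error.
THRESHOLDS = {"name": 0.9, "surname": 0.9, "city": 0.6, "zip": 1.0}


def field_similarity(a, b):
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_duplicate(rec_a, rec_b, thresholds=THRESHOLDS):
    """Call two records duplicates only if every field clears its threshold."""
    return all(
        field_similarity(rec_a[field], rec_b[field]) >= minimum
        for field, minimum in thresholds.items()
    )


rec_a = {"name": "Anand", "surname": "Kulkarni", "city": "Bangalore", "zip": "560093"}
rec_b = {"name": "Anand", "surname": "kulkarni", "city": "Bengaluru", "zip": "560093"}
print(is_duplicate(rec_a, rec_b))
```

With four fields this is manageable. With dozens of fields, missing values, and different error patterns per source, tuning such thresholds by hand stops scaling.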

Active Learning for dedupe

Machine Learning is commonly classified into Supervised and Unsupervised Learning. As a quick recap, supervised techniques need labeled data so that the model can learn from the ground truth, whereas unsupervised techniques do not require labels. Although much easier to interpret and deploy, supervised techniques need quality labels, which are expensive to produce. Unsupervised techniques don’t need labels, but they are harder to interpret and deploy.

In the middle comes Semi-Supervised Learning. Since labeling data is costly, semi-supervised techniques use a small fraction of labeled data to generalize to the entire dataset. A closely related approach is Active Learning, in which the model interactively asks expert users to label the examples it considers most informative. We won’t delve deep into Active Learning in this article, since it is a huge area of study, but we encourage readers to explore it further. This is more of a hands-on tutorial.
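One common way an active learner decides what to ask next is uncertainty sampling: it queries the example the current model is least sure about. The snippet below is a minimal sketch of that idea only. The pair identifiers and scores are made-up numbers, `most_uncertain_pair` is a hypothetical helper, and this is not necessarily how any particular library chooses its questions.

```python
def most_uncertain_pair(match_probabilities):
    """Return the candidate pair whose predicted match probability
    is closest to 0.5, i.e. the one the model is least sure about."""
    return min(
        match_probabilities,
        key=lambda pair: abs(match_probabilities[pair] - 0.5),
    )


# Hypothetical model scores for three candidate record pairs.
scores = {
    ("rec-1", "rec-2"): 0.97,  # almost certainly a match
    ("rec-3", "rec-4"): 0.52,  # the model cannot decide
    ("rec-5", "rec-6"): 0.08,  # almost certainly distinct
}

next_question = most_uncertain_pair(scores)
print(next_question)  # ('rec-3', 'rec-4')
```

Labeling the undecided pair teaches the model more than confirming a pair it already scores near 0 or 1, which is why a handful of well-chosen labels can go a long way.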

Python Dedupe Library

Implementing deduplication with ML/Active Learning from scratch is not trivial. Fortunately, there are libraries that do it for us. One of them is the Python Dedupe library. For the convenience of Data Scientists, there is also a pandas wrapper around the library called pandas_dedupe.

For this experiment, we will use the febrl dataset that ships with the recordlinkage library. Let’s get started.

1. Installation of libraries including pandas_dedupe.

Let’s install the following libraries:

  • pandas_dedupe
  • recordlinkage
pip install recordlinkage
pip install pandas_dedupe

2. Read the febrl (Freely Extensible Biomedical Record Linkage) data.

This is a built-in dataset of the recordlinkage library. Let’s read the febrl data into a dataframe.

from recordlinkage.datasets import load_febrl1

df_febrl = load_febrl1()
df_febrl.head()

3. Train the dedupe model using Active Learning

The dedupe_dataframe function from pandas_dedupe performs the active learning. It takes the following mandatory parameters:

  • The Dataframe object to deduplicate.
  • The list of columns in that Dataframe to use for matching.

Additionally, it takes optional parameters such as:

  • canonicalize: Whether to add a canonical (standardized) representation of the fields for each cluster.
  • sample_size: The fraction of the data sampled for training, between 0 and 1. The default is 0.3, i.e. 30%.
  • update_model: Whether to update an existing model.

You can read more about these parameters here.

Here is the training code:

import pandas_dedupe
df_febrl_dedup = pandas_dedupe.dedupe_dataframe(df_febrl, list(df_febrl.columns), canonicalize=True, sample_size=1)

Once you run the code, a prompt will appear that will ask for labels. It looks like this:

Here, we can clearly see that these two records are the same. Hence, we type ‘y’ in the text box to mark them as duplicates. Now, let’s look at an example of a non-matching pair:

given_name : riley
surname : siviur
street_number : 15
address_1 : tubb plqace
address_2 : roxor
suburb : None
postcode : 3186
state : nsw
date_of_birth : 19080709
soc_sec_id : 4038966

given_name : riley
surname : morcom
street_number : 48
address_1 : newbery crescent
address_2 : darjeeling
suburb : grass valley
postcode : 3156
state : nsw
date_of_birth : None
soc_sec_id : 5460376

12/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious

Lastly, once you have labeled enough examples (how many is subjective), you can stop training by typing ‘f’.

4. Results

Once the training is finished, let’s get the cluster id and confidence from the trained results:

df_febrl_dedup_final = df_febrl_dedup[[
    'given_name', 'surname', 'street_number', 'address_1', 'address_2',
    'suburb', 'postcode', 'state', 'date_of_birth', 'soc_sec_id',
    'cluster id', 'confidence',
]]

df_febrl_dedup_final.sort_values(['cluster id']).head(50)

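The output lists each record along with its cluster id and confidence score. Records that share a cluster id are the ones the model considers duplicates of each other.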
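A natural next step is to keep a single record per cluster. The sketch below shows one possible way to do that with pandas: keep the row with the highest confidence in each cluster. The toy rows are made up to mimic the 'cluster id' and 'confidence' columns used above and are not real febrl results; `keep_best_per_cluster` is a hypothetical helper, not part of pandas_dedupe.

```python
import pandas as pd

# Made-up rows that mimic the output columns; not actual results.
toy = pd.DataFrame({
    "given_name": ["riley", "rilye", "anna", "anna", "tom"],
    "cluster id": [0, 0, 1, 1, 2],
    "confidence": [0.95, 0.91, 0.88, 0.97, 1.0],
})


def keep_best_per_cluster(df, cluster_col="cluster id", score_col="confidence"):
    """Keep only the highest-confidence row from each cluster."""
    ranked = df.sort_values(score_col, ascending=False)
    # drop_duplicates keeps the first (highest-scoring) row of each cluster.
    return ranked.drop_duplicates(subset=cluster_col).sort_values(cluster_col)


deduped = keep_best_per_cluster(toy)
print(deduped)
```

Whether the highest-confidence row is the right one to keep depends on your data. In some cases merging fields across the cluster, or reviewing low-confidence clusters by hand, is the safer choice.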

Conclusion

Please note that this article is for informational purposes only. We make no guarantees regarding the completeness or accuracy of the content. We encourage you to perform this exercise and see the results for yourself!




I am a Data Scientist with 6+ years of experience.

