In information systems, one of the biggest challenges organizations face is data quality. Unclean, messy, and missing data are common headaches across the board. Furthermore, large organizations have multiple sources of data, which adds the problem of duplicate records. And this is not simple duplication, i.e. an exact match between records. For instance, consider two rows from different data stores:
Name | Surname | City | Zip |
Anand | Kulkarni | Bangalore | 560093 |
Anand | kulkarni | Bengaluru | 560093 |
Intuitively, we know that both records probably refer to the same person. However, we cannot dedupe them with a simple exact match. This brings us to the domain of fuzzy matching using distance measures. For instance, the edit distance between the surname fields of the two records is 1 unit, i.e. the letter K, which differs only in case. Hence, the lower the distance, the better the match. There are libraries in Python that can achieve this. Read this article to know more.
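To make the idea of a distance measure concrete, here is a minimal sketch of the Levenshtein (edit) distance applied field by field to the two example rows. The function name levenshtein and the dictionaries record_a and record_b are our own illustrative choices, not part of any library mentioned in this article.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # At the start of each outer iteration, previous[j] is the distance
    # between the prefix of a processed so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


record_a = {"Name": "Anand", "Surname": "Kulkarni", "City": "Bangalore", "Zip": "560093"}
record_b = {"Name": "Anand", "Surname": "kulkarni", "City": "Bengaluru", "Zip": "560093"}

# Per-field distances: 0 is an exact match, small values are near matches
field_distances = {
    field: levenshtein(record_a[field], record_b[field]) for field in record_a
}
print(field_distances)
# {'Name': 0, 'Surname': 1, 'City': 3, 'Zip': 0}
```

A rule-based dedupe would then need a threshold per field (for example, how large a City distance is still acceptable), which is exactly where hand-made rules start to multiply.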
However, this method has a drawback. As the number of fields grows, the number of cases that hand-made rules must cover grows with it and quickly becomes difficult to manage. So what’s the recourse when hand-made rules become expensive? The answer lies in Machine Learning.
Active Learning for dedupe
Machine Learning is commonly classified into Supervised and Unsupervised Learning. To recap quickly, supervised techniques need labeled data so that the model can learn from the ground truth, whereas unsupervised techniques do not require labels. Although much easier to interpret and deploy, Supervised Learning techniques need quality labels, which are expensive to obtain. Unsupervised techniques, on the other hand, don’t need labels but are harder to interpret and deploy.
In the middle comes Semi-Supervised Learning. Since labeling data is a costly affair, semi-supervised techniques use a small fraction of labeled data to generalize to the entire dataset. One class of such techniques is Active Learning, which interactively asks expert users to label the examples it selects. We won’t delve deep into Active Learning in this article, since it is a huge area of study, but we encourage readers to explore it further. This is more of a hands-on tutorial.
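As a rough illustration of how an active learner decides what to ask, the sketch below uses uncertainty sampling: it queries the pair whose predicted match probability is closest to 0.5. This is one common strategy, shown here under our own assumptions; it is not a description of what the Dedupe library does internally. The functions most_uncertain and toy_match_probability and the sample pairs are hypothetical.

```python
def most_uncertain(pairs, match_probability):
    """Pick the unlabeled pair whose predicted match probability is closest to 0.5."""
    return min(pairs, key=lambda pair: abs(match_probability(pair) - 0.5))


def toy_match_probability(pair):
    """Hypothetical scorer: share of fields that agree, ignoring case."""
    left, right = pair
    agreeing = sum(a.lower() == b.lower() for a, b in zip(left, right))
    return agreeing / len(left)


unlabeled_pairs = [
    # Every field agrees: the model is already confident this is a match
    (("Anand", "Kulkarni", "560093"), ("Anand", "kulkarni", "560093")),
    # No field agrees: the model is already confident this is not a match
    (("Anand", "Kulkarni", "560093"), ("Riley", "Morcom", "3156")),
    # Half of the fields agree: the most informative pair to show an expert
    (("Riley", "Siviur", "nsw", "3186"), ("Riley", "Morcom", "nsw", "3156")),
]

query = most_uncertain(unlabeled_pairs, toy_match_probability)
print("Ask the expert about:", query)
```

In a real loop, the expert's answer would be added to the labeled set, the model retrained, and the next most uncertain pair selected.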
Python Dedupe Library
Implementing deduplication using ML/Active Learning is not trivial. Fortunately, there are libraries that implement it. One of them is the Python Dedupe library. For the convenience of data scientists, there is also a pandas version of the library called pandas_dedupe.
For this experiment, we will use the febrl dataset that ships with the recordlinkage library. Let’s get started.
1. Installation of libraries including pandas_dedupe.
Let’s install the following libraries:
- pandas_dedupe
- recordlinkage
pip install recordlinkage
pip install pandas_dedupe
2. Read febrl (Freely Extensible Biomedical Record Linkage) data.
This is a built-in dataset of the recordlinkage library. Let’s read the febrl data into a dataframe.
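from recordlinkage.datasets import load_febrl1
# Load the first febrl dataset into a pandas DataFrame and preview it
df_febrl = load_febrl1()
df_febrl.head()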
3. Train the dedupe model using Active Learning
The dedupe_dataframe method from pandas_dedupe performs active learning, and it takes the following mandatory parameters.
- Dataframe object.
- Columns of the Dataframe object passed.
Additionally, it takes optional parameters such as:
- canonicalize: To add canonical (standardized) versions of the fields to the output.
- sample_size: The sample size used for training, given as a fraction between 0 and 1. The default value is 0.3, i.e. 30%.
- update_model: To update an existing model.
You can read more about these parameters here.
Here is the training code:
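import pandas_dedupe
# Use every column as a matching field and sample the full dataset for training
df_febrl_dedup = pandas_dedupe.dedupe_dataframe(df_febrl, list(df_febrl.columns), canonicalize=True, sample_size=1)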
Once you run the code, a prompt will appear that will ask for labels. It looks like this:
Here, we can clearly see that these two records are the same. Hence, we type ‘y’ at the prompt to mark them as duplicates. Now, let’s look at an example of a non-matching pair of records:
given_name : riley
surname : siviur
street_number : 15
address_1 : tubb plqace
address_2 : roxor
suburb : None
postcode : 3186
state : nsw
date_of_birth : 19080709
soc_sec_id : 4038966

given_name : riley
surname : morcom
street_number : 48
address_1 : newbery crescent
address_2 : darjeeling
suburb : grass valley
postcode : 3156
state : nsw
date_of_birth : None
soc_sec_id : 5460376

12/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
Lastly, after labeling enough examples (how many is a subjective call), you can stop training by typing ‘f’ for finished.
4. Results
Once the training is finished, let’s get the cluster id and confidence from the trained results:
df_febrl_dedup_final = df_febrl_dedup[['given_name', 'surname', 'street_number', 'address_1', 'address_2', 'suburb', 'postcode', 'state', 'date_of_birth', 'soc_sec_id', 'cluster id', 'confidence']]
df_febrl_dedup_final.sort_values(['cluster id']).head(50)
Following are the results:
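To act on these cluster assignments, one possible next step is to list the clusters that contain more than one record and keep a single representative per cluster. The sketch below uses a small made-up DataFrame, toy_result, as a stand-in for df_febrl_dedup_final. Only the column names 'cluster id' and 'confidence' come from the code above; the values are invented, and keeping the highest-confidence row is just one reasonable rule, not a recommendation from the library.

```python
import pandas as pd

# Hypothetical stand-in for df_febrl_dedup_final with made-up values
toy_result = pd.DataFrame(
    {
        "given_name": ["riley", "rily", "anna", "joshua"],
        "surname": ["siviur", "siviur", "white", "clarke"],
        "cluster id": [0, 0, 1, 2],
        "confidence": [0.97, 0.91, 1.0, 1.0],
    }
)

# Number of records that were grouped into each cluster
cluster_sizes = toy_result.groupby("cluster id").size()

# Clusters holding more than one record are the suspected duplicates
duplicate_clusters = cluster_sizes[cluster_sizes > 1].index.tolist()

# Keep the highest-confidence record from each cluster
deduplicated = (
    toy_result.sort_values("confidence", ascending=False)
    .drop_duplicates(subset="cluster id", keep="first")
    .sort_values("cluster id")
)

print(duplicate_clusters)
print(deduplicated)
```

On the real output, it is worth reviewing low-confidence clusters by hand before dropping any rows.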
Conclusion
Please note that this article is for informational purposes only. We make no guarantees regarding the completeness or accuracy of the content. We encourage you to perform this exercise and see the results for yourself!