How to delete or merge near-duplicate names using OpenRefine?

Question

123 views Asked by Ahmed Rehman At 19 April 2023 at 03:23

I have 250,000 rows of first and last names. The first and last names are in separate columns, but they can be inconsistent, e.g.:

- John Smith
- John-Smith
- John M. Smith
- Jhon Smith

How do I identify these near-duplicates and remove/merge them using OpenRefine?

I tried sorting and then using blank down, but it only appears to work for exact-match duplicates.

There are 2 answers

**b2m** · Answer 1 · 2023-04-20T05:47:29+00:00

OpenRefine implements several clustering methods for identifying and merging near-duplicate values. Check out the OpenRefine documentation for details.

Each clustering method has its own strengths and weaknesses, so it is usually recommended to combine them and run several iterations.

In your case I would use the following workflow:

1. Fingerprinting
2. N-Gram Fingerprinting
3. Maybe a phonetic algorithm (probably Metaphone), but it might produce mainly false positives with names.
4. Levenshtein
5. Levenshtein again, with a smaller block size and a larger distance
6. PPM
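To see why the order of these steps matters, here is a rough Python sketch of the ideas behind the first few methods. It is not OpenRefine's own code: the function names (`fingerprint`, `ngram_fingerprint`, `levenshtein`, `cluster`) are made up for illustration, and the normalization is a simplified approximation of what OpenRefine does.

```python
import re
import unicodedata
from collections import defaultdict


def _fold(value: str) -> str:
    """Lowercase and drop accents (a simplified stand-in for ASCII folding)."""
    decomposed = unicodedata.normalize("NFKD", value)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()


def fingerprint(value: str) -> str:
    """Key collision: remove punctuation, then sort and de-duplicate the tokens."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", _fold(value))
    return " ".join(sorted(set(cleaned.split())))


def ngram_fingerprint(value: str, n: int = 2) -> str:
    """Key collision on character n-grams; whitespace is removed as well."""
    compact = re.sub(r"[^a-z0-9]", "", _fold(value))
    grams = {compact[i:i + n] for i in range(len(compact) - n + 1)}
    return "".join(sorted(grams))


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def cluster(values, keyer):
    """Group values that share a key; keep only groups with more than one member."""
    groups = defaultdict(list)
    for value in values:
        groups[keyer(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


names = ["John Smith", "John-Smith", "John M. Smith", "Jhon Smith"]
print(cluster(names, fingerprint))
print(cluster(names, ngram_fingerprint))
print(cluster(names, lambda v: ngram_fingerprint(v, n=1)))
print(levenshtein("John Smith", "Jhon Smith"))
```

In this sketch, plain fingerprinting finds no cluster among the four example names, 2-gram fingerprinting groups "John Smith" with "John-Smith", and 1-gram fingerprinting groups all four (which also hints at how easily it can over-merge). "Jhon Smith" is an edit distance of 2 away from "John Smith", which is the kind of typo the Levenshtein steps are meant to catch.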

**magdmartin** · Answer 2 · 2023-04-24T12:55:12+00:00

Building on @b2m's answer, you can:

1. Create a new column by concatenating your first and last name.
2. Use the clustering function described by @b2m on the new column.
3. Once you have corrected all the spelling variations, you can use blank down to remove duplicates. See the detailed walkthrough on how to remove duplicates.
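Outside OpenRefine, the same three steps can be sketched in a few lines of Python. This is only an illustration under assumptions: the column names `first` and `last`, the `corrections` mapping (standing in for the merges you would pick in the clustering dialog), and the helper names are all hypothetical.

```python
rows = [
    {"first": "John", "last": "Smith"},
    {"first": "Jhon", "last": "Smith"},
    {"first": "John", "last": "Smith"},
    {"first": "Jane", "last": "Doe"},
]

# Hypothetical outcome of the clustering step: variant -> value chosen to keep.
corrections = {"Jhon Smith": "John Smith"}


def full_name(row):
    """Step 1: concatenate first and last name into one value."""
    return f"{row['first'].strip()} {row['last'].strip()}"


def dedupe(rows, corrections):
    """Steps 2 and 3: apply the merges, then keep the first row per name."""
    seen = set()
    kept = []
    for row in rows:
        name = full_name(row)
        name = corrections.get(name, name)
        if name in seen:
            continue  # duplicate after correction, drop it
        seen.add(name)
        kept.append({**row, "full_name": name})
    return kept


for row in dedupe(rows, corrections):
    print(row["full_name"])
```

Unlike blank down, which only blanks consecutive repeats and therefore relies on sorting first, this sketch tracks already-seen names in a set, so the input order does not matter.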