How to delete or merge near-duplicate names using OpenRefine?

I have 250,000 rows of first and last names. The first and last names are in separate columns, but the spellings can be inconsistent, e.g.:

John Smith, John-Smith, John M. Smith, Jhon Smith

How do I identify these near-duplicates and remove or merge them using OpenRefine?

I tried sorting and then using blank down, but that only appears to work for exact-match duplicates.

There are 2 answers

0
b2m On

OpenRefine implements several clustering methods for identifying and merging near-duplicate values. Check out the OpenRefine documentation for details.

Each clustering method has its own strengths and weaknesses, so it is usually recommended to combine them and run several iterations.

In your case I would use the following workflow:

  1. Fingerprinting
  2. N-Gram Fingerprinting
  3. Maybe a phonetic algorithm (probably Metaphone), but with names it might produce mostly false positives.
  4. Levenshtein
  5. Levenshtein with smaller block size and bigger distance
  6. PPM
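
To get a feel for why several methods are combined, here is a rough Python sketch (outside OpenRefine) of the ideas behind fingerprinting, n-gram fingerprinting and Levenshtein distance. The function names are hypothetical and the normalization only approximates what OpenRefine does, so treat it as an illustration rather than OpenRefine's actual implementation. The phonetic and PPM methods are not covered here.

```python
import string
import unicodedata
from collections import defaultdict

# Translation table that deletes ASCII punctuation such as "-" and ".".
_PUNCT = str.maketrans("", "", string.punctuation)


def _normalize(value):
    """Lowercase, fold accented characters to ASCII, and drop punctuation."""
    folded = unicodedata.normalize("NFKD", value.lower())
    ascii_only = folded.encode("ascii", "ignore").decode("ascii")
    return ascii_only.translate(_PUNCT)


def fingerprint(value):
    """Key collision: sorted, de-duplicated whitespace-separated tokens."""
    return " ".join(sorted(set(_normalize(value).split())))


def ngram_fingerprint(value, n=2):
    """Key collision: sorted, de-duplicated character n-grams, ignoring spaces."""
    chars = "".join(_normalize(value).split())
    grams = {chars[i:i + n] for i in range(len(chars) - n + 1)}
    return "".join(sorted(grams))


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cluster_by_key(values, keyer):
    """Group values whose keys collide; keep groups with 2+ distinct values."""
    groups = defaultdict(set)
    for value in values:
        groups[keyer(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


names = ["John Smith", "John-Smith", "John M. Smith", "Jhon Smith"]
print(cluster_by_key(names, fingerprint))
print(cluster_by_key(names, ngram_fingerprint))
print(cluster_by_key(names, lambda v: ngram_fingerprint(v, 1)))
print(levenshtein("john smith", "jhon smith"))
```

On the four sample names, plain fingerprinting finds no cluster, 2-gram fingerprinting groups "John Smith" with "John-Smith", and 1-gram fingerprinting groups all four. The 1-gram key is very coarse, though: in this sketch "Tim Johns" gets the same key as "John Smith", which is the kind of false positive you have to review by hand. "Jhon Smith" is at edit distance 2 from "John Smith", so a distance-based pass can pick it up without resorting to 1-grams.
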
0
magdmartin On

Building on @b2m's answer, you can:

  1. Create a new column by concatenating the first and last name columns
  2. Run the clustering methods described by @b2m on the new column
  3. Once you have corrected all the spelling variations, you can use blank down to remove the duplicates. See the detailed walkthrough on how to remove duplicates.
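
For illustration only, the three steps above can be sketched in Python outside OpenRefine. The column names `first` and `last`, the `name_key` function and the "most frequent spelling wins" rule are all assumptions made for this sketch. In OpenRefine you review each cluster yourself and choose the merged value, which matters here because two different people can share a name.

```python
import re
from collections import Counter, defaultdict


def name_key(full_name):
    """Hypothetical clustering key: lowercase, keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", full_name.lower())


def merge_and_dedupe(rows, first_col="first", last_col="last"):
    # Step 1: build the concatenated "full name" value for every row.
    full_names = [f"{row[first_col]} {row[last_col]}".strip() for row in rows]

    # Step 2: cluster the full names by key and pick one spelling per cluster
    # (assumption: the most frequent spelling is the one to keep).
    clusters = defaultdict(list)
    for name in full_names:
        clusters[name_key(name)].append(name)
    canonical = {key: Counter(group).most_common(1)[0][0]
                 for key, group in clusters.items()}

    # Step 3: keep only the first row for each merged name, which has the
    # same effect as sorting, blanking down and removing the blank rows.
    seen = set()
    kept = []
    for row, name in zip(rows, full_names):
        merged = canonical[name_key(name)]
        if merged not in seen:
            seen.add(merged)
            kept.append({**row, "full_name": merged})
    return kept


rows = [
    {"first": "John", "last": "Smith"},
    {"first": "john", "last": "smith"},
    {"first": "John", "last": "Smith"},
    {"first": "Jane", "last": "Doe"},
]
for row in merge_and_dedupe(rows):
    print(row["full_name"])
```

This simple key only merges variants that differ in case, spacing or punctuation. Misspellings such as "Jhon Smith" would need one of the fuzzier methods from @b2m's list.
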