I have two sets of customer data, from the Shopify stores of two different companies. I have to compare them and mark customers according to whether they appear in one or both datasets. The problem is that the datasets lack a common key column. In addition, they contain dirty data (i.e. values with different spellings or spelling mistakes), so other identifiers that would normally be unique cannot be used as keys either.
To show an example:
Dataset 1
| ID | Fname | Sname | E-Mail | Company | City | Zip |
|---|---|---|---|---|---|---|
| 123456 | Jacky | Smith | [email protected] | Brewster | Birmingham | 12345 |
| 000000 | Richard | Smith | [email protected] | Mc Coffee | London | 12334 |
Dataset 2
| ID | Fname | Sname | E-Mail | Company | City | Zip |
|---|---|---|---|---|---|---|
| 101010 | Jacky | Smith | [email protected] | Brewster | Birmingham | 12345 |
| 010101 | Richard R. | Smith | [email protected] | McCoffee | London | 12334 |
As you can see, there are spelling differences/mistakes that prevent me from using “Name”, “E-Mail” and “Company” as keys.
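To illustrate the kind of comparison I mean, here is a minimal sketch using only Python's standard-library `difflib`. The helper names `normalize` and `similarity` are my own, and stripping everything except letters and digits is an assumption about what counts as "the same" value, not a rule taken from the data.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase the value and keep only letters and digits."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def similarity(a: str, b: str) -> float:
    """Return a ratio between 0 and 1; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Values taken from the example rows above.
print(similarity("Mc Coffee", "McCoffee"))   # identical once spaces and case are ignored
print(similarity("Richard", "Richard R."))   # close, but not identical
```

An exact string comparison treats both pairs as different, whereas a similarity ratio with a cutoff could treat them as the same value.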
I am looking for a way to use a threshold to decide whether a row is common to both sets - e.g. if 4 of 6 variables in a row match across both companies, mark the row as being in both. My own idea for doing this manually was a script that would:
- Read both dataset csv files
- Loop through the longer dataset
- For each instance, loop over the shorter dataset and compare for Mail and Company
- If one of these is found to be common in both, compare for the other variables too
- Count each common variable
- If count reaches above threshold, copy the row into a new dataset
- Export new dataset to a csv file
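The steps above could be sketched roughly as follows, using only the standard library (`csv` and `difflib`). This is a rough sketch under assumptions, not a finished solution: the column names in `FIELDS`, the per-field cutoff `FIELD_CUTOFF`, the helper names and the file names in the usage comment are all placeholders I made up, and "4 of 6" is read here as "at least 4".

```python
import csv
from difflib import SequenceMatcher

# Assumed column names; ID is excluded because it differs between the shops.
FIELDS = ["Fname", "Sname", "E-Mail", "Company", "City", "Zip"]
KEY_FIELDS = ["E-Mail", "Company"]  # compared first, as in the steps above
FIELD_CUTOFF = 0.85                 # assumed: minimum similarity for two values to count as equal
ROW_THRESHOLD = 4                   # at least 4 of the 6 fields must agree


def normalize(value):
    """Lowercase the value and keep only letters and digits."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def same(a, b):
    """True if two non-empty values are similar enough to count as a match."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return False  # missing values never count as a match
    return SequenceMatcher(None, a, b).ratio() >= FIELD_CUTOFF


def find_common(rows_a, rows_b):
    """Return the rows of the longer dataset that also appear in the shorter one."""
    longer, shorter = (rows_a, rows_b) if len(rows_a) >= len(rows_b) else (rows_b, rows_a)
    common = []
    for row in longer:
        for other in shorter:
            # Only look at the remaining fields if Mail or Company already agree.
            if not any(same(row[f], other[f]) for f in KEY_FIELDS):
                continue
            count = sum(same(row[f], other[f]) for f in FIELDS)
            if count >= ROW_THRESHOLD:
                common.append(row)
                break  # one match is enough for this row
    return common


def read_rows(path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def write_rows(path, rows, fieldnames):
    """Write the matched rows to a new CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Usage (file names are placeholders):
# matches = find_common(read_rows("dataset1.csv"), read_rows("dataset2.csv"))
# write_rows("common.csv", matches, ["ID"] + FIELDS)
```

The nested loop compares every row with every other row, so the run time grows with the product of the two dataset sizes. That is fine for small exports but would get slow for large ones.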
It looks simple enough to me, but I was wondering whether there is a module/library I haven't found yet that does something similar or could help in these cases.