Comparing datasets with dirty data to find common ro

28 views Asked by Ra Friction At 13 March 2024 at 13:37

I have two sets of customer data, exported from the Shopify stores of two different companies. I have to compare them and mark each customer according to whether they appear in one or both datasets. The problem is that the datasets do not share a common key column. In addition, they contain dirty data (i.e. values with different spellings or spelling mistakes), so other would-be unique identifiers cannot be used as keys either.

To show an example:

Dataset 1

ID	Fname	Sname	Mail	Company	City	Zip
123456	Jacky	Smith	[email protected]	Brewster	Birmingham	12345
000000	Richard	Smith	[email protected]	Mc Coffee	London	12334

Dataset 2

ID	Fname	Sname	Mail	Company	City	Zip
101010	Jacky	Smith	[email protected]	Brewster	Birmingham	12345
010101	Richard R.	Smith	[email protected]	McCoffee	London	12334

As you can see, there are spelling differences and mistakes that prevent me from using the name columns (Fname, Sname), Mail or Company as keys.

I was wondering whether I can use a threshold to decide if a row is common to both sets, e.g. if 4 of 6 variables in a row match across both companies, mark the row as being in both. My own idea for doing it manually was a script that would:

Read both dataset CSV files
Loop through the longer dataset
For each row, loop over the shorter dataset and compare Mail and Company
If either of these matches, compare the other variables too
Count each matching variable
If the count reaches the threshold, copy the row into a new dataset
Export the new dataset to a CSV file
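
The steps above can be sketched roughly as follows. This is only a minimal sketch under some assumptions: the column names are taken from the example tables, the helper names (`normalize`, `count_common`, `find_common_rows`, `read_rows`, `write_rows`) are made up for illustration, and two values count as "common" when they are equal after lower-casing and stripping everything except letters and digits. That normalization catches "Mc Coffee" vs "McCoffee", but not real typos.

```python
import csv

# Columns compared for every candidate pair; names follow the example tables.
COMPARE_COLUMNS = ["Fname", "Sname", "Mail", "Company", "City", "Zip"]
# Columns used as the first, cheap filter before the full comparison.
ANCHOR_COLUMNS = ["Mail", "Company"]


def normalize(value):
    """Lower-case a value and keep only letters and digits."""
    return "".join(ch for ch in (value or "").lower() if ch.isalnum())


def count_common(row_a, row_b, columns=COMPARE_COLUMNS):
    """Count the columns whose normalized, non-empty values are equal."""
    count = 0
    for col in columns:
        a = normalize(row_a.get(col))
        b = normalize(row_b.get(col))
        if a and a == b:
            count += 1
    return count


def find_common_rows(rows_a, rows_b, threshold=4):
    """Return rows of the longer dataset that match a row of the shorter one."""
    if len(rows_a) >= len(rows_b):
        longer, shorter = rows_a, rows_b
    else:
        longer, shorter = rows_b, rows_a
    common = []
    for row in longer:
        for other in shorter:
            # Skip pairs where neither Mail nor Company agrees.
            if count_common(row, other, ANCHOR_COLUMNS) == 0:
                continue
            if count_common(row, other) >= threshold:
                common.append(row)
                break  # one match is enough to mark the row as common
    return common


def read_rows(path, delimiter=","):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


def write_rows(path, rows, delimiter=","):
    """Write a list of dicts to a CSV file; does nothing for an empty list."""
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()),
                                delimiter=delimiter)
        writer.writeheader()
        writer.writerows(rows)
```

With two files it would be called along the lines of `write_rows("common.csv", find_common_rows(read_rows("a.csv"), read_rows("b.csv")))`, where the file names are placeholders. The nested loop compares every pair of rows, so the run time grows with the product of the two dataset sizes.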

It looks simple enough to me, but I was wondering if there is a module/library I haven't found yet that does something similar or could help in cases like this.
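
For the spelling differences specifically, one candidate that ships with Python is the standard-library `difflib` module, whose `SequenceMatcher.ratio()` returns a similarity score between 0 and 1. The snippet below is only a sketch of how a per-field comparison could use it. The helper names `similarity` and `is_fuzzy_match` and the cutoff of 0.8 are assumptions for illustration, not tuned values.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the cleaned strings are identical."""
    # Lower-case and collapse runs of whitespace before comparing.
    a_clean = " ".join(a.lower().split())
    b_clean = " ".join(b.lower().split())
    return SequenceMatcher(None, a_clean, b_clean).ratio()


def is_fuzzy_match(a, b, cutoff=0.8):
    """Treat two field values as common when their similarity reaches the cutoff."""
    return similarity(a, b) >= cutoff
```

A check such as `is_fuzzy_match("Mc Coffee", "McCoffee")` could then replace a strict equality test when counting common variables. The cutoff would need to be checked against real data, since short strings can score high by accident.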

There are 0 answers