Performing a join between dataframes with fuzzy matching without iterrows?


I have looked around a bit, but have not found a similar question, forgive me if I missed something.

Using pandas, I am trying to write a function to merge two dataframes: df_ref and df_to_merge. df_ref is typically very large (500K rows, 50 columns) but is the same for every call. df_to_merge is typically much smaller, its columns are a small subset of df_ref's columns, and it varies from call to call.

I want to join both dataframes on some columns that have to match exactly (listed in col_exact), and also on some columns where I allow fuzzy matching (listed in col_approx). My approach is, for each row to merge, to compute a similarity score against rows of df_ref and keep the row with the highest score. This similarity score is computed column by column, and each per-column score has to be higher than some threshold, otherwise the candidate row is discarded. The comparison method for approximate columns varies with the column name, but always returns a number between 0 (not similar) and 1 (identical inputs), and is in general some form of edit distance or a simple distance between numbers.
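To make that concrete, here is the kind of per-column comparison method I mean (a minimal sketch with made-up function names; difflib stands in here for an edit-distance library):

```python
from difflib import SequenceMatcher

# Illustrative comparison methods; each returns a similarity in [0, 1],
# where 1 means identical inputs.

def number_similarity(x, y):
    """Relative closeness of two positive numbers."""
    return 1 - abs(x - y) / (x + y)

def string_similarity(a, b):
    """Edit-distance-style similarity; difflib stands in for Levenshtein here."""
    return SequenceMatcher(None, a, b).ratio()
```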

For now I am using df.iterrows() on both dataframes, which is extremely slow, and I wanted to know if there is a faster solution. Here is my current code. I try to filter columns and throw out mismatched rows as early as possible, but it still runs quite slowly.

def merge_dfs(df_to_match, df_ref, col_exact, col_approx):
    """Merge both dataframes. col_exact lists columns that must match exactly,
    col_approx columns that may match approximately."""
    matches_found = 0

    for index, row in df_to_match.iterrows():
        best_matches = []

        # Filter df_ref on the columns that have to match exactly
        filtered_df = df_ref.copy()
        for column_name in col_exact:
            filtered_df = filtered_df[filtered_df[column_name] == row[column_name]]

        # Filter df_ref on the columns that have to match approximately
        for index2, row2 in filtered_df.iterrows():
            # For each approximate column, check whether we pass the threshold
            similarity_scores_row = {}
            for column_name in col_approx:
                check_similarity = comparison_methods[column_name]
                similarity_score = check_similarity(row[column_name], row2[column_name])

                if similarity_score >= similarity_thresholds[column_name]:
                    similarity_scores_row[column_name] = similarity_score
                else:
                    break
            # If every approximate column passed the threshold test, keep this candidate
            if len(similarity_scores_row) == len(col_approx):
                overall_score = sum(similarity_scores_row.values()) / len(col_approx)
                best_matches.append((row2, overall_score))  # Keep this row, with its score
                if overall_score == 1:  # We found an optimal match
                    break
        # Pick the best row among all candidates found
        if best_matches:
            matches_found += 1
            best_match = max(best_matches, key=lambda t: t[1])[0]
            for col_name in df_ref.columns:
                if col_name not in col_approx:
                    # Update the row with the info from the best match found
                    df_to_match.at[index, col_name] = best_match[col_name]

    return df_to_match

This is working as intended, but I feel like this is not how I should be using pandas, and there is probably a faster way to do it.
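For reference, the direction I was considering is to vectorize at least the inner loop, scoring all candidate rows of filtered_df at once with whole-column operations instead of iterrows(). This is only a sketch with my own illustrative names (and difflib standing in for Levenshtein), not something I have benchmarked:

```python
import pandas as pd
from difflib import SequenceMatcher

# Vectorized comparison methods: each takes a scalar and a whole Series and
# returns a Series of similarities in [0, 1].
vec_comparison_methods = {
    'A': lambda x, s: 1 - (x - s).abs() / (x + s),
    'B': lambda x, s: s.map(lambda y: SequenceMatcher(None, x, y).ratio()),
}
vec_thresholds = {'A': 0.7, 'B': 0.6}

def best_match_index(row, candidates, col_approx):
    """Index label of the best-scoring candidate row, or None if none passes."""
    passes = pd.Series(True, index=candidates.index)
    total = pd.Series(0.0, index=candidates.index)
    for col in col_approx:
        scores = vec_comparison_methods[col](row[col], candidates[col])
        passes &= scores >= vec_thresholds[col]
        total += scores
    total = total[passes]
    return None if total.empty else total.idxmax()
```

With something like this, the outer loop over df_to_match remains, but the inner iterrows() over filtered_df disappears.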

EDIT:

I am not sure I can provide my actual data, but here is a dummy example, where col_exact is empty for simplicity's sake.

Example dataframes:

df_ref = pd.DataFrame({'A': [1, 2, 3, 4], 'B': ['foo', 'bar', 'baz', 'qux'], 'C': [100, 256, 32, 44]})
df_to_merge = pd.DataFrame({'Name': ['Paul', 'John', 'Ringo', 'Georges'], 'A': [0.7, 1.6, 4.2, 2], 'B': ['floo', 'baz', 'qux_', 'baar']})

They look like this:

   A    B    C
0  1  foo  100
1  2  bar  256
2  3  baz   32
3  4  qux   44
      Name    A     B
0     Paul  0.7  floo
1     John  1.6   baz
2    Ringo  4.2  qux_
3  Georges  2.0  baar

And then (each comparison method returns a similarity in [0, 1] as described above, so the raw Levenshtein distance is normalized by the longer string's length):

from Levenshtein import distance

comparison_methods = {
    'A': lambda x, y: 1 - abs(x - y) / (x + y),
    'B': lambda x, y: 1 - distance(x, y) / max(len(x), len(y)),
}
similarity_thresholds = {'A': 0.7, 'B': 0.6}
col_exact = []
col_approx = ['A', 'B']

merged_df = merge_dfs(df_to_merge, df_ref, col_exact, col_approx)

And the result is:

      Name    A     B      C
0     Paul  0.7  floo  100.0
1     John  1.6   baz  256.0
2    Ringo  4.2  qux_   44.0
3  Georges  2.0  baar  256.0

Thank you for your answers!
