Compare, partial match and find if a team name in one CSV file exists in another CSV file

56 views Asked by At

I use Python. There are two CSV files, the team name column in the first CSV file “PES TEAM 2021 A” is the reference. By using the reference, compare and find team name in the second CSV file, find if the team names in the CSV file 2022 exists in the first reference file 2021.

PES TEAM 2021 A

TEAM year 2021 print screen

CSV file 2022 A

TEAM year 2022 print scree

I use the fuzzy string match to do the partial match. For example, team “Manchester United” in file “2022” matches “Manchester United FC” in file “2021”. And then label “TRUE” beside “Manchester United” in file “2022”. Another example is “Alanyaspor” in file “2022 matches “Aytemiz Alanyaspor” in file “2021”. And then label “TRUE” beside “Alanyaspor” in file “2022”. Another example is “Ajax Amsterdam” in file “2022 matches “AFC Ajax” in file “2021”. And then label “TRUE” beside “Ajax Amsterdam” in file “2022”.

However, when doing matching, my output is all true in the picture below:

incorrect output-all TURE

I tries this script below.

However, when doing matching, my output is all true in the picture below:

enter image description here

import pandas as pd
from fuzzywuzzy import fuzz

# Replace 'reference.csv' and 'second_file.csv' with your actual file names
 reference_file_path = r'C:\Users\pes TEAM DEF\PES TEAM 2021 A.csv'
 second_file_path = r'C:\Users\pes TEAM DEF\PES TEAM 2022 A.csv


  # Read the reference and second CSV files
 df_reference = pd.read_csv(reference_file_path)
 df_second = pd.read_csv(second_file_path)

    # Extract unique team names from the first CSV file

unique_team_names = set(df_reference['2021Name'].str.lower())

# Function to check if the similarity is more than the threshold
def is_similarity_above_threshold(name, reference_names, iterabl_team_names, threshold):
    return any(fuzz.partial_ratio(name, reference_names) > threshold for name in iterabl_team_names)
    
# Add a new column to the reference file indicating if there's a 70% partial match
df_second['PartialMatch'] = df_second['2022Name'].str.lower().apply(
    lambda x: is_similarity_above_threshold(x, unique_team_names, second_team_names, 70)
)

# Display or further process the DataFrame with highlighted duplicates
    #print(df_second[df_second['is_duplicate']])
output_path = "C:/Users/Downloads/pes TEAM DEF/output/abcd.csv" 
df_second.to_csv(output_path, index=False, encoding='utf-8-sig') #add utf-8-sig or japanese and latin cannot display normally  
0

There are 0 answers