I am using fuzzywuzzy and rapidfuzz to find names mentioned in comments. I read through the documentation of the "token_set_ratio" function but I still don't understand the following:
from rapidfuzz import fuzz
# I preprocessed the comments to remove stop words and other commonly mentioned words
fuzz.token_set_ratio("reporting michael anders sven straumann guy called jonatjan smith partners","jonathan smith")
# returns 52.6
Jonathan Smith has only one spelling mistake, so why is the ratio so low?
Moreover, would there be an option to overcome the problem so that Jonathan receives a higher score?
Thanks for your help, Michael
fuzz.token_set_ratio is not really the right ratio for your problem, since it sorts the words, while you want to keep the first name and surname paired in their original order. You could use fuzz.partial_ratio instead, which compares the shorter string only against the best-matching substring of the longer string.
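To make the "best matching substring" idea concrete, here is a minimal standard-library sketch. It is only an illustration of the principle, not rapidfuzz's actual implementation: best_window_score is a hypothetical helper, and difflib.SequenceMatcher stands in for the library's scorer, so the numbers will not necessarily match what fuzz.partial_ratio(comment, name) reports.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity of two strings on a 0-100 scale (difflib-based stand-in)."""
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()


def best_window_score(text: str, query: str) -> float:
    """Hypothetical helper: score `query` against every window of
    len(query) characters in `text` and keep the best result."""
    n = len(query)
    if n == 0 or len(text) <= n:
        # Nothing to slide over, so compare the strings directly.
        return similarity(text, query)
    return max(similarity(text[i:i + n], query)
               for i in range(len(text) - n + 1))


comment = ("reporting michael anders sven straumann guy called "
           "jonatjan smith partners")
name = "jonathan smith"

whole = similarity(comment, name)          # name vs. the entire comment
window = best_window_score(comment, name)  # name vs. its best-matching window

print(f"whole string: {whole:.1f}, best window: {window:.1f}")
```

Comparing the name against the whole comment is penalised for all the unrelated words, while the sliding window isolates "jonatjan smith" and only pays for the single wrong letter. With rapidfuzz itself the equivalent is the one-line call fuzz.partial_ratio(comment, name).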