Why is the token set ratio so low using fuzzywuzzy?

Question

Asked by Michael Altorfer on 08 October 2020 at 09:50

I am using fuzzywuzzy and rapidfuzz to find names mentioned in comments. I read through the documentation of the "token_set_ratio" function but I still don't understand the following:

# The comments were preprocessed to remove stop words and other commonly mentioned words
from rapidfuzz import fuzz

fuzz.token_set_ratio("reporting michael anders sven straumann guy called jonatjan smith partners", "jonathan smith")
# returns 52.6

"Jonathan Smith" appears in the comment with only one spelling mistake, so why is the ratio so low?

Also, is there a way to work around this so that "Jonathan Smith" receives a higher score?

Thanks for your help, Michael

There is 1 answer

**maxbachmann** · Answer 1 · 2020-10-09T07:36:58+00:00

fuzz.token_set_ratio is not really the right ratio for your problem: it splits both strings into words and sorts them, whereas you want to keep the first name and last name paired together. You could use fuzz.partial_ratio instead, which compares the shorter string only against the best-matching substring of the longer string.

from rapidfuzz import fuzz

fuzz.partial_ratio(
  "reporting michael anders sven straumann guy called jonatjan smith partners",
  "jonathan smith")
# returns 92.85714285714286
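To see where the two numbers come from, here is a minimal pure-Python sketch of both ideas. It is not the fuzzywuzzy or rapidfuzz implementation: the names indel_ratio, token_set_sketch and partial_sketch are made up for illustration, and the sketch assumes the score is 2 * LCS / (len(a) + len(b)) and that the inputs are already lowercased. Under those assumptions, the token set comparison only shares the word "smith", so the best it can do is "smith" against "smith jonathan" (10/19), while the sliding window finds "jonatjan smith" (13 of 14 characters match).

```python
def indel_ratio(a: str, b: str) -> float:
    """Similarity in [0, 100], computed as 2 * LCS(a, b) / (len(a) + len(b))."""
    if not a and not b:
        return 100.0
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return 200.0 * prev[-1] / (len(a) + len(b))


def token_set_sketch(s1: str, s2: str) -> float:
    """Compare the shared words against each string's sorted leftover words."""
    t1, t2 = set(s1.split()), set(s2.split())
    common = " ".join(sorted(t1 & t2))
    combined1 = (common + " " + " ".join(sorted(t1 - t2))).strip()
    combined2 = (common + " " + " ".join(sorted(t2 - t1))).strip()
    return max(
        indel_ratio(common, combined1),
        indel_ratio(common, combined2),
        indel_ratio(combined1, combined2),
    )


def partial_sketch(s1: str, s2: str) -> float:
    """Slide a window of the shorter string's length over the longer string."""
    shorter, longer = sorted((s1, s2), key=len)
    n = len(shorter)
    return max(
        indel_ratio(shorter, longer[i:i + n])
        for i in range(len(longer) - n + 1)
    )


comment = ("reporting michael anders sven straumann guy called "
           "jonatjan smith partners")
name = "jonathan smith"

print(token_set_sketch(comment, name))  # about 52.6 (10/19)
print(partial_sketch(comment, name))    # about 92.86 (13/14)
```

Because "jonatjan" and "jonathan" are different tokens, the set intersection never sees them as a match, which is why the misspelling hurts the token-based score much more than the substring-based one.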
