Fuzzy search over millions of strings with custom distance function

138 views Asked by At

I have a large pool of short strings and a custom distance function on them (let's say Damerau–Levenshtein distance).

Q: What is the state-of-the-art solution for getting top N strings from the pool according to the custom distance?

I am looking for both a theoretical approach to this problem as well as coded implementation (Java, Python, etc).

1

There are 1 answers

0
MrSmith42 On

The straight forward approach is to iterate over all strings, calculate the distance for each and keep only the best N while you iterate.

If you need to do this task a lot, you should think if you can come up with a upper-bound / lower bound estimation for the costs that can be calculated much faster than your real cost function. E.g. pre-calculate all n-grams (e.g. 3-grams) for your strings. or maybe comparing the length difference can already give a lower bound for the distance. than you can skip the calculation of the distance for all strings which have a lower bound distance higher than your current distance of the n-th best match.