Full text document similarity search


I have a big database of articles, and before adding new items to the DB I'd like to check whether similar items already exist. If so, I want to group them together so that I can later easily display them as a group of similar items.

Currently we use PHP's similar_text() function, which is very simple but surprisingly precise and fully satisfies our needs. The problem is that before adding an item to the DB, we first need to pull X items from the DB and loop through every single one to check whether our new item is at least 75% similar, in order to group them together. This uses a lot of resources and time that we don't really have.
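The current check might look roughly like this (a minimal sketch; the function name and the shape of the candidate list are placeholders, not our actual code):

```php
<?php
// Sketch of the current O(N) grouping check: compare the new article
// against every candidate pulled from the DB using similar_text().
function findSimilarGroup(string $newBody, array $candidates, float $threshold = 75.0): ?int
{
    foreach ($candidates as $id => $body) {
        similar_text($newBody, $body, $percent); // $percent is set by reference
        if ($percent >= $threshold) {
            return $id; // group the new article with this existing one
        }
    }
    return null; // nothing similar enough; start a new group
}
```

Every insert costs one similar_text() call per candidate, which is where the time goes.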

We use MySQL and Solr for all our queries. I've tried MySQL full-text search and Solr's MoreLikeThis. Compared to the PHP implementation they are super fast and efficient, but I just can't get the kind of robust percentage score that PHP's similar_text() provides. Accuracy is crucial for our grouping.

For example using this MySQL query:

SELECT id, body,  ROUND(((MATCH(body) AGAINST ('ARTICLE TEXT')) / scores.max_score) * 100) as relevance
FROM natural_text_test,
     (SELECT MAX(MATCH(body) AGAINST('ARTICLE TEXT')) as max_score FROM natural_text_test LIMIT 1) scores
HAVING relevance > 75
ORDER BY relevance DESC

I get that an article with 130 words is 85% similar to another article with 4,700 words. By comparison, PHP's similar_text() returns only a 3% similarity score, which is well below our threshold and is correct in our case.

I've also looked into the Levenshtein distance algorithm, but it seems the same problem arises as with MySQL and Solr.
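One reason a raw Levenshtein distance misbehaves here is that it grows with document length; dividing by the longer string's length turns it into a percentage. A hedged sketch (note that older PHP versions cap levenshtein() inputs at 255 characters, so full articles would have to be truncated or chunked):

```php
<?php
// Length-normalized Levenshtein similarity, scaled to 0..100.
function levenshteinSimilarity(string $a, string $b): float
{
    $maxLen = max(strlen($a), strlen($b));
    if ($maxLen === 0) {
        return 100.0; // two empty strings are identical
    }
    // One edit per character of the longer string = 0% similar.
    return (1 - levenshtein($a, $b) / $maxLen) * 100;
}
```

With this normalization, a 130-word article against a 4,700-word one scores near zero, in line with similar_text(), but the length limit and O(n·m) cost remain.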

There has to be a better way to handle similarity checks, maybe I'm using the algorithms incorrectly?



Answer by Rick James:

Based on some of the Comments, I might propose this...

It seems that 75%-similar documents would have a lot of the same sentences in the same order.

  1. Break the doc into sentences
  2. Take a crude hash of each sentence and map it to a visible ASCII character. This gives you a string that is, perhaps, 1/100th the size of the original doc.
  3. Store that with the doc.
  4. When searching, use levenshtein() on this string to find 'similar' documents.
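The steps above might be sketched like this (a hedged outline; the sentence splitting and the 6-bit hash are simplistic placeholders, and the function names are made up for illustration):

```php
<?php
// Build a compact per-document fingerprint: one visible ASCII
// character per sentence, derived from a crude hash of that sentence.
function fingerprint(string $doc): string
{
    // Naive sentence split on ., ! and ? (real text needs more care).
    $sentences = preg_split('/[.!?]+\s*/', $doc, -1, PREG_SPLIT_NO_EMPTY);
    $fp = '';
    foreach ($sentences as $sentence) {
        $x = hexdec(substr(md5(trim($sentence)), 0, 2)) & 0x3F; // 6 bits: 0..63
        $fp .= chr(ord('0') + $x); // map into the visible ASCII range
    }
    return $fp; // store this alongside the document
}

// At search time, compare short fingerprints instead of full bodies.
function fingerprintSimilarity(string $docA, string $docB): float
{
    $a = fingerprint($docA);
    $b = fingerprint($docB);
    $maxLen = max(strlen($a), strlen($b));
    if ($maxLen === 0) {
        return 100.0;
    }
    return (1 - levenshtein($a, $b) / $maxLen) * 100;
}
```

Because the fingerprints are ~1/100th the size of the documents, levenshtein() stays cheap and well within its length limit.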

Sure, hashing is imperfect, etc. But this is fast. And you could apply some other technique to double-check the few docs that are close.

For a hash, I might do

$md5  = md5($sentence);                    // 32-char hex digest
$x    = hexdec(substr($md5, 0, 2)) & 0x3F; // one way to get 6 bits (0..63)
$hash = chr(ord('0') + $x);                // map into visible ASCII