Efficient Text Search for Large Term List: Python vs PostgreSQL vs Elasticsearch


I have a list of terms that vary in length from 1 to 10 words, with approximately 500,000 entries. My goal is to search for these terms in a long text (converted from a PDF, typically 1.5 to 2 pages long). I need to perform the searches not only as exact matches but also with fuzzy matching (e.g., the term 'Lionel Messi' should match 'Lionel Mesi' in the text) and near/proximity matching (e.g., the term 'Lionel Messi' should match 'Lionel J. Messi' in the text).

I aim to solve this problem in near real-time (1-2 seconds). I've tried using trie data structures and parallelization, but especially when fuzzy matching comes into play, the size of the list and the length of the PDF text lead to long processing times (about 30 seconds).

How should I approach this problem?

  1. Can I handle it on the fly with Python libraries (using parallelization, trie structures, etc.)?
  2. Are there features in PostgreSQL that support such searches?
  3. Should I use a framework like Elasticsearch?
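For concreteness, here is a stripped-down sketch of the kind of pure-Python scan involved, using stdlib difflib in place of my actual trie code (the threshold is illustrative). The term-count times window-count loop is what blows up at 500,000 entries:

```python
from difflib import SequenceMatcher

def fuzzy_find(terms, text, threshold=0.9):
    """Scan `text` with a sliding word window of each term's length,
    comparing each window against the term with difflib's ratio.
    Cost is O(len(terms) * number_of_windows), which is why 500k
    terms against a 2-page text becomes slow."""
    words = text.lower().split()
    hits = []
    for term in terms:
        term_lc = term.lower()
        n = len(term_lc.split())
        for i in range(len(words) - n + 1):
            window = " ".join(words[i:i + n])
            if SequenceMatcher(None, term_lc, window).ratio() >= threshold:
                hits.append((term, window))
                break  # first hit per term is enough here
    return hits

if __name__ == "__main__":
    print(fuzzy_find(["Lionel Messi"], "the player lionel mesi scored"))
```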

There are 3 answers

Florian Klein

Postgres offers native full-text search capabilities (https://www.postgresql.org/docs/current/textsearch.html) and trigram similarity search via pg_trgm (https://www.postgresql.org/docs/current/pgtrgm.html),

and there are also extensions like the recent pg_bm25 (https://docs.paradedb.com/blog/introducing_bm25) or pgvector (https://github.com/pgvector/pgvector), which works on vectors generated (e.g.) by LLMs.
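pg_trgm scores matches by comparing sets of three-character substrings. A minimal pure-Python illustration of that similarity measure (a simplified version; pg_trgm's exact normalization rules differ in some details):

```python
def trigrams(s):
    """Trigram set of a lowercased string, padded roughly the way
    pg_trgm does: two spaces in front, one behind."""
    s = "  " + s.lower() + " "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    """Shared trigrams divided by total distinct trigrams (Jaccard),
    which is the idea behind pg_trgm's similarity() function."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)
```

Here `similarity("messi", "mesi")` comes out to 4/7. In Postgres itself, a pg_trgm query would filter with the `%` similarity operator, backed by a GIN or GiST index on the column, rather than computing this in application code.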

Musab Dogan

Hi Batuhan, Elasticsearch fuzzy search can handle this query in near real-time. In fact, this is the main reason why Lucene, Elasticsearch, and OpenSearch exist.

Have you indexed your data in Elasticsearch and measured the time it takes for a "Lionel Messi" search?
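For reference, this is the kind of request body such a measurement would use: a match query with Elasticsearch's documented `fuzziness` option (the `content` field name here is hypothetical):

```python
import json

# "AUTO" is Elasticsearch's documented fuzziness setting: it scales
# the allowed edit distance with the length of each query term.
# The field name "content" is an assumption about your mapping.
query_body = {
    "query": {
        "match": {
            "content": {
                "query": "Lionel Messi",
                "fuzziness": "AUTO",
            }
        }
    }
}

print(json.dumps(query_body))
```

With the official Python client, this dict would be passed to the `search` call against your index.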

Here is a good article that explains the Damerau-Levenshtein distance: https://www.elastic.co/blog/found-fuzzy-search
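The distance that article discusses can be sketched in a few lines of Python. This is the restricted "optimal string alignment" variant, where insertions, deletions, substitutions, and adjacent transpositions each cost 1:

```python
def osa_distance(a, b):
    """Optimal-string-alignment Damerau-Levenshtein distance:
    edits are insert, delete, substitute, and adjacent transposition."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

So `osa_distance("lionel messi", "lionel mesi")` is 1 (one deleted character), which is exactly the kind of edit Elasticsearch's fuzziness tolerates without scanning your 500k terms one by one.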

philippemnoel

You have a few options:

  • Elasticsearch, Typesense, etc. can definitely give you this. You will need to manage them yourself, though, as they are a separate system.

  • For Postgres, I would recommend pg_bm25: https://github.com/paradedb/paradedb/tree/dev/pg_bm25. It's free, and benchmarks show this type of search takes ~30ms. You can see them here: https://github.com/paradedb/paradedb/blob/dev/benchmarks/README.md#pg_bm25. Disclaimer: I work on pg_bm25.

  • You can get surprisingly far with Postgres tsvector. I don't know if you can hit your 1-2 second target with this much data, but it's low-overhead enough to try and see.

  • Lastly, there's a tool called ZomboDB which allows interoperating Postgres and Elasticsearch. If you really want to use both, it's probably your best bet to streamline your workflow :)