How to make pattern matching efficient for large text datasets


I'm currently working on a project that involves processing large volumes of textual data for natural language processing tasks. One critical aspect of my pipeline involves string matching, where I need to efficiently match substrings within sentences against a predefined set of patterns.

Here's a mock example to illustrate the problem, with the following list of sentences:

sentences = [
    "the quick brown fox jumps over the lazy dog",
    "a watched pot never boils",
    "actions speak louder than words"
]

And I have a set of patterns:

patterns = [
    "quick brown fox",
    "pot never boils",
    "actions speak"
]

My goal is to efficiently identify sentences that contain any of these patterns. Additionally, I need to tokenize each sentence and perform further analysis on the matched substrings.

Currently, I'm using a brute-force approach with nested loops, but it's not scalable for large datasets. I'm looking for more sophisticated techniques or algorithms to optimize this process.
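
For reference, this is roughly what my current brute-force approach looks like (simplified):

matches = {}
for p in patterns:
    matches[p] = []
    for s in sentences:  # every pattern is checked against every sentence
        if p in s:
            matches[p].append(s)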

How can I implement string matching for this scenario, considering scalability and performance? Any suggestions would be highly appreciated!


1 Answer

Andrej Kesely

One approach that avoids brute-force searching every sentence for every pattern is to build some sort of index. With this index you limit the search space needed to find the matching sentences:

Example:

sentences = [
    "the quick brown fox jumps over the lazy dog",
    "a watched pot never boils",
    "actions speak louder than words",
]

patterns = ["quick brown fox", "pot never boils", "actions speak"]


def build_index(sentences):
    # map each unique word to the list of sentences that contain it
    out = {}

    for s in sentences:
        for word in set(s.split()):
            out.setdefault(word, []).append(s)

    return out


def find_patterns(index, patterns):
    out = {}

    for p in patterns:
        # look up only the sentences that contain the pattern's first word
        word, *_ = p.split(maxsplit=1)
        out[p] = []
        for s in index.get(word, []):
            if p in s:  # then verify the full pattern as a substring
                out[p].append(s)

    return out


index = build_index(sentences)
print(find_patterns(index, patterns))

This prints (formatted here for readability):

{
    "quick brown fox": ["the quick brown fox jumps over the lazy dog"],
    "pot never boils": ["a watched pot never boils"],
    "actions speak": ["actions speak louder than words"],
}

A more robust approach would be to use more powerful tools, such as the full-text search built into SQLite (which ships with Python); see e.g. Sqlite with real "Full Text Search" and spelling mistakes (FTS+spellfix together), etc.
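
For example, a minimal sketch using SQLite's FTS5 extension (whether FTS5 is available depends on how the bundled SQLite was built; the table name docs is just illustrative), reusing the sentences and patterns lists from above:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts5(sentence)")
con.executemany("INSERT INTO docs (sentence) VALUES (?)", [(s,) for s in sentences])

for p in patterns:
    # quote the pattern so FTS5 treats it as a phrase query
    rows = con.execute("SELECT sentence FROM docs WHERE docs MATCH ?", (f'"{p}"',)).fetchall()
    print(p, "->", [r[0] for r in rows])

FTS5 builds the inverted index for you, so it scales to much larger corpora than the hand-rolled dictionary above.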