Matching hundreds of non-adjacent keywords in a large text corpus in Python

I need to match non-adjacent keywords in a large collection of texts (several thousand). If a text matches, a label is assigned; otherwise the label "unknown" is assigned.

To provide an example, I would like to find the keywords sales representative and dealt in the below text snippet and assign it the category keyword pattern A:

Text: "The sales representative dealt with everything. It was very helpful to know that he compiled the best option for me."

  • The keyword pattern is thus sales representative and dealt
  • Since sales representative might also be called sales rep or customer rep, there are multiple keywords I need to match. The same holds true for the word dealt. So you see where it gets complex.

There are many solutions for finding and matching unigrams or adjacent words (n-grams). I have implemented this myself. Now I need to find different keywords that are not written next to each other and assign a label. Also, I don't know what is written between the different keywords. It could be anything.


I am approaching the problem lexically: I look up the keywords in a dictionary with different columns to accommodate the matching of one, two, or three keywords. Note that a keyword is always a unigram or a bigram. Below is some code I have written.

import pandas as pd 

# create mock dictionary
Dict = pd.DataFrame({'word1':['dealt','dealt','dealt',''],
                     'word2':['sales representative','sales rep', 'customer rep', 'options']
                      }  )

#create sample text 
texts = ["The sales representative dealt with everything.",
"The sales rep dealt with everything.",
"The agent answered all questions" ,
"The customer rep answered all questions.",
"The agent dealt with everything."]

motive =[]
# only checks for the keyword in the first column  
for item in texts:
    item = str(item)
    if any(x in item for x in Dict['word1']):
        motive.append('keyword pattern A')
    else:
        motive.append('unknown')

The label should only be assigned when both dealt and sales rep are present in the text, so sentences 3 and 5 are incorrectly assigned. I have therefore updated the code. It runs through but does not assign any labels.

for item in texts:
    #convert into string
    item = str(item)
    #check if keyword can be found in first column
    tempM1 = {x for x in Dict['word1'] if x in item}
    #check if keyword was found
    if tempM1 != None:
        #if yes, locate all of their positions in the dictionary 
        for i in tempM1:
            i = -1
            #get row index 
            ind = Dict.index[Dict['word1'] == list(tempM1)[i+1]] 
            #gives pandas.core.indexes.base.Index
            #check if column next to given row index is not empty
            if pd.isnull(Dict['word2'].iloc[ind]) is False:
                #match keyword in second column
                tempM2 = {x for x in Dict['word2'] if x in item}
                #if second keyword was found
                if tempM2 != None: 
                    motive.append('keyword pattern A')
                else: 
                    #check again first keyword column
                    tempM3 = {x for x in Dict['word1'] if x in item}
                    if tempM3 != None:
                        motive.append('keyword pattern A')
                    else: 
                        motive.append('unknown')

How can I tweak the above code?

I know about regular expressions (RegEx). It seems to me that they would require more lines of code and be less efficient given the number of keywords (about 700 to 1000) and the combinations between them. Happy to be proven wrong, though!

Also, I know it can be viewed as a classification problem. Explanation and transparency are required in the project, so deep learning and the like are not an option. For the same reason, I am not considering embeddings.

Thanks!

There is 1 answer

JonSG On

Can you leverage all() and any() to check whether a text contains "any" keyword from "all" of a pattern's keyword lists?

phrases_to_find = [
    [
        ["dealt"],
        ["sales representative", "sales rep", "customer rep"]
    ],
    [
        ["option"]
    ]
]

texts = [
    "The sales representative dealt with everything.",
    "The sales rep dealt with everything.",
    "The agent answered all questions" ,
    "The customer rep answered all questions.",
    "The agent dealt with everything.",
    "Here is some option."
]

motive =[]
for text in texts:
    for index, test_phrases in enumerate(phrases_to_find):
        if all(any(p in text for p in phrase) for phrase in test_phrases):
            motive.append(f'keyword pattern {index}')
            break
    else:
        motive.append('unknown')

print(motive)

That should give you:

[
    'keyword pattern 0',
    'keyword pattern 0',
    'unknown',
    'unknown',
    'unknown',
    'keyword pattern 1'
]
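
One caveat with the `in` test above is that it matches substrings, so "option" is also found inside "options" and "rep" inside "represent". If whole-word matching is wanted, the same all()/any() idea can be sketched with one compiled regular expression per keyword list. This is only a rough sketch under my own assumptions: the names PATTERNS, compile_group and label_text are made up for illustration, case-insensitive matching is a guess at what is needed, and whether plurals should count as a match is for you to decide. Returning the matched keywords alongside the label may help with the transparency requirement.

```python
import re

# Hypothetical pattern table: label -> list of keyword lists.
# A text gets a label when every list has at least one keyword in the text.
PATTERNS = {
    "keyword pattern 0": [
        ["dealt"],
        ["sales representative", "sales rep", "customer rep"],
    ],
    "keyword pattern 1": [["option"]],
}


def compile_group(keywords):
    """Compile one keyword list into a single whole-word alternation."""
    # Longest keywords first, so the longer phrase is tried before its prefix.
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


# Compile once up front; each text is then scanned once per keyword list.
COMPILED = {
    label: [compile_group(group) for group in groups]
    for label, groups in PATTERNS.items()
}


def label_text(text, default="unknown"):
    """Return (label, matched keywords) for the first fully matching pattern."""
    for label, regexes in COMPILED.items():
        hits = [rx.search(text) for rx in regexes]
        if all(hits):  # every keyword list produced a match
            return label, [m.group(0) for m in hits]
    return default, []


texts = [
    "The sales representative dealt with everything.",
    "The sales rep dealt with everything.",
    "The agent answered all questions",
    "The customer rep answered all questions.",
    "The agent dealt with everything.",
    "Here is some option.",
]

motive = [label_text(text)[0] for text in texts]
print(motive)
```

For the six sample texts this assigns the same labels as the list above. With 700 to 1000 keywords the number of compiled expressions stays equal to the number of keyword lists, not the number of keyword combinations, so the amount of code does not grow with the dictionary.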