Generate Bigrams Using Gensim Phrases and Concatenate with Tokenized Unigrams


I am trying to include frequently occurring bigrams in a set of unigram tokens using the Gensim Phrases function, but I am stuck at the last stage.
What I am getting as output is shown below (Having), where all the tokens are broken down to the character level and some of the characters are paired up (e.g. y_o).
What I want as output is shown below (Want).
In other words:
(1) from the 'col1' raw strings in a Pandas DataFrame, remove stop-words and save the output in 'col2'.
(2) Then generate bigrams using Gensim Phrases and save the output in 'col3'.
(3) Combine the outputs of 'col2' and 'col3' into 'col4', keeping all the tokens from 'col2' but only the bigrams from 'col3'.
Which part of my code is wrong? Please see my code below.

import pandas as pd
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.models import Phrases

# Example data.
data = {
    "col1": [
        'the mayor of new york was there machine learning good place',
        'good place machine learning can be useful sometimes in new york',
        'new york mayor was present new york machine learning new york',
    ]
}

# Load the data into a DataFrame.
df = pd.DataFrame(data)

# Remove stop-words using simple_preprocess.
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            result.append(token)
    return result

# Apply the above function.
df['col2'] = df['col1'].map(preprocess)

# Build a bigram model using Phrases.
def bigrams(texts):
    bigram = gensim.models.Phrases(texts, min_count=1, threshold=1)
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    return [bigram_mod[doc] for doc in texts]

# Apply the above function.
df['col3'] = df['col2'].map(bigrams)
print(df)

Having:
col1  \
0  the mayor of new york was there machine learning good place   
1  good place machine learning can be useful sometimes in new york   
2  new york mayor was present new york machine learning new york  

col2  \
0  ['mayor', 'new', 'york', 'machine', 'learning', 'good', 'place']   
1  ['good', 'place', 'machine', 'learning', 'useful', 'new', 'york']   
2  ['new', 'york', 'mayor', 'present', 'new', 'york', 'machine', 'learning', 'new', 'york'] 

col3  
0  [[m_a, y_o, r], [n_e, w], [y_o, r, k], [m_a, c...  
1  [[g, o, o, d], [p, l, a_c, e], [m, a_c, h, i_n...  
2  [[n_e, w], [y_o, r_k], [m_a, y_o, r], [p, r, e...

Want:
col1  \
0  the mayor of new york was there machine learning good place   
1  good place machine learning can be useful sometimes in new york   
2  new york mayor was present new york machine learning new york 

col2  \
0  ['mayor', 'new', 'york', 'machine', 'learning', 'good', 'place']   
1  ['good', 'place', 'machine', 'learning', 'useful', 'new', 'york']   
2  ['new', 'york', 'mayor', 'present', 'new', 'york', 'machine', 'learning', 'new', 'york'] 

col3  \
0  ['mayor', 'new_york', 'machine_learning', 'good_place']  
1  ['good_place', 'machine_learning', 'useful', 'new_york']  
2  ['new_york', 'mayor', 'present', 'new_york', 'machine_learning', 'new_york']

col4  
0  ['mayor', 'new', 'york', 'machine', 'learning', 'good', 'place', 'new_york', 'machine_learning', 'good_place']  
1  ['good', 'place', 'machine', 'learning', 'useful', 'new', 'york', 'good_place', 'machine_learning', 'new_york']  
2  ['new', 'york', 'mayor', 'present', 'new', 'york', 'machine', 'learning', 'new', 'york', 'new_york', 'new_york', 'machine_learning', 'new_york']

1 Answer

Answered by gojomo:

First problem: your .map() onto the data means more than one Phrases model is being trained, each with only a single one of your texts.

That triggers the second problem: each of your texts is a list of individual words, but Phrases expects one entire corpus: a Python re-iterable sequence (such as a list) whose items are each themselves a list of word tokens. You're instead passing in a corpus that is just a list of words, so each word is treated as a document and looks like a list of single-character tokens.
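
To make the shape mismatch concrete, here is a minimal standalone sketch (with a toy corpus of my own, not your data) contrasting the input Phrases expects with what each .map() call actually passed it:

from gensim.models import Phrases

# Correct: a corpus is a list of tokenized documents (lists of word tokens).
corpus = [['new', 'york', 'mayor'], ['machine', 'learning', 'good', 'place']]
Phrases(corpus, min_count=1, threshold=1)  # pairs whole words, e.g. 'new_york'

# Incorrect: a single tokenized document, which is what each .map() call supplies.
one_doc = ['new', 'york', 'mayor']
# Here each "document" is a string like 'new', and iterating a string yields
# its characters, so Phrases pairs characters instead, e.g. 'y_o'.
Phrases(one_doc, min_count=1, threshold=1)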

I recommend eliminating Pandas structures entirely from this part of your project; they add extra overhead and indirection. Use plain Python data structures, and create only one Phrases model, trained on your entire (preprocessed, tokenized) corpus.
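
Here is a minimal sketch of that approach, reusing the question's data and preprocessing. The col4 step, which keeps every unigram and appends only the tokens Phrases actually joined, is my reading of the desired output, and detecting joined tokens by the '_' character is a simple heuristic I'm assuming here:

import gensim
from gensim.models import Phrases
from gensim.models.phrases import Phraser

docs = [
    'the mayor of new york was there machine learning good place',
    'good place machine learning can be useful sometimes in new york',
    'new york mayor was present new york machine learning new york',
]

# (1) Tokenize and remove stop-words, as in the question's preprocess().
def preprocess(text):
    return [token for token in gensim.utils.simple_preprocess(text)
            if token not in gensim.parsing.preprocessing.STOPWORDS]

col2 = [preprocess(doc) for doc in docs]

# (2) Train ONE Phrases model on the whole tokenized corpus, then apply it.
bigram_mod = Phraser(Phrases(col2, min_count=1, threshold=1))
col3 = [bigram_mod[doc] for doc in col2]

# (3) Keep all unigrams, and append only the joined bigrams from col3
#     (assumed heuristic: '_' marks tokens that Phrases merged).
col4 = [unigrams + [tok for tok in phrased if '_' in tok]
        for unigrams, phrased in zip(col2, col3)]

for row in col4:
    print(row)

If you still need the results in a DataFrame, the finished lists can be assigned back as columns afterwards, e.g. df['col4'] = col4.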