Generate Bigrams Using Gensim Phrases and Concatenate with Tokenized Unigrams


I am trying to include frequently occurring bigrams in a set of unigram tokens using the Gensim Phrases function, but I am stuck at the last stage.
What I am getting as output is shown below (Having), where all the tokens are broken down to the character level and some of the characters are paired up (e.g. y_o).
What I want as output is shown below (Want).
In other words:
(1) from the 'col1' raw strings in a Pandas DataFrame, remove stop-words and save the output in 'col2'.
(2) Then generate bigrams using Gensim Phrases and save the output in 'col3'.
(3) Combine the outputs of 'col2' and 'col3' into 'col4', keeping all the tokens from 'col2' but only the bigrams from 'col3'.
Which part of my code is wrong? Please see my code below.

import pandas as pd
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.models import Phrases

# Example data.
data = {
    "col1": [
        'the mayor of new york was there machine learning good place',
        'good place machine learning can be useful sometimes in new york',
        'new york mayor was present new york machine learning new york',
    ]
}

# Load the data into a DataFrame.
df = pd.DataFrame(data)

# Remove stop-words using simple_preprocess.
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            result.append(token)
    return result

# Apply the above function.
df['col2'] = df['col1'].map(preprocess)

# Build a bigram model using Phrases.
def bigrams(texts):
    bigram = gensim.models.Phrases(texts, min_count=1, threshold=1)
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    return [bigram_mod[doc] for doc in texts]

# Apply the above function.
df['col3'] = df['col2'].map(bigrams)
print(df)

Having:
col1  \
0  the mayor of new york was there machine learning good place   
1  good place machine learning can be useful sometimes in new york   
2  new york mayor was present new york machine learning new york  

col2  \
0  ['mayor', 'new', 'york', 'machine', 'learning', 'good', 'place']   
1  ['good', 'place', 'machine', 'learning', 'useful', 'new', 'york']   
2  ['new', 'york', 'mayor', 'present', 'new', 'york', 'machine', 'learning', 'new', 'york'] 

col3  
0  [[m_a, y_o, r], [n_e, w], [y_o, r, k], [m_a, c...  
1  [[g, o, o, d], [p, l, a_c, e], [m, a_c, h, i_n...  
2  [[n_e, w], [y_o, r_k], [m_a, y_o, r], [p, r, e...

Want:
col1  \
0  the mayor of new york was there machine learning good place   
1  good place machine learning can be useful sometimes in new york   
2  new york mayor was present new york machine learning new york 

col2  \
0  ['mayor', 'new', 'york', 'machine', 'learning', 'good', 'place']   
1  ['good', 'place', 'machine', 'learning', 'useful', 'new', 'york']   
2  ['new', 'york', 'mayor', 'present', 'new', 'york', 'machine', 'learning', 'new', 'york'] 

col3  \
0  ['mayor', 'new_york', 'machine_learning', 'good_place']  
1  ['good_place', 'machine_learning', 'useful', 'new_york']  
2  ['new_york', 'mayor', 'present', 'new_york', 'machine_learning', 'new_york']

col4  
0  ['mayor', 'new', 'york', 'machine', 'learning', 'good', 'place', 'new_york', 'machine_learning', 'good_place']  
1  ['good', 'place', 'machine', 'learning', 'useful', 'new', 'york', 'good_place', 'machine_learning', 'new_york']  
2  ['new', 'york', 'mayor', 'present', 'new', 'york', 'machine', 'learning', 'new', 'york', 'new_york', 'new_york', 'machine_learning', 'new_york']

1 Answer

Answered by gojomo:

First problem: your .map() onto the data means more than one Phrases model is being trained, each with only a single one of your texts.

That triggers the second problem: each of your texts is a list of individual words, but Phrases expects one entire corpus: a Python re-iterable sequence (such as a list) whose items are each themselves a list of word tokens. You're instead passing in a corpus that is just a list of words, so each word is treated as a document and looks like a list of single-character tokens.
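
To make the shape mismatch concrete, here is a minimal standalone sketch (with a toy corpus of my own, not your data) contrasting the input Phrases expects with what each .map() call actually passed it:

from gensim.models import Phrases

# Correct: a corpus is a list of tokenized documents (lists of word tokens).
corpus = [['new', 'york', 'mayor'], ['machine', 'learning', 'good', 'place']]
Phrases(corpus, min_count=1, threshold=1)  # pairs whole words, e.g. 'new_york'

# Incorrect: a single tokenized document, which is what each .map() call supplies.
one_doc = ['new', 'york', 'mayor']
# Here each "document" is a string like 'new', and iterating a string yields
# its characters, so Phrases pairs characters instead, e.g. 'y_o'.
Phrases(one_doc, min_count=1, threshold=1)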

I recommend eliminating Pandas structures entirely from this part of your project; they add extra overhead and indirection. Use plain Python data structures, and create only one Phrases model, trained on your entire (preprocessed, tokenized) corpus.
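
Here is a minimal sketch of that approach, reusing the question's data and preprocessing. The col4 step, which keeps every unigram and appends only the tokens Phrases actually joined, is my reading of the desired output, and detecting joined tokens by the '_' character is a simple heuristic I'm assuming here:

import gensim
from gensim.models import Phrases
from gensim.models.phrases import Phraser

docs = [
    'the mayor of new york was there machine learning good place',
    'good place machine learning can be useful sometimes in new york',
    'new york mayor was present new york machine learning new york',
]

# (1) Tokenize and remove stop-words, as in the question's preprocess().
def preprocess(text):
    return [token for token in gensim.utils.simple_preprocess(text)
            if token not in gensim.parsing.preprocessing.STOPWORDS]

col2 = [preprocess(doc) for doc in docs]

# (2) Train ONE Phrases model on the whole tokenized corpus, then apply it.
bigram_mod = Phraser(Phrases(col2, min_count=1, threshold=1))
col3 = [bigram_mod[doc] for doc in col2]

# (3) Keep all unigrams, and append only the joined bigrams from col3
#     (assumed heuristic: '_' marks tokens that Phrases merged).
col4 = [unigrams + [tok for tok in phrased if '_' in tok]
        for unigrams, phrased in zip(col2, col3)]

for row in col4:
    print(row)

If you still need the results in a DataFrame, the finished lists can be assigned back as columns afterwards, e.g. df['col4'] = col4.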