Using python for text analytics

Question

Asked by Ebelechukwu Nwafor on 10 June 2015 at 23:58

I am trying to write a program that checks whether each word in a list is contained in a text file. I was thinking of using the intersection of two sets to accomplish this. Is there a more efficient way of achieving this?

There are 2 answers

**Alexander McFarlane** · Answer 1 · 2015-06-11T00:37:59+00:00

Quick & Easy Method

textblob is a library for text analysis.

This part of the docs describes how to obtain word and noun frequencies, e.g.

>>> from textblob import TextBlob
>>> monty = TextBlob("We are no longer the Knights who say Ni. "
...                  "We are now the Knights who say Ekki ekki ekki PTANG.")
>>> monty.words.count('ekki', case_sensitive=False)
3

Higher Performance, Slower Method

If performance is a big concern, try cleaning the file into a list of words with a regex and then getting the frequencies with Counter from the collections module:

from collections import Counter
words = ['b','b','the','the','the','c']

print(Counter(words))
# Counter({'the': 3, 'b': 2, 'c': 1})
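
The regex cleaning step is described above but not shown. Here is a minimal sketch of how it might look. The function name `count_words` and the token pattern are assumptions made for illustration, not part of the original answer, and the pattern is deliberately simplistic (lowercase ASCII letters, digits, and apostrophes only).

```python
import re
from collections import Counter

def count_words(text):
    """Lowercase the text, extract word-like tokens, and count them."""
    # Illustrative token pattern (an assumption): runs of letters, digits, apostrophes
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

sample = "The cat saw the dog. The dog ran!"
counts = count_words(sample)
print(counts["the"])   # 3
print(counts["bird"])  # 0 (a Counter returns 0 for missing keys)
```

For a real file, you would pass the result of `f.read()` to the function instead of the sample string.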

Higher Performance Method for a Single Query

If you are going to query lots of words, build the Counter once and reuse it. For a single, non-repeated query, calling count on the list directly is faster, since it skips building the Counter:

print(words.count('the'))
# 3

If you want even higher performance then use a high performance programming language!

**Utsav T** · Answer 2 · 2015-06-11T00:43:03+00:00

Hashing can also be used for a quick lookup.

Read the file and parse the text.
Store each unseen (new) word in a hash table.
Finally, check whether each word in your lookup list is present in the hash table.

Dictionaries in Python are implemented using hash tables, so a dict is a good choice here. This could serve as starter code:

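lookup_list = ["word1", "word2", "word3"]
dictionary = {}

# The with block closes the file automatically, so no f.close() is needed
with open("myfile.txt", "r") as f:
    file_data = f.read().split()

# Count each word; `in dictionary` is a hash lookup, so .keys() is unnecessary
for word in file_data:
    if word not in dictionary:
        dictionary[word] = 1
    else:
        dictionary[word] += 1

# Keep the lookup words that appear in the file
result = [i for i in lookup_list if i in dictionary]

print(result)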
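
If the counts are not needed, the set intersection mentioned in the question rests on the same hashing idea, since Python sets are also hash-based. A minimal sketch, assuming whitespace splitting and case-sensitive matching; the helper name `words_present` and the sample text are made up for illustration:

```python
def words_present(lookup_words, text):
    """Return the lookup words that occur in text (whitespace-split, case-sensitive)."""
    # Both sides become hash-based sets, so the intersection avoids nested loops
    return set(lookup_words) & set(text.split())

sample_text = "word1 appears here and so does word3"
found = words_present(["word1", "word2", "word3"], sample_text)
print(sorted(found))  # ['word1', 'word3']
```

With a file, `text` would be the string returned by `f.read()`. Note that punctuation attached to a word (e.g. "word1,") would prevent a match under this simple split.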
