I'm trying to go through multiple sentences in a text. Each sentence is stored in a nested list structure (i.e. a list in which each sentence is itself a list of tokens). I then want to apply a POS tag to each token in the sentence and store the results in another nested list. Ultimately, I want to add this to a DataFrame and export it to Excel in one column (where each row is a sentence).
The trouble I'm having is that the POS tag list only seems to capture the last sentence in the text. Here is part of the code:
    for sentences in doc1.sents:  # iterates over sentences in doc
        for match_id, start, end in phrase_matcher(nlp(sentences.text)):
            if nlp.vocab.strings[match_id] in ["key"]:
                found_sentences = sentences.text
                duplicate_sentence_list.append(found_sentences)
                all_separated_words_list.append(text_preprocessing(found_sentences))
    tokens = nltk.word_tokenize(sentence)
    tags = nltk.pos_tag(tokens)
    pos_list.append(tags)
When I try adding the POS tag to a for loop like below:
    for sentences in doc1.sents:  # iterates over sentences in doc
        for match_id, start, end in phrase_matcher(nlp(sentences.text)):
            if nlp.vocab.strings[match_id] in ["key"]:
                found_sentences = sentences.text
                duplicate_sentence_list.append(found_sentences)
                all_separated_words_list.append(text_preprocessing(found_sentences))
                for i in found_sentences:
                    pos_list.append(nltk.pos_tag(i))
I get this error:
    TypeError: tokens: expected a list of strings, got a string
When I change the for loop to use the nested list (all_separated_words_list), I get this error:
    AttributeError                            Traceback (most recent call last)
    /var/folders/6g/n1v5s0vj77xc2htytg4spx_r0000gn/T/ipykernel_17689/361983526.py in
         14  all_separated_words_list.append(text_preprocessing(found_sentences))
         15  for i in found_sentences:
         16  pos_list.append(nltk.pos_tag(all_separated_words_list))
         17  # tokens = nltk.word_tokenize(i)
         18  # tags = nltk.pos_tag(tokens)

    ~/opt/anaconda3/lib/python3.9/site-packages/nltk/tag/__init__.py in pos_tag(tokens, tagset, lang)
        164  """
        165  tagger = _get_tagger(lang)
        166  return _pos_tag(tokens, tagset, tagger, lang)
        167
        168

    ~/opt/anaconda3/lib/python3.9/site-packages/nltk/tag/__init__.py in _pos_tag(tokens, tagset, tagger, lang)
        121
        122  else:
        123  tagged_tokens = tagger.tag(tokens)
        124  if tagset: # Maps to the specified tagset.
        125  if lang == "eng":

    ~/opt/anaconda3/lib/python3.9/site-packages/nltk/tag/perceptron.py in tag(self, tokens, return_conf, use_tagdict)
        178  output = []
        ...
        277  if word.isdigit() and len(word) == 4:
        278  return "!YEAR"
        279  if word and word[0].isdigit():

    AttributeError: 'list' object has no attribute 'isdigit'
I'm not sure how to proceed and would appreciate any help.
From the error message, nltk.pos_tag expected each token to be a string, but it got a list instead.
I suspect that at this point you think you're giving it a single sentence and having it iterate over the words, but all_separated_words_list, which the traceback shows being passed to nltk.pos_tag, is actually a list of sentences. So when the tagger iterates over it, it finds a list (the tokenized sentence) instead of a string (the individual word).
Go back over your code, looking at the output of each line, and you'll be able to see where it goes wrong.
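To make the shape issue concrete, here is a minimal sketch of tagging one tokenized sentence at a time and appending inside the loop, so every sentence is kept rather than only the last one. The names tag_sentences and toy_tagger are hypothetical and not from your code; toy_tagger is only a stand-in so the snippet runs without NLTK data, and the assumption is that you would pass nltk.pos_tag in its place.

```python
from typing import Callable, List, Sequence, Tuple

Tagged = List[Tuple[str, str]]


def tag_sentences(
    token_lists: Sequence[Sequence[str]],
    tagger: Callable[[List[str]], Tagged],
) -> List[Tagged]:
    """Tag each tokenized sentence separately; return one tagged list per sentence."""
    pos_list = []
    for tokens in token_lists:
        # Guard against the two shapes behind the errors in the question.
        if isinstance(tokens, str):
            raise TypeError("expected a list of tokens, got a single string")
        if any(not isinstance(tok, str) for tok in tokens):
            raise TypeError("expected a flat list of strings, got a nested list")
        # Tag and append inside the loop, once per sentence.
        pos_list.append(tagger(list(tokens)))
    return pos_list


def toy_tagger(tokens: List[str]) -> Tagged:
    # Hypothetical stand-in tagger; swap in nltk.pos_tag for real POS tags.
    return [(tok, "NUM" if tok.isdigit() else "WORD") for tok in tokens]


# Two already-tokenized sentences (illustrative data only).
sentences = [["the", "key", "is", "here"], ["room", "42", "has", "a", "key"]]
pos_list = tag_sentences(sentences, toy_tagger)
```

With NLTK installed and its tagger data downloaded, calling tag_sentences(all_separated_words_list, nltk.pos_tag) should give one list of (token, tag) pairs per sentence, assuming text_preprocessing returns a flat list of strings for each sentence. That nested list can then go into a single DataFrame column, one row per sentence.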