I am doing topic modeling with the topicmodels package on a corpus of three documents. I fit the model like this:
model <- LDA(dat_dtm, method = "VEM", k = 3, control = list(alpha = 0.1))
Output:
A LDA_VEM topic model with 3 topics.
After that, I use the terms() function to obtain the top 5 words of each topic.
terms(model, 5)
Output (with made-up words):
| topic 1 | topic 2 | topic 3 |
|---|---|---|
| strong | poor | class |
| wealth | struggle | middle |
| money | homeless | money |
| power | money | sufficient |
| rich | wealth | wealth |
As you can see, the words "money" and "wealth" appear in every topic, but they are not really meaningful for my analysis. So I thought it might be a good idea to remove these words from the whole corpus and fit a new topic model without them. I tried to do this automatically: look at the top 20 words of each topic and remove from the corpus every word that appears in the top 20 of all topics. However, I only got errors, because I am not really familiar with the topicmodels package. Obviously, I could just add these words to the stop word list manually, but maybe there is a more systematic way to do it?
Thank you in advance!
I think the easiest way is to make a vector object of the top 20 words and add it to your stop word list.
You can use tidyverse to pick out these words for each topic. This gives you a vector object called remove_words, which should be added to your stop word list before fitting a new topic model.

Hope this helps!
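For what it's worth, here is a minimal base R sketch of that idea (not the tidyverse code from the answer). It assumes that terms(model, 20) returns a character matrix with one column per topic, which is what it does when k > 1. To keep the example runnable without a fitted model, top_terms and toy_dtm below are hypothetical stand-ins built from the made-up words in the question.

```r
# Hypothetical stand-in for the character matrix returned by terms(model, 5):
# one column per topic, one row per rank. With a real model you would use
# something like: top_terms <- terms(model, 20)
top_terms <- cbind(
  "Topic 1" = c("strong", "wealth", "money", "power", "rich"),
  "Topic 2" = c("poor", "struggle", "homeless", "money", "wealth"),
  "Topic 3" = c("class", "middle", "money", "sufficient", "wealth")
)

# Split the matrix into one character vector per topic, then keep only
# the words that occur in every one of those vectors.
terms_by_topic <- lapply(seq_len(ncol(top_terms)), function(j) top_terms[, j])
remove_words <- Reduce(intersect, terms_by_topic)
print(remove_words)  # here: the two words shared by all topics

# Toy document-term matrix with made-up counts (assumption: terms are the
# column names, as in a DocumentTermMatrix).
toy_dtm <- matrix(
  c(2, 1, 0, 3,
    0, 2, 1, 1,
    1, 1, 2, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"),
                  c("money", "wealth", "poor", "class"))
)

# Drop the columns for the shared words; drop = FALSE keeps a matrix even
# if only one column is left.
filtered_dtm <- toy_dtm[, !(colnames(toy_dtm) %in% remove_words), drop = FALSE]
print(colnames(filtered_dtm))
```

The same column subsetting should carry over to a real DocumentTermMatrix, e.g. dat_dtm[, !(colnames(dat_dtm) %in% remove_words)], after which you can call LDA() again. One thing to watch for: LDA() needs every document to keep at least one non-zero entry, so you may have to drop documents that become empty after the removal.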