I am struggling to find the correct way to count types (unique word forms) in a quanteda corpus. ntype() gives the number of types per document, but not for the corpus as a whole.
I found two ways to get this count, but they give different results and I don’t understand why.
Reproducible code:
library(quanteda)
corp_uk <- corpus(data_char_ukimmig2010)
corp_uk_tokens <- tokens(corp_uk, remove_punct = TRUE)
nfeat(dfm(corp_uk_tokens))
length(types(corp_uk_tokens))
nfeat(dfm(corp_uk_tokens)) outputs 1648
length(types(corp_uk_tokens)) outputs 1804
Which one is correct, and why don’t those two calculations give the same result?
Thanks a lot for helping!
It's because dfm() has tolower = TRUE as a default, so the nfeat() has combined some types due to lowercasing. If you turn this off, you will get the same result as the length of the types().
Created on 2024-03-28 with reprex v2.1.0
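As a minimal sketch of that fix, reusing the data and tokenization from the question: the variable names (toks, n_default, n_cased, n_types, n_lowered) are only illustrative. The last step, which lowercases the tokens with tokens_tolower() before counting types, is an extra check that was not part of the original answer and assumes dfm() lowercases in the same way.

```r
library(quanteda)

# Same tokenization as in the question
toks <- tokens(corpus(data_char_ukimmig2010), remove_punct = TRUE)

# Default: dfm() lowercases, so "The" and "the" become one feature
n_default <- nfeat(dfm(toks))

# Lowercasing switched off: features keep their original case
n_cased <- nfeat(dfm(toks, tolower = FALSE))

# Case-sensitive type count taken directly from the tokens object
n_types <- length(types(toks))

# Assumption: lowercasing the tokens first should reproduce the default dfm count
n_lowered <- length(types(tokens_tolower(toks)))

print(c(default = n_default, cased = n_cased,
        types = n_types, lowered = n_lowered))
```

Which count is "correct" depends on whether case should distinguish types: use the tolower = FALSE count for case-sensitive types, and the default count if "The" and "the" should be treated as the same type.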