I want to calculate text similarity using only the words that carry a specific POS tag. Currently I calculate similarity with the cosine method, but it does not take POS tagging into account.
library(quanteda)

# Two sets of service names to compare
A <- data.frame(name = c(
  "X-ray right leg arteries",
  "consultation of gynecologist",
  "x-ray leg arteries",
  "x-ray leg with 20km distance"
), stringsAsFactors = FALSE)
B <- data.frame(name = c(
  "X-ray left leg arteries",
  "consultation (inspection) of gynecalogist",
  "MRI right leg arteries",
  "X-ray right leg arteries with special care"
), stringsAsFactors = FALSE)

# One corpus per set, with document names A.1 ... and B.1 ...
corp1 <- corpus(A, text_field = "name")
corp2 <- corpus(B, text_field = "name")
docnames(corp1) <- paste("A", seq_len(ndoc(corp1)), sep = ".")
docnames(corp2) <- paste("B", seq_len(ndoc(corp2)), sep = ".")

# Bigram document-feature matrix over both sets
dtm3 <- rbind(dfm(corp1, ngrams = 2), dfm(corp2, ngrams = 2))

# Cosine similarity of each document in B against all documents in A
cosines <- lapply(docnames(corp2),
                  function(x) textstat_simil(dtm3[c(x, docnames(corp1)), ],
                                             method = "cosine",
                                             selection = x)[-1, , drop = FALSE])
do.call(cbind, cosines)
In the above example, "X-ray right leg arteries" should not be mapped to "MRI right leg arteries", as these are two different categories of services. Unfortunately, I don't have an explicit categorization of services; I only have the service text. Is it possible to use POS tagging to assign more importance to words such as "X-ray", "consultation", "leg" and "arteries"? The services in the code are just a sample. In reality, I have more than 10K services. I explored the udpipe package for POS tagging but didn't have much success.
POS tagging with udpipe works directly on your example data A and B: annotate each service name, which gives one row per token together with its lemma and its universal POS tag (the upos column).
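The annotation step could look roughly like the sketch below. It assumes the pretrained English model that `udpipe_download_model` fetches (network access is needed the first time); the `services` data frame and the `A.1` ... `B.4` document ids are names chosen here to mirror the question, not anything udpipe requires.

```r
library(udpipe)

# Example data from the question
A <- data.frame(name = c("X-ray right leg arteries",
                         "consultation of gynecologist",
                         "x-ray leg arteries",
                         "x-ray leg with 20km distance"),
                stringsAsFactors = FALSE)
B <- data.frame(name = c("X-ray left leg arteries",
                         "consultation (inspection) of gynecalogist",
                         "MRI right leg arteries",
                         "X-ray right leg arteries with special care"),
                stringsAsFactors = FALSE)

# One row per service; document ids follow the question (A.1 ... B.4)
services <- data.frame(
  doc_id = c(paste("A", seq_len(nrow(A)), sep = "."),
             paste("B", seq_len(nrow(B)), sep = ".")),
  text   = c(A$name, B$name),
  stringsAsFactors = FALSE
)

# Download (once) and load a pretrained English model
model_info <- udpipe_download_model(language = "english")
ud_model   <- udpipe_load_model(file = model_info$file_model)

# Tokenise, lemmatise and POS-tag every service name
ann <- udpipe_annotate(ud_model, x = services$text, doc_id = services$doc_id)
ann <- as.data.frame(ann)

# One row per token with its lemma and universal POS tag
ann[, c("doc_id", "token", "lemma", "upos")]
```

Which tag a token such as "X-ray" or "MRI" receives depends on the model, so it is worth inspecting the `upos` column before deciding which tags to keep.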
If you want to calculate similarities based on a document-term matrix of the lemmas, you can use sim2 from the text2vec R package.
If you also want to bring n-grams of nouns into the game, extract nouns that follow one another, create a document-term matrix of these new compound terms, and combine it with the existing document-term matrix so that document similarities are easy to run.
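Both steps can be sketched together as below. A few things here are assumptions rather than part of the original answer: only tokens tagged `NOUN` or `PROPN` are kept, the noun n-grams are limited to bigrams built by hand from adjacent rows, and the unigram and bigram term lists are stacked before building a single matrix instead of combining two finished matrices (udpipe also offers `dtm_cbind` for that route).

```r
library(udpipe)
library(text2vec)

# Service names from the question, ids A.1 ... A.4 and B.1 ... B.4
services <- data.frame(
  doc_id = c(paste("A", 1:4, sep = "."), paste("B", 1:4, sep = ".")),
  text   = c("X-ray right leg arteries",
             "consultation of gynecologist",
             "x-ray leg arteries",
             "x-ray leg with 20km distance",
             "X-ray left leg arteries",
             "consultation (inspection) of gynecalogist",
             "MRI right leg arteries",
             "X-ray right leg arteries with special care"),
  stringsAsFactors = FALSE
)

# Annotate with a pretrained English model (downloaded on first use)
model_info <- udpipe_download_model(language = "english")
ud_model   <- udpipe_load_model(file = model_info$file_model)
ann <- as.data.frame(
  udpipe_annotate(ud_model, x = services$text, doc_id = services$doc_id)
)

# Keep noun lemmas only (which tags count as "noun" is an assumption)
is_noun  <- ann$upos %in% c("NOUN", "PROPN")
unigrams <- data.frame(doc_id = ann$doc_id[is_noun],
                       term   = tolower(ann$lemma[is_noun]),
                       stringsAsFactors = FALSE)

# Compound terms: two nouns directly following one another in the same document
i        <- seq_len(nrow(ann) - 1)
adjacent <- is_noun[i] & is_noun[i + 1] & ann$doc_id[i] == ann$doc_id[i + 1]
bigrams  <- data.frame(doc_id = ann$doc_id[i][adjacent],
                       term   = tolower(paste(ann$lemma[i][adjacent],
                                              ann$lemma[i + 1][adjacent])),
                       stringsAsFactors = FALSE)

# One document-term matrix over noun lemmas plus noun bigrams
freqs <- document_term_frequencies(rbind(unigrams, bigrams))
dtm   <- document_term_matrix(freqs)

# Cosine similarity between all documents, then B (rows) against A (columns)
sim    <- sim2(dtm, method = "cosine", norm = "l2")
sim_BA <- sim[grepl("^B", rownames(sim)), grepl("^A", colnames(sim)), drop = FALSE]
sim_BA
```

Documents without any retained noun do not appear in `dtm`, so `sim_BA` can have fewer rows or columns than there are services. Dropping the `bigrams` part leaves the plain lemma-based similarity.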