I am trying to compute the cosine similarity between two words in a sentence. The sentence is "The black cat sat on the couch and the brown dog slept on the rug".
My Python code is below:
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings
warnings.filterwarnings(action = 'ignore')
import gensim
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity
sentence = "The black cat sat on the couch and the brown dog slept on the rug"
# Replace newline characters with spaces
f = sentence.replace("\n", " ")
data = []
# sentence parsing
for i in sent_tokenize(f):
    temp = []
    # tokenize the sentence into words
    for j in word_tokenize(i):
        temp.append(j.lower())
    data.append(temp)
print(data)
# Creating Skip Gram model
model2 = gensim.models.Word2Vec(data, min_count = 1, vector_size = 512, window = 5, sg = 1)
# Print results
print("Cosine similarity between 'black' " +
"and 'brown' - Skip Gram : ",
model2.wv.similarity('black', 'brown'))
As "black" and "brown" are of colour type, their cosine similarity should be maximum (somewhere around 1). But my result shows following:
[['the', 'black', 'cat', 'sat', 'on', 'the', 'couch', 'and', 'the', 'brown', 'dog', 'slept', 'on', 'the', 'rug']]
Cosine similarity between 'black' and 'brown' - Skip Gram : 0.008911405
Any idea what is wrong here? Is my understanding about cosine similarity correct?
If you're training your own word2vec model, as you are here, it needs a large dataset of varied, in-context examples of word usage to create useful vectors. It's only the push-pull of trying to model tens of thousands of different words, in many subtly-varied usages, that moves the word-vectors to places where they reflect relative meanings.
That usefulness won't happen with a training corpus of just 15 words, or for words with few usage examples. (There's a good reason the default min_count is 5, and in general you should try to increase that value as your data becomes large enough to allow it, rather than decrease it.)

Generally, word2vec can't be well demonstrated or understood with toy-sized examples. Further, even to create word-vectors of the common dimensionalities of 100 to 400 dimensions, it's best to have training texts in the millions or billions of words. You need even more training words to support larger dimensions, like your vector_size=512 choice.

So some potential options for you are:

- if you want to train your own model, find a lot more training texts, use a smaller vector_size, and a larger min_count (see the first sketch below); or
- use someone else's pretrained set of word-vectors, which can be loaded into a Gensim KeyedVectors object (vectors without an associated training model), as in the second sketch below.
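For the first option, here's a minimal sketch of what training on a more realistic corpus might look like. It assumes Gensim's downloader can fetch the small "text8" Wikipedia sample (roughly 17 million words) on your machine, and the parameter values are only illustrative, not tuned:

import gensim.downloader as api
from gensim.models import Word2Vec

# "text8" is a small Wikipedia sample; api.load returns a restartable
# iterable of tokenized sentences that Word2Vec can consume directly.
corpus = api.load("text8")

# Smaller vector_size and the default min_count of 5, per the advice above.
model = Word2Vec(corpus, vector_size = 100, window = 5, min_count = 5, sg = 1)

print(model.wv.similarity('black', 'brown'))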
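For the second option, here's a sketch of loading pretrained vectors through Gensim's downloader; "glove-wiki-gigaword-100" is one commonly available set, but any pretrained vectors in a supported format can be loaded into a KeyedVectors object the same way:

import gensim.downloader as api

# Downloads (on first use) and loads 100-dimensional GloVe vectors as a
# KeyedVectors object; no training step is involved.
wv = api.load("glove-wiki-gigaword-100")

print(wv.similarity('black', 'brown'))

Either way, with vectors backed by enough training data, the similarity between 'black' and 'brown' should land well above the near-zero value you're seeing, though the exact number depends on the corpus and parameters.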