Cosine similarity between words using a BERT model

I have a simple script where I want to check the similarity between the words "Cat" and "Dog":

from transformers import BertModel, BertTokenizer
import torch
from scipy.spatial.distance import cosine

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-multilingual-cased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

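# Tokenize each word (the tokenizer also adds the [CLS] and [SEP] special tokens)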
tokens_cat = tokenizer("Cat", return_tensors="pt")
tokens_dog = tokenizer("Dog", return_tensors="pt")

# Get BERT embeddings by mean-pooling the last hidden state over all token positions
with torch.no_grad():
    embeddings_cat = model(**tokens_cat).last_hidden_state.mean(dim=1).squeeze().numpy()
    embeddings_dog = model(**tokens_dog).last_hidden_state.mean(dim=1).squeeze().numpy()

# Calculate cosine similarity (scipy's cosine() is the cosine distance, i.e. 1 - similarity)
cosine_similarity = 1 - cosine(embeddings_cat, embeddings_dog)

print(f"Cosine Similarity: {cosine_similarity}")

The code above returns 0.8319976329803467, which seems oddly high to me, because these words are not similar. Could you please tell me what I'm doing wrong?
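
To get a feel for the numbers, here is a small extension of the same script that compares a few extra pairs with the same mean-pooling (the word list is just an arbitrary choice of mine; it reuses the tokenizer, model, and cosine from the script above):

from itertools import combinations

def embed(word):
    # Same pooling as in the script above: mean over all token positions
    tokens = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        return model(**tokens).last_hidden_state.mean(dim=1).squeeze().numpy()

words = ["Cat", "Dog", "Car", "Philosophy"]  # arbitrary test words
for w1, w2 in combinations(words, 2):
    print(w1, w2, 1 - cosine(embed(w1), embed(w2)))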

I tried asking ChatGPT, and it keeps telling me that the code is correct.
