"Why are the cosine similarities calculated by the library and by myself different?"

Question

131 views · Asked by ju so on 08 March 2023 at 08:39

I'm building a book recommendation system and want to use the KNN algorithm for collaborative filtering. I believe I understand how KNN works, and I want to use the item-based approach, which requires calculating the similarity between item vectors. However, the similarity calculated by the library differs from the one I calculated myself, and I'm not sure why. Can you help me out?

import pandas as pd
from surprise import Dataset, Reader, KNNWithMeans

# Create the DataFrame
ratings_dict = {
    "item": [1, 2, 1, 2, 1, 2, 1, 2, 1],
    "user": ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
    "rating": [1, 2, 2, 4, 2.5, 4, 4.5, 5, 3],
}
df = pd.DataFrame(ratings_dict)


# Convert to the dataset format used by the Surprise library
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)

# Compute the similarity matrix (item-based)
sim_options = {'name': 'cosine', 'user_based': False}
algo = KNNWithMeans(sim_options=sim_options)
trainingSet = data.build_full_trainset()
algo.fit(trainingSet)

similarity_matrix = algo.compute_similarities()
print(similarity_matrix)

This code prints:

[[1.         0.96954671]
 [0.96954671 1.        ]]

item    1    2
user          
A     1.0  2.0
B     2.0  4.0
C     2.5  4.0
D     4.5  5.0
E     3.0  NaN

But when I compute the cosine similarity myself, treating the missing rating as 0:

import numpy as np

# Define the two vectors
vector1 = np.array([1, 2, 2.5, 4.5, 3])
vector2 = np.array([2, 4, 4, 5, 0])


# Compute the cosine similarity
cosine_sim_1 = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))


print(cosine_sim_1)

This code prints:

0.8550598237348973

I think the Surprise library filled the NaN value with something other than 0. I expected it to use 0, but it seems another value was used instead.

I tried ChatGPT, but it couldn't help me solve the issue.

There is 1 answer

**Kilian** · Accepted Answer · 2023-03-08T08:58:29+00:00

import numpy as np

# Only the four users (A-D) who rated both items
vector1 = np.array([1, 2, 2.5, 4.5])
vector2 = np.array([2, 4, 4, 5])

# Compute the cosine similarity
cosine_sim_1 = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
print(cosine_sim_1)

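The first part of your code (the Surprise call) just calculates the cosine similarity of the 4D vectors, omitting the last values, one of which is NaN. Surprise does not fill the missing rating with anything: it only takes into account the users who rated both items, so user E is left out entirely. The snippet above reproduces the library's 0.96954671.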
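To make the difference concrete, here is a minimal sketch that computes both numbers from the question's ratings using only pandas and NumPy. It is not Surprise's internal implementation; the names `cosine`, `matrix`, `common`, and `filled` are illustrative, and the "common users only" rule is applied here simply by dropping rows that contain a NaN.

```python
import numpy as np
import pandas as pd

# Ratings from the question
ratings = pd.DataFrame({
    "item": [1, 2, 1, 2, 1, 2, 1, 2, 1],
    "user": ["A", "A", "B", "B", "C", "C", "D", "D", "E"],
    "rating": [1, 2, 2, 4, 2.5, 4, 4.5, 5, 3],
})


def cosine(u, v):
    """Plain cosine similarity between two 1-D arrays (illustrative helper)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# user x item table; user E has NaN for item 2
matrix = ratings.pivot(index="user", columns="item", values="rating")

# Variant 1: keep only users who rated both items (rows without NaN)
common = matrix.dropna()
sim_common = cosine(common[1].to_numpy(), common[2].to_numpy())

# Variant 2: fill the missing rating with 0, as in the question
filled = matrix.fillna(0)
sim_zero_filled = cosine(filled[1].to_numpy(), filled[2].to_numpy())

print(sim_common)       # approximately 0.9695, the value Surprise reported
print(sim_zero_filled)  # approximately 0.8551, the value from the question
```

Filling with 0 treats "not rated" as a very low rating, which lowers the similarity; restricting to co-rated users avoids that assumption.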
