I'm building a book recommendation system and want to use the KNN algorithm for collaborative filtering. I believe I understand how KNN works, and I want to use the item-based approach, which requires computing the similarity between item vectors. However, the similarity computed by the library differs from the one I calculated by hand, and I'm not sure why. Can you help me out?
import pandas as pd
from surprise import Dataset, Reader, KNNWithMeans
# Build the DataFrame
ratings_dict = {
"item": [1, 2, 1, 2, 1, 2, 1, 2, 1],
"user": ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
"rating": [1, 2, 2, 4, 2.5, 4, 4.5, 5, 3],
}
df = pd.DataFrame(ratings_dict)
# Convert to the dataset format used by the Surprise library
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)
# Compute the similarity matrix (item-based)
sim_options = {'name': 'cosine', 'user_based': False}
algo = KNNWithMeans(sim_options=sim_options)
trainingSet = data.build_full_trainset()
algo.fit(trainingSet)
similarity_matrix = algo.compute_similarities()
print(similarity_matrix)
This code prints:
[[1.         0.96954671]
 [0.96954671 1.        ]]
The user-item rating matrix is:
item 1 2
user
A 1.0 2.0
B 2.0 4.0
C 2.5 4.0
D 4.5 5.0
E 3.0 NaN
But when I compute the cosine similarity myself, with the missing rating set to 0:
import numpy as np
# Define the two item vectors (user E's missing rating for item 2 set to 0)
vector1 = np.array([1, 2, 2.5, 4.5, 3])
vector2 = np.array([2, 4, 4, 5, 0])
# Compute the cosine similarity
cosine_sim_1 = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
print(cosine_sim_1)
This code prints:
0.8550598237348973
I think the Surprise library filled the NaN value with something other than 0. I expected 0, but it seems another value was used instead.
I tried ChatGPT, but it couldn't help me solve the issue.
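The first part of your code (the Surprise part) does not fill in the NaN at all. It computes the cosine similarity of 4D vectors, using only users A to D, who rated both items. User E's values are omitted because one of them is NaN.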
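To check this by hand, here is a minimal NumPy sketch that restricts both item vectors to the users who rated both items before taking the cosine. The helper name `cosine_on_common` is made up for this illustration and is not part of Surprise. It only mirrors the "common users only" behaviour described above, not the library's actual implementation.

```python
import numpy as np

# Ratings from the question; np.nan marks the rating user E never gave to item 2.
item1 = np.array([1, 2, 2.5, 4.5, 3])
item2 = np.array([2, 4, 4, 5, np.nan])


def cosine_on_common(a, b):
    """Cosine similarity over the positions where both vectors have a rating.

    Hypothetical helper for illustration only (not a Surprise function).
    """
    mask = ~np.isnan(a) & ~np.isnan(b)  # keep users who rated both items
    a, b = a[mask], b[mask]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Only users A-D are used, since user E has no rating for item 2.
sim_common = cosine_on_common(item1, item2)

# Filling the missing rating with 0 keeps all five users and changes the result.
sim_zero_filled = cosine_on_common(item1, np.nan_to_num(item2, nan=0.0))

print(sim_common)
print(sim_zero_filled)
```

The first value should agree with the 0.9695 that Surprise reports, and the second with the 0.8551 from the zero-filled calculation in the question. The gap comes entirely from whether user E is included.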