I am trying to fine-tune a Sentence Transformers embedding model on NFCorpus. The dataset includes a qrels folder containing train.tsv, test.tsv and dev.tsv. Each file maps a query ID to one or more passage IDs.
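
For reference, here is a minimal sketch of reading one of these qrels files into a query-to-passages mapping. It assumes a header row followed by three tab-separated columns (query-id, corpus-id, score), which is an assumption about the file layout. The name `load_qrels` is my own, and it takes any file-like object so it can be tried on an in-memory string.

```python
import csv
import io
from collections import defaultdict


def load_qrels(handle):
    """Read a qrels TSV into {query_id: {passage_id: relevance}}.

    Assumes a header row followed by three tab-separated columns:
    query-id, corpus-id, score (an assumption about the file layout).
    """
    qrels = defaultdict(dict)
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or malformed lines
        query_id, passage_id, score = row[0], row[1], int(row[2])
        qrels[query_id][passage_id] = score
    return dict(qrels)


# Made-up IDs, only to show the one-query-to-many-passages shape.
sample = "query-id\tcorpus-id\tscore\nQ1\tD10\t2\nQ1\tD11\t1\nQ2\tD10\t1\n"
example_qrels = load_qrels(io.StringIO(sample))
```

With a real file this would be something like `load_qrels(open("qrels/train.tsv", newline=""))`.
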
I was following the method described on the Sentence Transformers website (GitHub link). In that example, a Cross-Encoder assigns a score to each (query, passage) pair. I tried to replicate this with a one-to-one query-to-sentence mapping.
My plan is to use this augmented dataset for training, with CosineSimilarityLoss as the loss and EmbeddingSimilarityEvaluator for evaluation. I understand that this means the training data itself treats a Sentence Transformers model's output as the ground truth.
I coded it this way because my assumption is that you cannot map a query directly to a whole passage; you need a one-to-one mapping between a query and a sentence in the corpus. At least, that is what I have understood from looking at the various loss and evaluation classes provided by Sentence Transformers.
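
To make the plan concrete, here is a rough sketch of how I imagine the training step. The helper names `build_pairs` and `train` are hypothetical, the input shape ({query: {score: sentence}}) is an assumption based on what my retrieval code returns, and it uses the `model.fit` training API with `CosineSimilarityLoss` and `EmbeddingSimilarityEvaluator`. It is not tested end to end.

```python
def build_pairs(scored):
    """Flatten {query: {score: sentence}} into (query, sentence, score) triples."""
    return [
        (query, sentence, float(score))
        for query, hits in scored.items()
        for score, sentence in hits.items()
    ]


def train(model, train_pairs, dev_pairs, epochs=1, batch_size=16):
    """Fine-tune a bi-encoder on (query, sentence, score) triples."""
    # Imported here so build_pairs can be used without the heavy dependencies.
    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample, losses
    from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

    examples = [InputExample(texts=[q, s], label=score) for q, s, score in train_pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=batch_size)
    # CosineSimilarityLoss regresses cos(u, v) onto the label, so labels
    # should lie in a range cosine similarity can reach (here 0..1).
    loss = losses.CosineSimilarityLoss(model)
    evaluator = EmbeddingSimilarityEvaluator(
        [q for q, _, _ in dev_pairs],
        [s for _, s, _ in dev_pairs],
        [score for _, _, score in dev_pairs],
    )
    model.fit(train_objectives=[(loader, loss)], evaluator=evaluator, epochs=epochs)
    return model


# Made-up scores and sentences, shaped like the output of my retrieval step.
demo = {
    "what causes anemia": {
        0.91: "Iron deficiency is a common cause.",
        0.12: "Unrelated text.",
    }
}
demo_pairs = build_pairs(demo)
```

Here the cross-encoder scores, squashed to 0..1 by the sigmoid, serve as the similarity labels that the bi-encoder is trained to reproduce.
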
My question is: does this really make sense? I am essentially using a Bi-Encoder and a Cross-Encoder together to generate the labels for training a Bi-Encoder.
This is the code that I have written.
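import faiss
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder


class Vectorstore:
    def __init__(self, model, documents):
        # Index the bi-encoder embeddings of the corpus. The vectors are
        # L2-normalised, so L2 distance ranks them the same way cosine does.
        self.index = faiss.IndexFlatL2(model.get_sentence_embedding_dimension())
        self.embeddings = np.asarray(model.encode(documents), dtype=np.float32)
        faiss.normalize_L2(self.embeddings)
        self.index.add(self.embeddings)
        # Cross-encoder used to re-score the retrieved candidates.
        self.ce = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', device="cuda")

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def vectorStore(self, model, documents, query):
        # Retrieve the 10 nearest documents with the bi-encoder.
        search_vector = model.encode(query)
        _vector = np.array([search_vector], dtype=np.float32)
        faiss.normalize_L2(_vector)
        distances, ann = self.index.search(_vector, k=10)
        # FAISS pads the result with -1 when the index holds fewer than k vectors.
        candidates = [documents[j] for j in ann[0] if j != -1]
        if not candidates:
            return {}
        # Score all (query, candidate) pairs with the cross-encoder in one batch.
        scores = self.sigmoid(self.ce.predict([(query, doc) for doc in candidates]))
        # Keep the two highest-scoring candidates as {score: document}.
        ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)[:2]
        return {float(score): doc for score, doc in ranked}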
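
As a sanity check on using IndexFlatL2 with normalised vectors, here is a small NumPy-only sketch. IndexFlatL2 reports squared L2 distances, and for unit-length vectors the squared L2 distance equals 2 - 2 * cosine similarity, so the nearest neighbours by L2 are also the most similar by cosine. The random vectors are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8).astype(np.float32)
b = rng.normal(size=8).astype(np.float32)

# Normalise to unit length, which is what faiss.normalize_L2 does row by row.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

squared_l2 = float(np.sum((a - b) ** 2))
cosine = float(np.dot(a, b))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = bool(np.isclose(squared_l2, 2.0 - 2.0 * cosine, atol=1e-5))
```

So a smaller distance from the index corresponds to a higher cosine similarity between the query and the document.
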