I am calculating some embeddings with the SentenceTransformers library. However, when I use the sum of the embedding values as a checksum, I get different results across runs of the same encoding code. For instance:
In:
import random

import numpy as np
import tensorflow as tf
import torch
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
transformer_models = [
'M-CLIP/M-BERT-Distil-40',
]
sentences = df['content'].tolist()  # df is a pandas DataFrame with a 'content' column
for transformer_model in tqdm(transformer_models, desc="Transformer Models"):
    tqdm.write(f"Processing with Transformer Model: {transformer_model}")
    model = SentenceTransformer(transformer_model)
    embeddings = model.encode(sentences)
    print(f"Embeddings Checksum for {transformer_model}:", np.sum(embeddings))
Out:
Embeddings Checksum for M-CLIP/M-BERT-Distil-40: 1105.9185
Or
Embeddings Checksum for M-CLIP/M-BERT-Distil-40: 1113.5422
I noticed this happens when I restart the Jupyter notebook, clear its output, and re-run the full notebook. Any idea how to fix this issue?
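A quick sanity check is to encode the same list twice in one session and compare the arrays directly, to see whether the values drift within a run or only across notebook restarts (this sketch assumes the model and sentences variables from the snippet above):

emb_a = model.encode(sentences)
emb_b = model.encode(sentences)
print("Identical within one session:", np.array_equal(emb_a, emb_b))
print("Max element-wise difference:", np.abs(emb_a - emb_b).max())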
As an alternative, I tried setting the random seeds both globally and again right before the embeddings calculation:
import torch
import numpy as np
import random
import tensorflow as tf
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm
RANDOM_SEED = 42
# Setting seeds
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
# Ensuring PyTorch determinism
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
transformer_models = ['M-CLIP/M-BERT-Distil-40']
sentences = df['content'].tolist()
for transformer_model in tqdm(transformer_models, desc="Transformer Models"):
    # Set the seeds again right before loading the model
    np.random.seed(RANDOM_SEED)
    random.seed(RANDOM_SEED)
    tf.random.set_seed(RANDOM_SEED)
    torch.manual_seed(RANDOM_SEED)
    tqdm.write(f"Processing with Transformer Model: {transformer_model}")
    model = SentenceTransformer(transformer_model, device='cpu')  # Force CPU
    embeddings = model.encode(sentences, show_progress_bar=False)  # Disable the progress bar
    print(f"Embeddings Checksum for {transformer_model}:", np.sum(embeddings))
However, I am still getting the same inconsistent behavior.
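For completeness, here is a sketch of additional determinism settings that could be pinned down on top of the seeds above; whether they actually change anything in this setup is an assumption on my part:

import os
import torch

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # disable parallel tokenizer workers
torch.use_deterministic_algorithms(True)        # raise an error if a non-deterministic op is used
torch.set_num_threads(1)                        # single-threaded CPU math for stable reductions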
UPDATE
What I tried now, and what seems to work, is storing all the calculated embeddings in files. Still, I find it weird that I was getting different results in the first place. Has anyone experienced this before?
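Roughly, the caching approach looks like this (the file name embeddings.npy is just a placeholder, and model and sentences are the same as above):

import numpy as np
from pathlib import Path

cache_path = Path("embeddings.npy")
if cache_path.exists():
    embeddings = np.load(cache_path)      # reuse the saved embeddings on later runs
else:
    embeddings = model.encode(sentences)  # compute once
    np.save(cache_path, embeddings)       # persist so all downstream steps see the same values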
Please try using the .apply method instead of .encode; for me, that worked in a similar application and helped resolve the reproducibility issue. It seems to be an ongoing issue faced by many people. You can follow this issue for more information on how embeddings can differ depending on batch sizes and other settings, and for possible solutions (different precision settings, specifying/ensuring consistent tokenization and padding of the input, etc.).
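As a rough sketch of what pinning those settings can look like (the batch size and the float32 cast here are assumptions, not a verified fix): a fixed batch size keeps the per-batch padding consistent across runs, and float32 avoids precision-related drift between environments.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('M-CLIP/M-BERT-Distil-40', device='cpu').float()  # force float32 on CPU
embeddings = model.encode(
    sentences,              # the same list of sentences as in the question
    batch_size=32,          # fixed batch size -> consistent padding within each batch
    convert_to_numpy=True,  # return a plain NumPy array
)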