Different embedding checksums after encoding with SentenceTransformers?


I am calculating embeddings with the SentenceTransformers library. However, when I use the sum of the embedding values as a simple checksum, I get different results across runs for the same sentences. For instance:

In:


import random

import numpy as np
import tensorflow as tf
import torch
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

transformer_models = ['M-CLIP/M-BERT-Distil-40']

sentences = df['content'].tolist()


for transformer_model in tqdm(transformer_models, desc="Transformer Models"):
    tqdm.write(f"Processing with Transformer Model: {transformer_model}")
    model = SentenceTransformer(transformer_model)
    embeddings = model.encode(sentences)
    print(f"Embeddings Checksum for {transformer_model}:", np.sum(embeddings))

Out:

Embeddings Checksum for M-CLIP/M-BERT-Distil-40: 1105.9185

Or

Embeddings Checksum for M-CLIP/M-BERT-Distil-40: 1113.5422

I noticed this happens when I restart the Jupyter notebook, clear its output, and then re-run the whole notebook. Any idea how to fix this issue?
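As a quick check (the snippet below is only a sketch, reusing model, sentences, and np from the code above), two back-to-back encode calls can be compared to see whether the drift already appears within a single session or only across restarts:

emb_a = model.encode(sentences)
emb_b = model.encode(sentences)
print("Identical within one session:", np.allclose(emb_a, emb_b))
print("Checksums:", np.sum(emb_a), np.sum(emb_b))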

Alternatively, I tried setting the random seeds both before and after the embeddings calculation:

import torch
import numpy as np
import random
import tensorflow as tf
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm

RANDOM_SEED = 42

# Setting seeds
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

# Ensuring PyTorch determinism
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

transformer_models = ['M-CLIP/M-BERT-Distil-40']

sentences = df['content'].tolist()

for transformer_model in tqdm(transformer_models, desc="Transformer Models"):
    # Set the seed again right before loading the model
    np.random.seed(RANDOM_SEED)
    random.seed(RANDOM_SEED)
    tf.random.set_seed(RANDOM_SEED)
    torch.manual_seed(RANDOM_SEED)

    tqdm.write(f"Processing with Transformer Model: {transformer_model}")
    model = SentenceTransformer(transformer_model, device='cpu')  # Force CPU

    embeddings = model.encode(sentences, show_progress_bar=False)  # Disable the progress bar
    print(f"Embeddings Checksum for {transformer_model}:", np.sum(embeddings))

However, I am still getting the same inconsistent behavior.

UPDATE

What I tried now, and it seems to work, is storing all the calculated embeddings in files and reloading them afterwards. However, I still find it strange that the in-memory calculation gives different results across runs. Has anyone experienced this before?
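A minimal sketch of that workaround (the cache file name below is just an example), reusing model and sentences from the code above: compute the embeddings once, save them with np.save, and reload them on later runs so the checksum stays fixed.

import os
import numpy as np

cache_path = "embeddings_M-BERT-Distil-40.npy"  # example file name

if os.path.exists(cache_path):
    embeddings = np.load(cache_path)
else:
    embeddings = model.encode(sentences)  # model and sentences as defined above
    np.save(cache_path, embeddings)

print("Embeddings Checksum:", np.sum(embeddings))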

1 Answer

Answered by Soham Kanti Bera:

Please try using the .apply method instead of calling .encode on the whole list; for me, that worked in a similar application and resolved the reproducibility issue.
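A minimal sketch of that suggestion, assuming a DataFrame df with a 'content' column as in the question; encoding each row individually sidesteps any dependence on how sentences are grouped into batches and padded together:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('M-CLIP/M-BERT-Distil-40', device='cpu')

# Encode each row on its own instead of passing the whole list to .encode
per_row_embeddings = df['content'].apply(lambda text: model.encode(text))
embeddings = np.vstack(per_row_embeddings.tolist())
print("Embeddings Checksum:", np.sum(embeddings))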

This seems to be an ongoing issue that many people have run into. You can follow this issue for more information on embeddings that differ depending on batch size and other settings, and for possible solutions (different precision settings, specifying/ensuring consistent tokenization and padding of the input, etc.).
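For completeness, a hedged sketch of the kind of settings that discussion covers, reusing model from the sketch above and sentences from the question: pinning the batch size so padding does not depend on how sentences are grouped, and accumulating the checksum in float64 so the sum itself is less sensitive to floating-point rounding (the parameter values here are illustrative, not taken from the linked issue):

embeddings = model.encode(
    sentences,
    batch_size=1,             # each sentence is padded independently
    convert_to_numpy=True,    # return a single numpy array
    show_progress_bar=False,
)
print("Embeddings Checksum:", np.sum(embeddings, dtype=np.float64))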