How to quantize a sentence-transformer model on the CPU so it can be used on the GPU?


I wanted to use the 'Salesforce/SFR-Embedding-Mistral' embedding model, but it is too large for the GPU partition I have access to. Therefore, I considered quantizing the model, but I couldn't find a pre-quantized version available.

When I attempted to quantize it with bitsandbytes, it tried to load the entire full-precision model onto the GPU first, which resulted in the same out-of-memory error:

import torch
from transformers import AutoModel, BitsAndBytesConfig

model = AutoModel.from_pretrained(
    'Salesforce/SFR-Embedding-Mistral',
    trust_remote_code=True,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )
)

Then, I tried to load the model onto the CPU first and then quantize it before moving the quantized model to the GPU:

model.to('cpu')
if torch.cuda.is_available():
    model.to('cuda')
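
For completeness, a minimal sketch of what this second attempt looked like end to end (passing device_map={'': 'cpu'} is my assumption for forcing the initial load onto the CPU; the quantization config is the same as above):

import torch
from transformers import AutoModel, BitsAndBytesConfig

# Assumption: place every module on the CPU for the initial, quantized load.
model = AutoModel.from_pretrained(
    'Salesforce/SFR-Embedding-Mistral',
    trust_remote_code=True,
    device_map={'': 'cpu'},
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )
)

# Moving the already-quantized model to the GPU is the step that fails.
if torch.cuda.is_available():
    model.to('cuda')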

However, bitsandbytes does not support changing devices for quantized models:

ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and cast to the correct `dtype`.

The solutions I found, such as this GitHub issue and this blog post, were either unhelpful or outdated. How can I quantize this model on the CPU and then run the quantized model on the GPU?
