How does the model.resize_token_embeddings() function initialize the embeddings for tokens newly added to the tokenizer?


I am new to Natural Language Processing and am currently working on machine translation with the ALMA-7B model from Hugging Face. I want to extend its tokenizer with the tokens from my Word2Vec vocabulary, for which I also have the corresponding embeddings (weights). I am adding the tokens to the tokenizer with the following code:

alma_tokenizer.add_tokens(word_chunks)

Here alma_tokenizer is the tokenizer for the ALMA-7B model and word_chunks is a list of words I want to add. I also want to update the model with the corresponding word embeddings, and I was advised to use the resize_token_embeddings() function of AutoModelForCausalLM. When I called it, it did create new embeddings for the tokens I had added, which I confirmed. But how are these embeddings created? Are they initialized randomly (they are not a tensor of zeros)? Can I insert my own embeddings instead of the ones it creates?

Any kind of help will be appreciated!

embeddings=model.resize_token_embeddings(len(tokenizer))

1 Answer

Answer by i4never:

The relevant implementation is transformers.modeling_utils.PreTrainedModel.resize_token_embeddings (https://github.com/huggingface/transformers/blob/38611086d293ea4a5809bcd7fadd8081d55cb74e/src/transformers/modeling_utils.py#L1855C14-L1855C27).

_get_resized_embeddings is eventually called, and Model._init_weights is used to initialize the new embedding matrix. Then new_embeddings.weight.data[:n, :] = old_embeddings.weight.data[:n, :] makes sure the embeddings of the old tokens remain unchanged.
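
You can check this behaviour yourself with a quick before/after comparison (a minimal sketch; the checkpoint name haoranxu/ALMA-7B and the placeholder tokens are assumptions, adjust them to your setup):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("haoranxu/ALMA-7B")
tokenizer = AutoTokenizer.from_pretrained("haoranxu/ALMA-7B")

old_weights = model.get_input_embeddings().weight.data.clone()
n_old = old_weights.shape[0]

tokenizer.add_tokens(["<new_tok_1>", "<new_tok_2>"])
model.resize_token_embeddings(len(tokenizer))

new_weights = model.get_input_embeddings().weight.data
# The rows for the original vocabulary are copied over unchanged ...
assert torch.equal(new_weights[:n_old], old_weights)
# ... while the appended rows come from _init_weights, i.e. random values, not zeros.
print(new_weights[n_old:])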

As far as I know, ALMA shares the same architecture as Llama. Below is the _init_weights function from transformers.models.llama.modeling_llama:

    def _init_weights(self, module):
        std = self.config.initializer_range
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()

So for ALMA, the embeddings of new tokens are initialized from a normal distribution with mean 0 and standard deviation equal to initializer_range (which is defined in the model config).
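
You can read that value from the config and reproduce the draw yourself (a small sketch; the checkpoint name is again an assumption):

import torch
from transformers import AutoConfig

config = AutoConfig.from_pretrained("haoranxu/ALMA-7B")
print(config.initializer_range)  # typically 0.02 for Llama-family models

# New embedding rows are effectively drawn like this before the old rows are copied back in:
fresh_rows = torch.empty(1000, config.hidden_size).normal_(mean=0.0, std=config.initializer_range)
print(fresh_rows.mean().item(), fresh_rows.std().item())  # roughly 0.0 and initializer_range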

Of course you can insert your own embeddings.

Method 1: override model._init_weights

import torch.nn as nn

def _my_init_weights(self, module):
    std = self.config.initializer_range
    if isinstance(module, nn.Linear):
        module.weight.data.normal_(mean=0.0, std=std)
        if module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, nn.Embedding):
        # replace the following line with your own embedding initialization
        module.weight.data.normal_(mean=0.0, std=std)
        if module.padding_idx is not None:
            module.weight.data[module.padding_idx].zero_()
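
Since resize_token_embeddings ends up calling self._init_weights, the model has to pick up your version before you resize. One way to do that (a sketch, assuming instance-level binding is enough for your transformers version) is:

import types

# Bind the custom initializer to this model instance; instance attributes shadow the
# class method, so resize_token_embeddings will call it when initializing the new rows.
alma_model._init_weights = types.MethodType(_my_init_weights, alma_model)
alma_model.resize_token_embeddings(len(alma_tokenizer))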

Method 2: do it manually

my_embedding = nn.Embedding(...)  # sized (len(tokenizer), hidden_size)
# do your initialization here, e.g. copy in your Word2Vec vectors
alma_model.model.embed_tokens = my_embedding

If you do it manually, don't forget to resize lm_head as well, and you may need to update parameters in model.config (e.g. vocab_size) too.
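
For example (a rough sketch, assuming the Llama-style attributes model.embed_tokens / lm_head and a hypothetical my_vectors tensor of shape (num_new_tokens, hidden_size) holding your Word2Vec weights):

import torch.nn as nn

new_vocab_size = len(alma_tokenizer)          # old vocab + added tokens
hidden_size = alma_model.config.hidden_size
old_embed = alma_model.model.embed_tokens
old_head = alma_model.lm_head
n_old = old_embed.weight.shape[0]

# New input embeddings: keep the old rows, fill the appended rows with your own vectors.
new_embed = nn.Embedding(new_vocab_size, hidden_size)
new_embed = new_embed.to(device=old_embed.weight.device, dtype=old_embed.weight.dtype)
new_embed.weight.data[:n_old] = old_embed.weight.data
new_embed.weight.data[n_old:] = my_vectors
alma_model.model.embed_tokens = new_embed

# lm_head has to match the new vocab size as well; here the appended rows are left at zero.
new_head = nn.Linear(hidden_size, new_vocab_size, bias=False)
new_head = new_head.to(device=old_head.weight.device, dtype=old_head.weight.dtype)
new_head.weight.data.zero_()
new_head.weight.data[:n_old] = old_head.weight.data
alma_model.lm_head = new_head

# Keep the config consistent with the new shapes.
alma_model.config.vocab_size = new_vocab_size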