Why do Tokenizer and TokenizerFast give different results when encoding the same sentence?


When I encode text with the tokenizer and set do_basic_tokenize=False, Tokenizer and TokenizerFast give two different results. When I set do_basic_tokenize=True, the results are the same.

The text is: 'LUXURY HOTEL EXPANSION CONTINUES -- The Ritz-Carlton Hotel Company has planted another flag in Indonesia , as major North American luxury hotels keep expanding throughout Asia .'

Can you help me? I want to find out why the results are different.

There is 1 answer

ewz93

According to the answer to this post, this behavior is unintended and might depend on the specific version of the library that you use.

On the Hugging Face website you can also find some discussion about this (or about the same issue with different models), specifically here and here.

The latter seems to imply that the difference does indeed result from slightly different tokenization, which would make sense since you said the difference only occurs with one of the do_basic_tokenize settings.
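To make the "slightly different tokenization" idea concrete, here is a minimal pure-Python sketch. It is not the library's code: the vocabulary, `wordpiece`, `basic_split`, and `encode` are all hypothetical stand-ins. It assumes that with do_basic_tokenize=False the slow tokenizer passes whitespace-separated words straight to WordPiece (no lowercasing, no punctuation splitting). If the fast tokenizer still splits punctuation and lowercases regardless of the flag (an assumption on my part, though it would be consistent with the linked discussions), you would see exactly this kind of mismatch.

```python
import re

# Hypothetical toy vocabulary, for illustration only (not BERT's real vocab).
VOCAB = {"[UNK]", "ritz", "carlton", "-", "hotel", "##s", ".", "asia"}


def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match-first WordPiece on a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes unknown.
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def basic_split(text):
    """Stand-in for basic tokenization: lowercase and split off punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def encode(text, do_basic_tokenize):
    # Without basic tokenization, only whitespace separates the words.
    words = basic_split(text) if do_basic_tokenize else text.split()
    return [piece for word in words for piece in wordpiece(word)]


text = "Ritz-Carlton hotels ."
print(encode(text, do_basic_tokenize=True))
# ['ritz', '-', 'carlton', 'hotel', '##s', '.']
print(encode(text, do_basic_tokenize=False))
# ['[UNK]', 'hotel', '##s', '.']
```

In the toy run, "Ritz-Carlton" reaches WordPiece as one cased word when basic tokenization is skipped, so it falls back to [UNK]. To check this against your own setup, compare `tokenizer.tokenize(text)` from both tokenizers piece by piece and look at where the outputs first diverge.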