Updating old code with new torch vocab methods (stoi and itos methods changed)


I am trying to create a Japanese-English translation model following this Medium article: https://arusl.medium.com/japanese-english-language-translation-with-transformer-using-pytorch-243738146806 Everything runs perfectly until the second-to-last cell, where I get an error running the translate function. The error occurs specifically on this line:

tokens = [BOS_IDX] + [src_vocab.stoi[tok] for tok in src_tokenizer.encode(src, out_type=str)] + [EOS_IDX]

The error: AttributeError: 'Vocab' object has no attribute 'stoi'. Since the article was written, the stoi attribute has been replaced by the method get_stoi() → Dict[str, int], according to the torchtext documentation (https://pytorch.org/text/stable/vocab.html). When I attempt to change the line to the following, however, I get the error "'Counter' object has no attribute 'get_stoi'":

tokens = [BOS_IDX] + [src_vocab.get_stoi()[tok] for tok in src_tokenizer.encode(src, out_type=str)] + [EOS_IDX]

The same goes for the itos attribute and the get_itos() method. Any help making this work would be greatly appreciated, as I'm very dumbfounded at the moment.

A similar question was asked here ('Vocab' object has no attribute 'itos'), but I don't see how to implement the answer or make it work in this case.

Edit: This function seems suspect, as it builds the vocab directly from a Counter... is there a better way to do this? (I've sketched an attempt with the newer API below the function.)

from collections import Counter
from torchtext.vocab import Vocab

def build_vocab(sentences, tokenizer):
  counter = Counter()
  for sentence in sentences:
    # count every subword token produced by the SentencePiece tokenizer
    counter.update(tokenizer.encode(sentence, out_type=str))
  return Vocab(counter)  # old-style constructor used by the article
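
For reference, here is my attempt at rewriting this helper with the newer torchtext API (build_vocab_from_iterator). The special-token names are my assumption based on what the article appears to use, so treat this as an untested sketch:

from torchtext.vocab import build_vocab_from_iterator

def build_vocab_new(sentences, tokenizer):
  # yield one list of subword tokens per sentence
  def yield_tokens():
    for sentence in sentences:
      yield tokenizer.encode(sentence, out_type=str)
  vocab = build_vocab_from_iterator(yield_tokens(),
                                    specials=['<unk>', '<pad>', '<bos>', '<eos>'])
  vocab.set_default_index(vocab['<unk>'])  # map unseen tokens to <unk>
  return vocab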

Thank you!


There are 2 answers

Azury

Hmm... After reading the Medium article, I think you're encountering this error due to changes in the torchtext library's methods for vocabulary handling. The 'Vocab' object no longer has the 'stoi' attribute; it has been replaced with 'get_stoi()'. Similarly, 'itos' is now 'get_itos()'.

To fix the error, update your code:

Original code:

tokens = [BOS_IDX] + [src_vocab.stoi[tok] for tok in src_tokenizer.encode(src, out_type=str)] + [EOS_IDX]

Updated code:

tokens = [BOS_IDX] + [src_vocab.get_stoi()[tok] for tok in src_tokenizer.encode(src, out_type=str)] + [EOS_IDX]

Make the same change for 'itos' to 'get_itos()', if applicable.
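
For the decoding direction, here is a minimal sketch of the itos replacement (tgt_vocab and pred_indices are illustrative names, not from the article; get_itos() returns a list mapping index to token):

itos = tgt_vocab.get_itos()  # position i holds the token for index i
decoded_tokens = [itos[idx] for idx in pred_indices]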

dust as

Update the code to index the vocab directly:

tokens = [BOS_IDX] + [src_vocab[tok] for tok in src_tokenizer.encode(src, out_type=str)] + [EOS_IDX]
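
Direct indexing works because the new-style Vocab supports string-to-index lookup via __getitem__. A minimal sketch, assuming the vocab was built with the new torchtext API and that '<unk>' is among its specials (an assumption, not confirmed by the article):

# make out-of-vocabulary tokens map to <unk> instead of raising a RuntimeError
src_vocab.set_default_index(src_vocab['<unk>'])
tokens = [BOS_IDX] + [src_vocab[tok] for tok in src_tokenizer.encode(src, out_type=str)] + [EOS_IDX]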