I am currently building a T5 model (encoder-decoder architecture) from scratch for educational purposes. While working on this project, I've run into some confusion about the pre-training objective, specifically the denoising (span-corruption) objective. I would like to check my understanding and ask a few questions about the process.
Given the sentence:
Thank you for inviting me to your party last week.
Based on my understanding, during the pre-training phase with the denoising objective, the model works as follows (a code sketch illustrating this follows the list):
- Encoder input: `Thank you <X> me to your party <Y> week`
- Decoder input: `<X> for inviting <Y> last`
- Decoder labels (true labels): `for inviting <Y> last <Z>`
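To make this concrete, here is a minimal sketch of my interpretation, assuming whitespace "tokens" instead of real subword IDs:

```python
# A sketch of my interpretation of the three sequences, using
# whitespace "tokens" rather than real subword IDs. In my reading,
# the paper's output sequence "<X> for inviting <Y> last <Z>" is
# offset by one position to produce the decoder input and the
# labels (teacher forcing).

output_sequence = "<X> for inviting <Y> last <Z>".split()

encoder_input = "Thank you <X> me to your party <Y> week".split()
decoder_input = output_sequence[:-1]  # <X> for inviting <Y> last
labels        = output_sequence[1:]   # for inviting <Y> last <Z>

print("encoder input:", " ".join(encoder_input))
print("decoder input:", " ".join(decoder_input))
print("labels       :", " ".join(labels))
```

(As far as I understand, implementations such as Hugging Face's T5 instead keep the full output sequence as the labels and prepend the pad token as the decoder start token, which adds one extra training step where the model predicts `<X>` from the start token; otherwise the pairings are the same as mine. Part of what I want to confirm is whether my version above matches that.)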
Here are my questions:
- Is my interpretation of how the encoder input, decoder input, and decoder labels are constructed correct?
- In this setup, the model is expected to predict sentinel tokens (e.g., `<X>`, `<Y>`). Could this potentially confuse the model? For example, might it pick up the idea that the word "last" can come after the token `<Y>`? Or does the model naturally learn to interpret these situations correctly? (The alignment sketch after this list shows where this worry comes from.)
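To show where the worry in my second question comes from, here is the step-by-step alignment under teacher forcing, using the same whitespace tokens as above:

```python
# Pair each decoder-input token with the label predicted at that step.
decoder_input = "<X> for inviting <Y> last".split()
labels        = "for inviting <Y> last <Z>".split()

for step, (visible, label) in enumerate(zip(decoder_input, labels)):
    print(f"step {step}: last visible token {visible!r} -> label {label!r}")

# Step 3 prints: last visible token '<Y>' -> label 'last'.
# The model is literally trained to emit "last" right after the
# sentinel <Y> in the target sequence, which is what I find confusing.
```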
According to the paper:
> we process the sentence "Thank you for inviting me to your party last week." The words *for*, *inviting*, and *last* are randomly chosen for corruption. Each consecutive span of corrupted tokens is replaced by a sentinel token (shown as `<X>` and `<Y>`) that is unique over the example. Since *for* and *inviting* occur consecutively, they are replaced by a single sentinel `<X>`. The output sequence then consists of the dropped-out spans, delimited by the sentinel tokens used to replace them in the input, plus a final sentinel token `<Z>`.
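To check my reading of this paragraph, here is a sketch of the corruption step as I understand it. I use whitespace tokens and hand-picked corruption positions instead of random sampling; the `<extra_id_*>` names are the sentinel tokens T5's vocabulary reserves, standing in for the paper's `<X>`, `<Y>`, and `<Z>`:

```python
def span_corrupt(tokens, corrupted_positions):
    """Replace each consecutive run of corrupted tokens with a unique
    sentinel; the target consists of the dropped-out spans, each
    preceded by its sentinel, plus a final sentinel."""
    sentinels = (f"<extra_id_{i}>" for i in range(100))  # T5 reserves 100 sentinels
    inputs, target = [], []
    in_span = False
    for i, tok in enumerate(tokens):
        if i in corrupted_positions:
            if not in_span:              # a new corrupted span starts here
                s = next(sentinels)
                inputs.append(s)         # one sentinel replaces the whole span
                target.append(s)         # ...and delimits the span in the target
                in_span = True
            target.append(tok)           # the dropped-out token goes to the target
        else:
            inputs.append(tok)
            in_span = False
    target.append(next(sentinels))       # final sentinel (the paper's <Z>)
    return inputs, target

tokens = "Thank you for inviting me to your party last week .".split()
inputs, target = span_corrupt(tokens, corrupted_positions={2, 3, 8})
print(" ".join(inputs))  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(target))  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```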
