This is the approach the author has taken.
Training corpus: "The fat cat sat on the mat"
Input -> Label
--------------
"The" -> " fat"
"The fat" -> " cat"
"The fat cat" -> " sat"
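The table above can be reproduced with a few lines of Python. This is only a rough sketch: it splits on whitespace instead of using a real tokenizer, and `prefix_pairs` is a made-up helper name, not something from the author's code.

```python
def prefix_pairs(corpus):
    """Return (input prefix, next-word label) pairs for a whitespace-split corpus."""
    words = corpus.split()
    pairs = []
    for i in range(1, len(words)):
        prefix = " ".join(words[:i])   # everything seen so far
        label = " " + words[i]         # the next word, with its leading space
        pairs.append((prefix, label))
    return pairs


pairs = prefix_pairs("The fat cat sat on the mat")
for prefix, label in pairs:
    print(repr(prefix), "->", repr(label))
```

A corpus of 7 words yields 6 separate training examples this way, one per prefix.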
Hugging Face's Trainer class takes a different approach. The label is the same as the input, shifted left by 1 and padded with the <ignore> token (label id -100, the default ignore_index of PyTorch's cross-entropy loss).
Training corpus: "The fat cat sat on the mat"
Input (7 tokens): "The fat cat sat on the mat"
Output logits, shown as the most likely token at each position (7 tokens): "mat fat sat on fat mat and"
Shifted label (7 tokens): "fat cat sat on the mat <ignore>"
Cross entropy is then calculated between the output logits and the shifted labels. At least this is my understanding after reviewing the code.
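To make the shift-and-ignore idea concrete, here is a rough pure-Python sketch. It is not the Hugging Face code: the word-level vocabulary, the uniform dummy logits, and the helper names `shift_labels` and `cross_entropy` are all assumptions for illustration. The one value taken from the real libraries is -100, PyTorch's default `ignore_index` for cross entropy.

```python
import math

IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross entropy


def shift_labels(input_ids, ignore_index=IGNORE_INDEX):
    """Shift the inputs left by one and pad the last position with the ignore id."""
    return input_ids[1:] + [ignore_index]


def cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # the last position has no next token to predict
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))  # stable log-sum-exp
        total += log_z - row[label]
        count += 1
    return total / count


# Hypothetical word-level vocabulary, for illustration only.
vocab = {"The": 0, "fat": 1, "cat": 2, "sat": 3, "on": 4, "the": 5, "mat": 6}
input_ids = [vocab[w] for w in "The fat cat sat on the mat".split()]
labels = shift_labels(input_ids)

# Dummy logits: one row per input position, uniform over the vocabulary.
logits = [[0.0] * len(vocab) for _ in input_ids]
loss = cross_entropy(logits, labels)
print(labels, loss)
```

The whole sequence is handled in one pass: 7 positions go in, 6 of them contribute to the loss, and the padded last position is skipped.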
asimovDev•4mo ago
part 1