fp.

    Training corpus: "The fat cat sat on the mat"

    Input -> Label
    --------------
    "The" -> " fat"
    "The fat" -> " cat"
    "The fat cat" -> " sat"

Hugging Face's Trainer class takes a different approach. The label is the same as the input, shifted left by 1 and padded with the <ignore> token (-100).

    Training corpus: "The fat cat sat on the mat"
    Input (7 tokens): "The fat cat sat on the mat"
    Output logit (7 tokens): "mat fat sat on fat mat and"
    Shifted label (7 tokens): "fat cat sat on the mat <ignore>"

Cross entropy is then calculated between the output logits and the shifted labels. At least this is my understanding after reviewing the code.
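
A rough sketch of that shift-and-mask step in plain Python, to make the bookkeeping concrete. It assumes whitespace tokenization, a toy vocabulary built from the sentence itself, and an ignore index of -100 (PyTorch's default `ignore_index`). The helpers `shift_labels` and `masked_cross_entropy` are made up for illustration and are not the library's actual code; uniform logits stand in for a real model's output.

```python
import math

IGNORE_INDEX = -100  # assumption: PyTorch's default ignore_index for cross entropy


def shift_labels(input_ids, ignore_index=IGNORE_INDEX):
    """Labels are the inputs shifted left by one, padded with ignore_index."""
    return input_ids[1:] + [ignore_index]


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # the padded last position contributes nothing
        # Numerically stable log-sum-exp of the row
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count


tokens = "The fat cat sat on the mat".split()
# Toy vocabulary: one id per distinct token, in order of first appearance
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
input_ids = [vocab[t] for t in tokens]
labels = shift_labels(input_ids)

# Placeholder "model output": one row of uniform logits per input position
logits = [[0.0] * len(vocab) for _ in input_ids]
loss = masked_cross_entropy(logits, labels)
```

With uniform logits the loss is just the log of the vocabulary size, averaged over the six scored positions; the seventh position is skipped because its label is the ignore index.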

blackbear_•4mo ago

The two ways are equivalent (it's always next token prediction) but the latter is way more efficient as it computes the loss for N tokens in a single forward pass.
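
A small sketch of why the two framings produce the same training signal, assuming a causal (left-to-right) model where position i only sees tokens up to i. The function names `prefix_pairs` and `shifted_pairs` are hypothetical, and `None` stands in for the <ignore> label.

```python
def prefix_pairs(tokens):
    """First framing: one training example per prefix, (tokens[:i], tokens[i])."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def shifted_pairs(tokens):
    """Second framing: one sequence, labels shifted left by one.

    Under causal masking, position i sees tokens[:i + 1] and is scored
    against tokens[i + 1]; the last position has no target and is skipped.
    """
    labels = tokens[1:] + [None]  # None plays the role of <ignore>
    return [
        (tokens[: i + 1], label)
        for i, label in enumerate(labels)
        if label is not None
    ]


tokens = "The fat cat sat on the mat".split()
same = prefix_pairs(tokens) == shifted_pairs(tokens)
```

Both functions enumerate the same (context, next token) pairs. The difference is that the first framing would need one forward pass per pair, while the second scores all of them from a single pass over the full sequence.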