Calavar•1h ago
> Researchers had observed similar patterns in BERT, where "a surprisingly large amount of attention focuses on the delimiter token [SEP] and periods," which they argued was used by the model as a sort of no-op. The same summer at Meta, researchers studying vision transformers found similar behavior, observing that models would repurpose uninformative background patches as computational scratchpads.
This seems to go beyond just transformers. For example, I recall reading a paper a while ago that showed a similar effect in an image-to-image model with a GAN/U-Net architecture [1].
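The BERT observation in the quoted passage is also easy to poke at directly. Here's a minimal sketch, assuming the HuggingFace transformers library, with bert-base-uncased as the checkpoint and [CLS]/[SEP]/periods as the candidate sink tokens; the exact percentages will vary by layer and input:

```python
# Minimal sketch: measure how much of BERT's attention lands on [CLS], [SEP], and periods.
# Assumes the HuggingFace `transformers` library; checkpoint and example text are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

text = "Attention sinks are easy to spot. Most heads dump their weight on delimiters."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
sink_positions = [i for i, t in enumerate(tokens) if t in ("[CLS]", "[SEP]", ".")]

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer; each row sums to 1 over keys.
for layer, attn in enumerate(outputs.attentions):
    # Average over heads and query positions, then sum the mass landing on the sink tokens.
    mass = attn[0].mean(dim=0).mean(dim=0)[sink_positions].sum().item()
    print(f"layer {layer:2d}: {mass:.1%} of attention goes to [CLS]/[SEP]/periods")
```

If the quoted finding holds for your inputs, the deeper layers should show a disproportionate share of attention piling onto those delimiter positions, which is the "no-op" behavior the paper describes.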
Havoc•1h ago
> The first few tokens often carried minimal semantic information—sometimes just a start-of-sequence marker or common words like "the" or "a."
I wonder if it makes sense to use the first word as a title of sorts rather than going straight into a grammatically correct sentence when prompting.
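One way to ground that intuition before changing how you word prompts: check how much attention a decoder-only model actually sends back to position 0. A minimal sketch, assuming the HuggingFace transformers library and GPT-2 as a stand-in decoder-only model:

```python
# Minimal sketch: how much attention a causal LM pays to position 0 (the usual "sink" slot).
# Assumes the HuggingFace `transformers` library; GPT-2 and the prompt are illustrative stand-ins.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_attentions=True)
model.eval()

prompt = "Summary: the first token often acts as an attention sink rather than carrying meaning."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer; rows sum to 1 over keys.
for layer, attn in enumerate(outputs.attentions):
    # Average attention that every later query position sends back to the very first token.
    to_first = attn[0, :, 1:, 0].mean().item()
    print(f"layer {layer:2d}: {to_first:.1%} of attention goes back to the first token")
```

Whether a "title" first word actually helps is a separate question, but this at least shows how little of the first position's attention budget is spent on its meaning versus serving as a parking spot.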