Even if true, though, I'm not sure what we'd do with it. The bulk of knowledge available on the internet is text. Aside from maybe YouTube, so I guess it could work for world-model-type things? Understanding physical interactions of objects, etc.
I would also argue tokens are outside the embedding space, and a large part of the magic of LLMs (and many other neural network types) is the ability to map sequences of rather crude inputs (tokens) into a more meaningful embedding space, and then map from that meaningful embedding space back to tokens we humans understand.
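A rough sketch of that two-way mapping (this assumes PyTorch; the vocab size, dimensions and token IDs are made up, and the transformer layers that sit in between are skipped):

    # Minimal sketch of the token -> embedding -> token roundtrip (not any specific model's code).
    import torch
    import torch.nn as nn

    vocab_size, d_model = 50_000, 768
    embed = nn.Embedding(vocab_size, d_model)    # crude discrete tokens -> dense vectors
    unembed = nn.Linear(d_model, vocab_size)     # dense vectors -> scores over tokens

    token_ids = torch.tensor([[17, 4242, 311]])  # hypothetical token IDs for some text
    x = embed(token_ids)                         # shape (1, 3, 768): now in the "meaningful" space
    logits = unembed(x)                          # map back toward token space
    next_token = logits[0, -1].argmax().item()   # a token ID humans can detokenize and read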
https://google.github.io/deepvariant/posts/2020-02-20-lookin...
This goes against my limited understanding of how LLMs work, and computers generally for that matter. Isn't that rendering of a smiling emoji still just a series of bits that need to be interpreted as a smiley face? The similar-looking-characters point makes more sense to me, though, assuming it's something along the lines of recognizing that "S" and "$" are roughly the same thing except for the line down the middle. Still, that seems like something that doesn't come up much and is probably covered by observations made in the training corpus.
All that said, Karpathy knows way more than I will ever know on the subject, and I’m only posting my uninformed take here in hopes somebody will correct me in a way I understand.
fastText can handle words it hasn't seen before by composing them from character n-grams; word2vec can learn a better meaning for whole words, but then misses out on the "unknown words".
Image tokens are "text2vec" here, while text tokens are a proxy toward building a text embedding of even previously unseen text.
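Roughly, in gensim terms (a toy corpus and hypothetical hyperparameters, just to show the difference):

    # Toy sketch with gensim (assumed installed): fastText can build a vector for an
    # unseen word from its character n-grams, plain word2vec cannot.
    from gensim.models import FastText, Word2Vec

    sentences = [["cats", "chase", "mice"], ["dogs", "chase", "cats"]]
    ft = FastText(sentences, vector_size=32, min_count=1, min_n=3, max_n=5)
    w2v = Word2Vec(sentences, vector_size=32, min_count=1)

    ft.wv["catz"]     # works: composed from overlapping character n-grams
    w2v.wv["cats"]    # works: the word was in the training data
    # w2v.wv["catz"]  # would raise KeyError: no subword fallback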
User: cna uto nfeir waht tihs si msieplnig?
Google: Here are top web results for exploring this topic [few websites containing or discussing gibberish in general]
User: what was my previous question?
Google: Your previous question was "cna uto nfeir waht tihs si msieplnig?". I was able to understand this through a combination of contextual understanding and sub-word tokenization, which allowed me to infer the intended meaning despite the numerous typos. The corrected question is: "Can you infer what this is misspelling?".
Sort of at the level of word2vec, where the representation of "flies" in "fruit flies like a banana" vs "time flies like an arrow" would be the same.
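For example, with gensim (assumed installed), word2vec assigns one static vector per surface form:

    # word2vec gives "flies" a single vector regardless of context; a contextual model
    # (e.g. an LLM's hidden states) would represent the two uses differently.
    from gensim.models import Word2Vec

    sentences = [["fruit", "flies", "like", "a", "banana"],
                 ["time", "flies", "like", "an", "arrow"]]
    model = Word2Vec(sentences, vector_size=16, min_count=1)
    v = model.wv["flies"]   # one shared vector for both the insect and the verb sense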
I can certainly believe that images bring certain advantages over text for LLMs: the image representation contains useful information that we as humans use (like richer information hierarchies encoded in text size, boldness, color, saturation and position, not just n levels of markdown headings), letter shapes are already optimized for this kind of encoding, and continuous tokens seem to bring some advantages over discrete ones. But none of these advantages needs the roundtrip via images; they merely point to how crude the state of the art of text tokenization is.
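As a small illustration of that crudeness (this assumes the tiktoken library; any BPE tokenizer shows the same thing), all visual hierarchy has to be smuggled in as markup characters that compete for the same token budget:

    # Sketch with tiktoken (assumed installed): BPE flattens text to integer IDs,
    # so emphasis and hierarchy must be spelled out as extra markup tokens.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    plain = "Quarterly revenue grew 12%"
    marked = "## **Quarterly revenue** grew *12%*"

    print(enc.encode(plain))   # a short list of integer IDs, no notion of emphasis
    print(enc.encode(marked))  # longer: the #, ** and * cost tokens of their own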