frontpage.
Made with ♥ by @iamnishanth

Should LLMs just treat text content as an image?

https://www.seangoedecke.com/text-tokens-as-image-tokens/
48•ingve•6d ago

Comments

onesandofgrain•2h ago
A picture is worth a thousand words
hshdhdhehd•2h ago
And a picture of a thousand words is worth a thousand words.
genghisjahn•2h ago
I type at .08 pictures per minute.
Havoc•2h ago
Seems wildly counterintuitive to me frankly.

Even if true, though, I'm not sure what we'd do with it. The bulk of knowledge available on the internet is text. Aside from maybe YouTube, so I guess it could work for world-model-type things? Understanding physical interactions of objects, etc.

hshdhdhehd•1h ago
Trivial to convert text to images to process. But counter-intuitive to me too.
bilsbie•1h ago
All text is technically converted to images before we see it.
thfuran•34m ago
Only if you see it instead of hearing it or touching it.
ToJans•2h ago
A series of tokens is one-dimensional (a sequence). An image is 2-dimensional. What about 3D/4D/... representation (until we end up with an LLM-dimensional solution ofc).
dvt•2h ago
This isn't exactly true, as tokens live in the embedding space, which is n-dimensional, like 256 or 512 or whatever (so you might see one word, but it's actually an array of a bunch of numbers). With that said, I think it's pretty intuitive that continuous tokens are more efficient than discrete ones, simply due to the fact that the LLM itself is basically a continuous function (with coefficients/parameters ∈ ℝ).
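dvt's point that one visible token is actually an array of numbers can be sketched in a few lines; the vocabulary size and 256 dimensions below are arbitrary illustrative choices, not any particular model's:

```python
import numpy as np

# Hypothetical embedding table: 1000 token ids, each mapped to a
# 256-dimensional vector (sizes chosen arbitrarily for illustration).
vocab_size, embed_dim = 1000, 256
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embed_dim))

# A short "sentence" of three token ids: each id becomes a row of real numbers.
token_ids = [17, 42, 901]
vectors = embedding_table[token_ids]
print(vectors.shape)  # (3, 256)
```

So "n-dimensional" here describes each token's vector; the sequence itself is still one token after another.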
wongarsu•1h ago
We call an embedding space n-dimensional, but in this context I would consider it 1-dimensional, as in it's a 1d vector of n values. The terminology just sucks. If we described images the same way we describe embeddings, a 2-megapixel image would have to be called 2-million-dimensional (or 8-million-dimensional if we consider RGBA to be four separate values).

I would also argue that tokens are outside the embedding space, and that a large part of the magic of LLMs (and many other neural network types) is the ability to map sequences of rather crude inputs (tokens) into a more meaningful embedding space, and then map from that embedding space back to tokens we humans understand.

bonsai_spool•2h ago
This doesn’t cite the very significant example of DeepVariant (and, as of 10/16/25, DeepSomatic), which convert genomic data to images in order to find DNA mutations. This has been done since the late 2010s.

https://google.github.io/deepvariant/posts/2020-02-20-lookin...

LysPJ•2h ago
Andrej Karpathy made an interesting comment on the same paper: https://x.com/karpathy/status/1980397031542989305
onionisafruit•6m ago
> It makes two characters that look identical to the eye look as two completely different tokens internally in the network. A smiling emoji looks like a weird token, not an... actual smiling face, pixels and all

This goes against my limited understanding of how LLMs work — and computers generally, for that matter. Isn’t that rendering of a smiling emoji still just a series of bits that need to be interpreted as a smiley face? The similar-looking-characters point makes more sense to me, though, assuming it’s something along the lines of recognizing that “S” and “$” are roughly the same thing except for the line down the middle. Still, that seems like something that doesn’t come up much and is probably covered by observations made in the training corpus.

All that said, Karpathy knows way more than I will ever know on the subject, and I’m only posting my uninformed take here in hopes somebody will correct me in a way I understand.

themoxon•1h ago
There's a new paper from ICCV which basically tries to render every modality as images: https://openaccess.thecvf.com/content/ICCV2025/papers/Hudson...
pcwelder•1h ago
I ϲаn guаrаntее thаt thе ОСR ϲаn't rеаd thіs sеntеnсе ϲоrrесtlу.
geysersam•1h ago
Really? How so?
moduspol•1h ago
Looks like he’s using atypical “c” characters.
syntaxing•1h ago
What’s correct though? Even as a human, I read that “correctly”. Using weird representations of C doesn’t change the word?
metalliqaz•1h ago
Yeah OCR would be much more likely to read that sentence the way a human would.
LudwigNagasena•1h ago
I would even say that OCR can rеаd the sеntеnсе ϲоrrесtlу, while a tokenizer can't.
bitdivision•55m ago
A lot of Cyrillic characters: https://apps.timwhitlock.info/unicode/inspect?s=I+%CF%B2%D0%...
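The trick can be checked locally with Python's stdlib `unicodedata`: the lookalike letters in the sentence above are entirely different codepoints, so a tokenizer sees them as completely different tokens even though they render identically (characters copied from the comment above):

```python
import unicodedata

# Latin "c" and "a" next to their lookalikes from other scripts.
for ch in ["c", "ϲ", "a", "а"]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Prints:
# U+0063 LATIN SMALL LETTER C
# U+03F2 GREEK LUNATE SIGMA SYMBOL
# U+0061 LATIN SMALL LETTER A
# U+0430 CYRILLIC SMALL LETTER A
```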
vindex10•1h ago
reminds me of the difference between fasttext and word2vec.

fasttext can handle words it hasn't seen before by composing them from ngrams; word2vec can learn a better meaning for whole words, but then misses out on the "unknown words".

image tokens are "text2vec" here, while text tokens are a proxy towards building a text embedding of even previously unseen texts.

mohsen1•1h ago
My understanding is that text tokens are too rigid. The way we read is not to process each character (the analogue of tokens for LLMs) individually but to see a word, or sometimes a collection of familiar words, and make sense of the writing. That concept we extract from written text is really what we read, not the letters or words exactly. This is why we can easily read text with typos: it's just similar enough. By letting LLMs not be too hung up on exact tokens and "skim" through text, we can make them more efficient, just like how humans read efficiently.
bonsai_spool•1h ago
You're missing that we already have ‘context’ when we begin reading something, which probably enables our fast reading. Maybe there’s a way to give that background information to an LLM, but then we could also just have it read the entire input stream.
lupire•1h ago
Google AI mode:

User: cna uto nfeir waht tihs si msieplnig?

Google: Here are top web results for exploring this topic [few websites containing or discussing gibberish in general]

User: what was my previous question?

Google: Your previous question was "cna uto nfeir waht tihs si msieplnig?". I was able to understand this through a combination of contextual understanding and sub-word tokenization, which allowed me to infer the intended meaning despite the numerous typos. The corrected question is: "Can you infer what this is misspelling?".
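For illustration, the classic "jumbled interior letters" transform that queries like this resemble can be sketched as follows (a rough approximation, not necessarily the exact scrambling the user applied):

```python
import random

def scramble(word: str, rng: random.Random) -> str:
    """Shuffle a word's interior letters, keeping the first and last fixed."""
    if len(word) <= 3:
        return word
    inner = list(word[1:-1])
    rng.shuffle(inner)
    return word[0] + "".join(inner) + word[-1]

rng = random.Random(0)
print(" ".join(scramble(w, rng) for w in
               "can you infer what this is misspelling".split()))
```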

Miraltar•11m ago
I couldn't read it
nacozarina•1h ago
the enshittifiers simply haven't yet weighed image-processing fees against potential token charges; once they have, your cost advantage goes bye-bye
ghoul2•1h ago
But does this not miss the "context" that the embeddings of the text tokens carry? An LLM embedding of a text token has a compressed version of the entire preceding set of tokens that came before it in the context. While the image embeddings are just representations of pixel values.

Sort of at the level of word2vec, where the representation of "flies" in "fruit flies like a banana" vs "time flies like an arrow" would be the same.

wongarsu•1h ago
Look long enough at literature on any machine learning task, and someone invariably gets the idea to turn the data into an image and do machine learning on that. Sometimes it works out (turning binaries into images and doing malware detection with a CNN surprisingly works), usually it doesn't. Just like in this example the images usually end up as a kludge to fix some deficiency in the prevalent input encoding.

I can certainly believe that images bring certain advantages over text for LLMs: the image representation contains useful information that we as humans use (like better information hierarchies encoded in text size, boldness, color, saturation, and position, not just n levels of markdown headings), letter shapes are already optimized for this kind of encoding, and continuous tokens seem to bring some advantages over discrete ones. But none of these advantages needs the roundtrip via images; they merely point to how crude the state of the art of text tokenization is.

pegasus•32m ago
Exactly. The example the article gives of reducing resolution as a form of compression highlights the limitations of the visual-only proposal. Blurring text is a poor form of compression, preserving at most information about paragraph sizes. Summarizing early paragraphs (as context compression does in coding agents) would be much more efficient.
vanderZwan•16m ago
Doesn't this more or less boil down to OCR scans of books having more privileged information than a plaintext file? In which case a roundtrip won't add anything?

[0] https://web.archive.org/web/20140402025221/http://m.nautil.u...

aitchnyu•1h ago
The amount of video/imagery it takes to make a million tokens versus the amount of text to do the same is a surprisingly low ratio. Did they have the same intuition?
metalliqaz•1h ago
Future headline: "The unreasonable effectiveness of text encoding"
leemcalilly•42m ago
and reading (aka “ocr”) is the fastest way for the brain to process language.
qiine•38m ago
or maybe 3d objects, since that's closer to what real life is and how the brain shaped itself around?
mannykannot•26m ago
Language was spoken long before it was written (or so it seems.) This article almost suggests that sound might be a superior input medium over either digital text or images.
falcor84•8m ago
I've never worked in that area, but recall reading about how images of spectrograms are often superior inputs to neural nets in comparison to the raw audio data.
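A minimal sketch of that idea, assuming a plain NumPy short-time Fourier transform (the frame and hop sizes here are arbitrary choices):

```python
import numpy as np

def spectrogram(signal: np.ndarray, frame: int = 256, hop: int = 128) -> np.ndarray:
    """Slice the waveform into overlapping windowed frames and FFT each one,
    yielding a 2-D time-frequency magnitude array an image model can consume."""
    window = np.hanning(frame)
    frames = [signal[i:i + frame] * window
              for i in range(0, len(signal) - frame + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))  # (time, frequency)

# A one-second 440 Hz test tone sampled at ~8 kHz.
t = np.linspace(0, 1, 8000)
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (61, 129): 61 time frames x 129 frequency bins
```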

Rust cross-platform GPUI components

https://github.com/longbridge/gpui-component
160•xvilka•3h ago•56 comments

Recall for Linux

https://github.com/rolflobker/recall-for-linux
264•anticensor•5h ago•101 comments

Don't forget these tags to make HTML work like you expect

https://blog.jim-nielsen.com/2025/dont-forget-these-html-tags/
72•FromTheArchives•3h ago•35 comments

Microsoft needs to open up more about its OpenAI dealings

https://www.wsj.com/tech/ai/microsoft-needs-to-open-up-more-about-its-openai-dealings-59102de8
98•zerosizedweasle•1h ago•53 comments

WorldGrow: Generating Infinite 3D World

https://github.com/world-grow/WorldGrow
38•cdani•3h ago•25 comments

Why I'm teaching kids to hack computers

https://www.hacktivate.app/why-teach-kids-to-hack
113•twostraws•4d ago•43 comments

Geoutil.com – Measure distances, areas, and convert geo data in the browser

https://geoutil.com
57•FreeGuessr•6d ago•10 comments

Artifact (YC W25) is hiring engineers in NYC to build modern ECAD

1•antonysamuel•1h ago

Corrosion

https://fly.io/blog/corrosion/
26•cgb_•4d ago•5 comments

How I turned Zig into my favorite language to write network programs in

https://lalinsky.com/2025/10/26/zio-async-io-for-zig.html
258•0x1997•13h ago•92 comments

Show HN: Write Go code in JavaScript files

https://www.npmjs.com/package/vite-plugin-use-golang
88•yar-kravtsov•7h ago•26 comments

Unexpected patterns in historical astronomical observations

https://www.su.se/english/news/unexpected-patterns-in-historical-astronomical-observations-1.855042
13•XzetaU8•3d ago•1 comments

What Happened to Running What You Wanted on Your Own Machine?

https://hackaday.com/2025/10/22/what-happened-to-running-what-you-wanted-on-your-own-machine/
144•marbartolome•4h ago•63 comments

If your adversary is the mossad (2014) [pdf]

https://www.usenix.org/system/files/1401_08-12_mickens.pdf
162•xeonmc•4h ago•105 comments

Structure and Interpretation of Classical Mechanics (2014)

https://tgvaughan.github.io/sicm/toc.html
55•the-mitr•8h ago•17 comments

You already have a Git server

https://maurycyz.com/misc/easy_git/
575•chmaynard•1d ago•379 comments

Show HN: MyraOS – My 32-bit operating system in C and ASM (Hack Club project)

https://github.com/dvir-biton/MyraOS
204•dvirbt•16h ago•42 comments

Sandhill cranes have adopted a Canada gosling

https://www.smithsonianmag.com/science-nature/these-sandhill-cranes-have-adopted-a-canadian-gosli...
113•NaOH•4d ago•29 comments

Ken Thompson recalls Unix's rowdy, lock-picking origins

https://thenewstack.io/ken-thompson-recalls-unixs-rowdy-lock-picking-origins/
188•dxs•20h ago•32 comments

You are how you act

https://boz.com/articles/you-are-how-you-act
110•HiPHInch•1h ago•77 comments

An overengineered solution to `sort | uniq -c` with 25x throughput (hist)

https://github.com/noamteyssier/hist-rs
82•noamteyssier•4d ago•57 comments

Sphere Computer – The Innovative 1970s Computer Company Everyone Forgot

https://sphere.computer/
75•ChrisArchitect•3d ago•7 comments

Are-we-fast-yet implementations in Oberon, C++, C, Pascal, Micron and Luon

https://github.com/rochus-keller/Are-we-fast-yet
72•luismedel•14h ago•19 comments

A definition of AGI

https://arxiv.org/abs/2510.18212
254•pegasus•19h ago•410 comments

A bug that taught me more about PyTorch than years of using it

https://elanapearl.github.io/blog/2025/the-bug-that-taught-me-pytorch/
405•bblcla•3d ago•76 comments

We saved $500k per year by rolling our own "S3"

https://engineering.nanit.com/how-we-saved-500-000-per-year-by-rolling-our-own-s3-6caec1ee1143
251•mpweiher•16h ago•199 comments

Feed the bots

https://maurycyz.com/misc/the_cost_of_trash/
246•chmaynard•1d ago•172 comments

Why JPEG XL Ignoring Bit Depth Is Genius (and Why AVIF Can't Pull It Off)

https://www.fractionalxperience.com/ux-ui-graphic-design-blog/why-jpeg-xl-ignoring-bit-depth-is-g...
77•Bogdanp•5h ago•44 comments

Asbestosis

https://diamondgeezer.blogspot.com/2025/10/asbestosis.html
270•zeristor•1d ago•200 comments