
Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?

https://twitter.com/karpathy/status/1980397031542989305
102•JnBrymn•1d ago

Comments

yunwal•1d ago
> The more interesting part for me (esp as a computer vision at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.

> Maybe it makes more sense that all inputs to LLMs should only ever be images.

So, what, every time I want to ask an LLM a question I paint a picture? I mean at that point why not just say "all input to LLMs should be embeddings"?

smegma2•1d ago
No? He’s talking about rendered text
rhdunn•2h ago
From the post, he's referring to text input as well:

> Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:

Italicized emphasis mine.

So he's suggesting, or at least wondering, whether the vision encoder should be the only input path to the LLM, with the model reading text through it. There would be a rasterization step that turns the text input into an image.

Thus, you wouldn't need to draw a picture; you'd just rasterize the text and feed the resulting image to the vision model.
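As a rough illustration (my own sketch, not anything from the paper or the tweet), that rasterization step could be as simple as drawing the prompt onto a blank image with Pillow; the font and canvas size here are arbitrary assumptions:

    from PIL import Image, ImageDraw, ImageFont

    def rasterize_text(text, width=1024, height=256):
        # Draw the prompt onto a white canvas; this image, not the string,
        # is what would be handed to the vision encoder.
        img = Image.new("RGB", (width, height), "white")
        draw = ImageDraw.Draw(img)
        font = ImageFont.load_default()  # swap in a real TTF for legibility
        draw.multiline_text((10, 10), text, fill="black", font=font)
        return img

    rasterize_text("Even pure text input gets fed in as pixels.").save("prompt.png")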

fspeech•2h ago
If you can read your input on your screen, your computer apparently already knows how to convert your text to images.
CuriouslyC•2h ago
All inputs being embeddings can work if you have Matryoshka-style embeddings; the hard part is adaptively selecting the embedding size for a given datum.
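For what it's worth, a minimal sketch of the Matryoshka idea (assuming an embedding trained so that its prefixes are themselves usable, which is the point of Matryoshka representation learning):

    import numpy as np

    def truncate_embedding(full, dim):
        # Keep only the first `dim` components and re-normalize; with a
        # Matryoshka-trained embedding this prefix is still a valid embedding.
        v = full[:dim]
        return v / np.linalg.norm(v)

    full = np.random.randn(1024)              # stand-in for a trained embedding
    cheap = truncate_embedding(full, 64)      # easy datum: small prefix suffices
    precise = truncate_embedding(full, 1024)  # hard datum: keep everything

The hard part mentioned above is choosing `dim` per datum, which this sketch just hard-codes.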
dang•3h ago
Recent and related:

Getting DeepSeek-OCR working on an Nvidia Spark via brute force with Claude Code - https://news.ycombinator.com/item?id=45646559 - Oct 2025 (43 comments)

DeepSeek OCR - https://news.ycombinator.com/item?id=45640594 - Oct 2025 (238 comments)

sabareesh•2h ago
It might be that our current tokenization is inefficient compared to how well the image pipeline does. Language already does a lot of compression, but there might be an even better way to represent it in latent space.
ACCount37•2h ago
People in the industry know that tokenizers suck and there's room to do better. But actually doing it better? At scale? Now that's hard.
typpilol•2h ago
It will require like 20x the compute
Mehvix•29m ago
Why do you suppose this is a compute-limited problem?
ACCount37•3m ago
It's kind of a shortcut answer by now. Especially for anything that touches pretraining.

"Why aren't we doing X?", where X is a thing that sounds sensible, seems like it would help, and does indeed help, and there's even a paper here proving that it helps.

The answer is: check the paper; it says there on page 12, in a throwaway line, that they used 3 times the compute for the new method compared to the controls. And the gain was +4%.

A lot of promising things are resource hogs, and there are too many better things to burn the GPU-hours on.

ACCount37•25m ago
A lot of cool things are shot down by "it requires more compute, and by a lot, and we're already compute starved on any day of the week that ends in y, so, not worth it".

If we had a million times the compute? We might have brute forced our way to AGI by now.

Jensson•15m ago
But we don't have a million times the compute; we have the compute we have, so it's fair to argue that we want to prioritize other things.
kenjackson•20m ago
Why so much compute? Can you tie it to the problem?
CuriouslyC•2h ago
Image models use "larger" tokens. You can get this effect with text tokens if you use a larger token dictionary and generate common n-gram tokens, but the current LLM architecture isn't friendly to large output distributions.
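A toy sketch of the larger-dictionary idea, promoting the most common word bigrams to single tokens (corpus and merge count made up for illustration):

    from collections import Counter

    corpus = ["the quick brown fox", "the quick red fox", "a quick brown dog"]
    vocab = {w for line in corpus for w in line.split()}

    # Count adjacent word pairs and promote the most frequent ones to tokens.
    pairs = Counter()
    for line in corpus:
        words = line.split()
        pairs.update(zip(words, words[1:]))

    for (a, b), _ in pairs.most_common(2):
        vocab.add(f"{a} {b}")  # one token now covers two words

    print(sorted(vocab))

Every n-gram added this way also grows the output distribution the model has to predict over, which is the unfriendly part.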
hbarka•2h ago
Chinese writing is logographic. Could this be giving Chinese developers a better intuition for pixels as input rather than text?
anabis•27m ago
Yeah, mapping Chinese characters to linear UTF-8 space throws a lot of information away. Each language brings some ideas for text processing. The SentencePiece inventor is Japanese, and Japanese doesn't have explicit word delimiters, for example.
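A quick illustration of the UTF-8 point: one common character costs three bytes (and usually more than one byte-level token) while carrying roughly a word's worth of meaning:

    word = "猫"  # "cat": one character, one morpheme
    print(len(word), len(word.encode("utf-8")))  # -> 1 3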
varispeed•2h ago
Text is linear, whereas an image is parallel. When people read, they often don't scan text from left to right (or another direction, depending on language), but rather take it in all at once or non-linearly: first locking onto keywords, then reading adjacent words to get the meaning, often even skipping some filler sentences unconsciously.

Sequential reading of text is very inefficient.

sosodev•2h ago
LLMs don't "read" text sequentially, right?
olliepro•1h ago
The causal masking means future tokens don't affect previous tokens' embeddings as they evolve throughout the model, but all tokens are processed in parallel... so, yes and no. See this previous HN post (https://news.ycombinator.com/item?id=45644328) about how bidirectional encoders are similar to diffusion's non-linear way of generating text. Vision transformers use bidirectional encoding because of the non-causal nature of image pixels.
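A tiny sketch of the masking difference, just to make the "yes and no" concrete:

    import numpy as np

    def causal_mask(n):
        # Decoder-style: position i can attend only to positions <= i.
        return np.tril(np.ones((n, n), dtype=bool))

    def bidirectional_mask(n):
        # Encoder/ViT-style: every position attends to every position.
        return np.ones((n, n), dtype=bool)

    print(causal_mask(4).astype(int))
    print(bidirectional_mask(4).astype(int))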
Merik•45m ago
Didn't Anthropic show that the models engage in a form of planning, such that they predict possible future tokens that then affect the prediction of the next token? https://transformer-circuits.pub/2025/attribution-graphs/bio...
ACCount37•1m ago
Sure, an LLM can start "preparing" for token N+4 at token N. But that doesn't change the fact that token N can't "see" N+1.

Causality is enforced in LLMs - past tokens can affect future tokens, but not the other way around.

spiralcoaster•27m ago
Which people do you know who do this? I absolutely read in a linear fashion unless I'm deliberately skimming something to get the gist of it. Who can read the text "all at once"?!
cnxhk•1h ago
The paper is quite interesting, but efficiency on OCR tasks does not mean it could be plugged into a general LLM directly without performance loss. If you train a tokenizer only on OCR text, you might already be able to get better compression.
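A hedged sketch of that experiment with the Hugging Face tokenizers library; the corpus file name and vocab size are placeholders, not anything from the paper:

    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=8000, special_tokens=["[UNK]"])
    tokenizer.train(files=["ocr_corpus.txt"], trainer=trainer)  # OCR-only corpus

    sample = open("ocr_corpus.txt", encoding="utf-8").read()[:10_000]
    ids = tokenizer.encode(sample).ids
    print(f"{len(sample) / len(ids):.2f} characters per token")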
ianbutler•20m ago
https://arxiv.org/abs/2510.17800 (Glyph: Scaling Context Windows via Visual-Text Compression)

You can also see this paper from the GLM team, where they explicitly test this assumption with some pretty good results.

viraptor•4m ago
https://xcancel.com/karpathy/status/1980397031542989305

Google flags Immich sites as dangerous

https://immich.app/blog/google-flags-immich-as-dangerous
141•janpio•4h ago•35 comments

Ovi: Twin backbone cross-modal fusion for audio-video generation

https://github.com/character-ai/Ovi
238•montyanderson•5h ago•86 comments

Willow quantum chip demonstrates verifiable quantum advantage on hardware

https://blog.google/technology/research/quantum-echoes-willow-verifiable-quantum-advantage/
371•AbhishekParmar•9h ago•176 comments

Mass Assignment Vulnerability Exposes Max Verstappen Passport and F1 Drivers PII

https://ian.sh/fia
224•galnagli•6h ago•51 comments

JMAP for Calendars, Contacts and Files Now in Stalwart

https://stalw.art/blog/jmap-collaboration/
236•StalwartLabs•7h ago•102 comments

Why SSA Compilers?

https://mcyoung.xyz/2025/10/21/ssa-1/
101•transpute•4h ago•37 comments

Scripts I wrote that I use all the time

https://evanhahn.com/scripts-i-wrote-that-i-use-all-the-time/
450•speckx•10h ago•159 comments

Play abstract strategy board games online with friends or against bots

https://abstractboardgames.com/
27•abstractbg•6d ago•4 comments

Element: setHTML() method

https://developer.mozilla.org/en-US/docs/Web/API/Element/setHTML
117•todsacerdoti•16h ago•48 comments

VortexNet: Neural network based on fluid dynamics

https://github.com/samim23/vortexnet
12•vegax87•2h ago•1 comment

Rivian's TM-B electric bike

https://www.theverge.com/news/804157/rivian-tm-b-electric-bike-price-specs-helmet-quad
139•hasheddan•7h ago•227 comments

Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?

https://twitter.com/karpathy/status/1980397031542989305
102•JnBrymn•1d ago•26 comments

Common yeast can survive Martian conditions

https://phys.org/news/2025-10-common-yeast-survive-martian-conditions.html
49•geox•1w ago•24 comments

InpharmD (YC W21) Is Hiring – NLP Engineer

https://inpharmd.com/jobs/inpharmd-is-hiring-ai-ml-engineer
1•tulasichintha•4h ago

YASA beats own power density record pushing electric motor to 59kW/kg benchmark

https://yasa.com/news/yasa-smashes-own-unofficial-power-density-world-record-pushing-state-of-the...
30•breve•4h ago•15 comments

Iceland reports the presence of mosquitoes as climate warms

https://www.npr.org/2025/10/22/nx-s1-5582748/iceland-mosquitoes-first-time
75•sans_souse•2h ago•19 comments

LibCube: Find new sounds from audio synths easier

https://github.com/cslr/libcube-public/wiki
13•cslr•4d ago•3 comments

HP SitePrint

https://www.hp.com/us-en/printers/site-print/layout-robot.html
149•gjvc•7h ago•102 comments

Show HN: Cuq – Formal Verification of Rust GPU Kernels

https://github.com/neelsomani/cuq
36•nsomani•5h ago•28 comments

The first interstellar software update: The hack that saved Voyager 1 [video]

https://www.youtube.com/watch?v=p0K7u3B_8rY
7•daemonologist•1w ago•3 comments

I see a future in jj

https://steveklabnik.com/writing/i-see-a-future-in-jj/
209•steveklabnik•7h ago•122 comments

Cryptographic Issues in Cloudflare's Circl FourQ Implementation (CVE-2025-8556)

https://www.botanica.software/blog/cryptographic-issues-in-cloudflares-circl-fourq-implementation
142•botanica_labs•10h ago•68 comments

I, Sharpie

https://www.commonplace.org/p/chris-griswold-i-sharpie
18•delichon•1w ago•17 comments

Criticisms of “The Body Keeps the Score”

https://josepheverettwil.substack.com/p/the-body-keeps-the-score-is-bullshit
188•adityaathalye•6h ago•201 comments

Enchanting Imposters

https://daily.jstor.org/enchanting-imposters/
8•Petiver•18h ago•1 comment

The Tonnetz

https://thetonnetz.com/
43•mci•4d ago•9 comments

Rethinking CQRS: An Interview on OpenCQRS

https://docs.eventsourcingdb.io/blog/2025/10/23/rethinking-cqrs-an-interview-on-opencqrs/
17•goloroden•4h ago•0 comments

Linux Capabilities Revisited

https://dfir.ch/posts/linux_capabilities/
166•Harvesterify•11h ago•35 comments

Greg Newby, CEO of Project Gutenberg Literary Archive Foundation, has died

https://www.pgdp.net/wiki/In_Memoriam/gbnewby
516•ron_k•16h ago•70 comments

Galaxy XR: The first Android XR headset

https://blog.google/products/android/samsung-galaxy-xr/
161•thelastgallon•8h ago•175 comments