Getting DeepSeek-OCR working on an Nvidia Spark via brute force with Claude Code - https://news.ycombinator.com/item?id=45646559 - Oct 2025 (43 comments)
DeepSeek OCR - https://news.ycombinator.com/item?id=45640594 - Oct 2025 (238 comments)
"Why aren't we doing X?", where X is a thing that sounds sensible, seems like it would help, and does indeed help, and there's even a paper here proving that it helps.
The answer is: check the paper; a throwaway line on page 12 says they used three times the compute for the new method as for the controls. And the gain was +4%.
A lot of promising things are resource hogs, and there are too many better things to burn the GPU-hours on.
If we had a million times the compute? We might have brute forced our way to AGI by now.
Sequential reading of text is very inefficient.
Causality is enforced in LLMs - past tokens can affect future tokens, but not the other way around.
You can also see the paper from the GLM team where they explicitly test this assumption, with some pretty good results.
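For concreteness, this is roughly how that causality constraint is enforced in a standard decoder-only transformer: a causal mask blocks attention to future positions. A minimal PyTorch-style sketch of the general technique, not any particular model's implementation:

    # Sketch of causal masking in self-attention (illustrative, not DeepSeek's code).
    import torch

    T = 5                                   # sequence length
    scores = torch.randn(T, T)              # raw attention scores (query x key)

    # Strict upper triangle marks future positions: token i may not attend to j > i.
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))

    weights = torch.softmax(scores, dim=-1) # each row only weights past/current tokens

Dropping or relaxing that mask is the kind of assumption the GLM paper mentioned above is probing.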
yunwal•1d ago
> Maybe it makes more sense that all inputs to LLMs should only ever be images.
So, what, every time I want to ask an LLM a question I paint a picture? I mean at that point why not just say "all input to LLMs should be embeddings"?
rhdunn•2h ago
> Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:
Italicized emphasis mine.
So he's suggesting, or at least wondering, whether a vision encoder should be the only input path to the LLM, with the model reading text through it. That would mean a rasterization step on any text input to produce an image.
Thus you wouldn't need to draw a picture; the text would simply be rendered to a raster and fed to the vision model.
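To make that rasterization step concrete, here is a minimal sketch using Pillow; the image size, coordinates, and filename are illustrative assumptions, not anything from the paper:

    # Render plain text to an image so a vision encoder can consume it.
    from PIL import Image, ImageDraw

    text = "Maybe all inputs to LLMs should be images."
    img = Image.new("RGB", (800, 64), color="white")
    draw = ImageDraw.Draw(img)
    draw.text((8, 8), text, fill="black")   # default bitmap font; swap in a TTF for real use
    img.save("rendered_text.png")           # this PNG, not the raw string, is what the model sees

The user still types text as usual; the rendering happens automatically before the model ever sees the input.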