A text encoding uses about 8 bits per character on average, and tokenization compresses that further.
A bitmap font glyph would already be 25 bits at 5x5, and most fonts are 12 pixels high.
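To put rough numbers on that comparison, here is a back-of-the-envelope sketch in Python. It only counts raw 1-bit pixels against 8-bit characters. The 6 px glyph width for a 12 px font is my own assumption, not something from the thread.

```python
# Raw storage per character, ignoring tokenization and image compression.
TEXT_BITS_PER_CHAR = 8  # figure quoted in the comment above


def bitmap_bits_per_glyph(width_px: int, height_px: int, bits_per_pixel: int = 1) -> int:
    """Bits needed to store one uncompressed glyph bitmap."""
    return width_px * height_px * bits_per_pixel


tiny = bitmap_bits_per_glyph(5, 5)       # the 5x5 case from the comment
typical = bitmap_bits_per_glyph(6, 12)   # assumed 6 px wide, 12 px high

print(f"5x5 glyph:  {tiny} bits, {tiny / TEXT_BITS_PER_CHAR:.2f}x an 8-bit char")
print(f"6x12 glyph: {typical} bits, {typical / TEXT_BITS_PER_CHAR:.2f}x an 8-bit char")
```

By this naive measure the image is several times larger per character, which is the point being made: any saving has to come from how tokens are counted, not from the raw encoding.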
Of course it isn't efficient. This is a pricing inefficiency and a hack to exploit it (even the author describes it as an exploit).
Text tokens are high-dimensional vectors, not 8 bits per character. Every token has a deep embedding, e.g. 1024 float values per text token.
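A quick sketch of what that means in bits, using the 1024-float example from the comment. The 4 characters per token average and the 16/32-bit value widths are illustrative assumptions, not measurements of any particular model.

```python
def embedding_bits_per_token(dim: int, bits_per_value: int) -> int:
    """Size of one token's embedding vector in bits."""
    return dim * bits_per_value


def embedding_bits_per_char(dim: int, bits_per_value: int, chars_per_token: float) -> float:
    """Embedding bits spread across the characters that the token covers."""
    return embedding_bits_per_token(dim, bits_per_value) / chars_per_token


# 1024 values per token is the example from the comment; everything else is assumed.
for bits in (32, 16):
    per_token = embedding_bits_per_token(1024, bits)
    per_char = embedding_bits_per_char(1024, bits, chars_per_token=4.0)
    print(f"{bits}-bit values: {per_token} bits/token, {per_char:.0f} bits/char")
```

Seen this way, the model-side representation of a text token is thousands of bits, so comparing 8-bit characters to pixels says little about what the model actually processes.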
DeepSeek-OCR reported roughly 10x compression from visual embedding of text, which was a groundbreaking result. [1]
Very cool to see OP's project hacking on this principle. It's still not lossless, as noted on the GitHub page, but it is a promising research direction.
[1] https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSe...
Either some random person discovered a 60% across-the-board gain in all LLMs, using an extremely simple trick that none of the labs noticed in all these years of multi-trillion-dollar growth
or
Anthropic's marketing team might not have priced images on par with text in their rush to drive growth via money-losing offerings.
It kinda makes sense too. While people do read code word by word, we often "glance over" it and do rough pattern recognition to know what it does, only homing in on something when we need to answer a specific question. I think humans kinda naturally do this exploit anyway.
So my guess is that Claude's backend is doing the same, so this hack is probably more of a loophole in token accounting that might get closed if Claude is doing what Gemini does.
Input tokens are cheaper than output tokens. It seems like this would reduce input tokens at the expense of many more output tokens, if you're actually triggering OCR via thinking?
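A rough way to check that tradeoff is to find the break-even point. This is only a sketch. The prices and token counts below are made-up placeholders, not any provider's actual rates, and it assumes thinking tokens are billed as output.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def break_even_output_tokens(saved_input_tokens: int,
                             input_price_per_m: float,
                             output_price_per_m: float) -> float:
    """Extra output tokens that exactly cancel the input-token savings."""
    return saved_input_tokens * input_price_per_m / output_price_per_m


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
IN_PRICE, OUT_PRICE = 3.0, 15.0

baseline = request_cost(10_000, 500, IN_PRICE, OUT_PRICE)        # plain text prompt
as_image = request_cost(4_000, 500 + 2_000, IN_PRICE, OUT_PRICE)  # assumed extra thinking

print(f"text prompt:  ${baseline:.4f}")
print(f"image prompt: ${as_image:.4f}")
print("break-even extra output tokens:",
      break_even_output_tokens(6_000, IN_PRICE, OUT_PRICE))
```

With a 5:1 output-to-input price ratio, every 5 input tokens saved are cancelled by 1 extra output token, so the trick only pays off if the model does not spend much extra thinking on reading the image.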
dimitropoulos•2h ago