Can you save on LLM tokens using images instead of text?

https://pagewatch.ai/blog/post/llm-text-as-image-tokens/

48•lpellis•3mo ago

Comments

bikeshaving•3mo ago

Does this mean we’ll finally get empirical proof for the aphorism “a picture is worth a thousand words”?

https://en.wikipedia.org/wiki/A_picture_is_worth_a_thousand_...

heltale•3mo ago

I suppose it’s only worth 256 words at a time right now. ;)

https://arxiv.org/abs/2010.11929

estebarb•3mo ago

The CALM paper https://shaochenze.github.io/blog/2025/CALM/ says it is possible to compress 4 tokens in a single embedding, so... image = 4×256=1024 words > 1000 words. QED

bikeshaving•3mo ago

2.4% relative error is not bad.

pastor_williams•3mo ago

Reminds me of Babbage making allowance for meter.

"""

    ... it is said that he [Babbage] sent the following letter to Alfred, Lord Tennyson about a couplet in "The Vision of Sin":

         Every minute dies a man,
         Every minute one is born

    I need hardly point out to you that this calculation would tend to keep the sum total of the world's population in a state of perpetual equipoise, whereas it is a well-known fact that the said sum total is constantly on the increase. I would therefore take the liberty of suggesting that in the next edition of your excellent poem the erroneous calculation to which I refer should be corrected as follows:

         Every minute dies a man,
         And one and a sixteenth is born

    I may add that the exact figures are 1.167, but something must, of course, be conceded to the laws of metre.

"""

    Charles Babbage and his Calculating Engines

zahlman•3mo ago

Wouldn't "one and a sixth" be more accurate in both respects?

cbhl•3mo ago

Shouldn't it be the other way around if the population is increasing? Every minute one is born = 1440 born/day, every minute and a sixteenth ~= 1335 dead/day for a net population increase of 105/day.

BrenBarn•2mo ago

It means that in every minute, one and a sixteenth of a man is born.

behnamoh•3mo ago

how do you decompress all those 4 words from one token?

estebarb•3mo ago

Not from one token, from one embedding. Text contains a low amount of information: it is possible to compress a few token embeddings into a single tiken embedding.

The how is variable. The calm paper seems to have used a MLP to compress from and ND input (N embeddings of size D) into a single D embedding and other for decompress them back

HarHarVeryFunny•3mo ago

The mechanism would be prediction (learnt during training), not decompression.

It's the same as LLMs being able to "decode" Base64, or work with sub-word tokens for that matter, it just learns to predict that:

<compressed representation> will be followed by (or preceded by) <decompressed representation>, or vice versa.

floodfx•3mo ago

Why are completion tokens more with image prompts yet the text output was about the same?

Garlef•3mo ago

"Thinking" Mode

nunodonato•3mo ago

it doesn't say that anywhere.

cma•3mo ago

Some multimodal models may have a hidden captioning step that may take completion tokens, others work on a fully native representation, and some do both I think.

ashed96•3mo ago

In my experience, LLMs tend to take noticeably longer to process images than text.

psadri•3mo ago

I wonder if these stay in the prefix cache?

weird-eye-issue•3mo ago

It has to get the image data first, basically just IO time before processing it

ashed96•3mo ago

IIRC there's pre-processing (embedding/tokenization?) before feeding images to LLMs?

Hit this issue optimizing LLM request times. Ending up lowering image resolution. Lost some accuracy but could bear that.

GPT-5.3-Codex System Card [pdf]

Atlas: Manage your database schema as code

Geist Pixel

Show HN: MCP to get latest dependency package and tool versions

The better you get at something, the harder it becomes to do

Show HN: WP Float – Archive WordPress blogs to free static hosting

Show HN: I Hacked My Family's Meal Planning with an App

Sony BMG copy protection rootkit scandal

The Future of Systems

NASA now allowing astronauts to bring their smartphones on space missions

Claude Code Is the Inflection Point

Show HN: MicroClaw – Agentic AI Assistant for Telegram, Built in Rust

Show HN: Omni-BLAS – 4x faster matrix multiplication via Monte Carlo sampling

The AI-Ready Software Developer: Conclusion – Same Game, Different Dice

AI Agent Automates Google Stock Analysis from Financial Reports

Voxtral Realtime 4B Pure C Implementation

I Was Trapped in Chinese Mafia Crypto Slavery [video]

U.S. CBP Reported Employee Arrests (FY2020 – FYTD)

Show HN: I built a free UCP checker – see if AI agents can find your store

Show HN: SVGV – A Real-Time Vector Video Format for Budget Hardware

Study of 150 developers shows AI generated code no harder to maintain long term

Spotify now requires premium accounts for developer mode API access

When Albert Einstein Moved to Princeton

Agents.md as a Dark Signal

System time, clocks, and their syncing in macOS

McCLIM and 7GUIs – Part 1: The Counter

So whats the next word, then? Almost-no-math intro to transformer models

Ed Zitron: The Hater's Guide to Microsoft

UK infants ill after drinking contaminated baby formula of Nestle and Danone

Show HN: Android-based audio player for seniors – Homer Audio Player