Author here. I started this project after reading the earlier threads on DeepSeek-OCR [1][2]. I got really excited about "vision for context compression," but after reading their paper, a couple things were bugging me.
They show good OCR results (image → text), but the pitch is context compression (text → image → text). They never actually test that pipeline. So I implemented it: render text, compress to vision tokens, reconstruct. Then I compared against just compressing the text embeddings directly. Mean pooling (averaging embeddings in a sliding window) nearly matched DeepSeek-OCR. A small conv encoder crushed both.
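Roughly, the two text-side baselines look like this (a simplified PyTorch sketch, not the exact code; shapes, names, and the non-overlapping windows are illustrative):

    import torch
    import torch.nn as nn

    # x: token embeddings, shape (batch, seq_len, dim); factor: compression ratio

    def mean_pool(x, factor):
        # average non-overlapping windows of `factor` embeddings -> seq_len // factor vectors
        b, t, d = x.shape
        return x[:, : t - t % factor].reshape(b, -1, factor, d).mean(dim=2)

    class ConvEncoder(nn.Module):
        # small strided 1D conv stack that shortens the sequence by `factor`
        def __init__(self, dim, factor):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=factor, stride=factor),
                nn.GELU(),
                nn.Conv1d(dim, dim, kernel_size=3, padding=1),
            )

        def forward(self, x):  # (batch, seq_len, dim) -> (batch, seq_len // factor, dim)
            return self.net(x.transpose(1, 2)).transpose(1, 2)

Reconstruction is then just training a decoder to recover the original tokens from the compressed sequence, at the same compression factor as the vision path.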
Ok so fine, maybe vision isn't special for reconstruction. But maybe the path matters more than the destination. Do the representations learned through vision work better for language modeling? I finetuned the checkpoints from the reconstruction experiments for next-token prediction. Vision and mean pooling couldn't beat truncation, but the conv encoder could. I didn't do any architecture search. It just worked.
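For the language-modeling comparison, the setup is roughly this (sketch only; lm stands in for any decoder that takes input embeddings and returns next-token logits, and the real training loop differs):

    import torch
    import torch.nn.functional as F

    def lm_loss(lm, embed, compressor, old_ids, recent_ids, target_ids):
        # compress the older context into a short soft prefix, keep recent tokens as-is
        recent = embed(recent_ids)                 # (batch, t_recent, dim)
        prefix = compressor(embed(old_ids))        # (batch, t_old // factor, dim)
        inputs = torch.cat([prefix, recent], dim=1)
        logits = lm(inputs)                        # (batch, prefix_len + t_recent, vocab)
        logits = logits[:, -target_ids.size(1):]   # score only the target positions
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))

    # truncation baseline: drop old_ids entirely and feed only embed(recent_ids)

The question is just whether the compressed prefix earns its keep over dropping the old context outright, and only the conv encoder's did.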
That said, this is preliminary work. I just wanted to answer the obvious next questions. So far, the findings don't support the "vision for context compression" narrative.
Happy to answer questions.
[1] https://news.ycombinator.com/item?id=45640594

[2] https://news.ycombinator.com/item?id=45658928