It had a surprisingly high hit rate. It took over 3 hours of LLM calls, but who cares - it was completely hands-off. I then compared the invoices to my bank statements (aka I asked an LLM to do it) and it just missed a few invoices that weren't included as attachments (like those "click to download" mails). It did a pretty poor job matching invoices to bank statements (like "oh, this invoice is a few dollars off, but I'm sure it's this statement"), so I'm afraid I still need an accountant for a while.
"What did it cost"? I don't know. I used a cheap-ish model, Claude 3.7 I think.
The biggest problem with direct image extraction is multipage documents. We found that single-page extraction (OCR=>LLM vs Image=>LLM) slightly favored direct image extraction, but anything beyond 5 images had a sharp fall-off in accuracy compared to OCR first.
Which makes sense: long-context recall over text is already a hard problem, but it's what LLMs are optimized for. Long-context recall over images is still pretty bad.
We're currently researching surgery on the KV cache or attention maps so that larger batches of images work better. Sliding-window attention or Infinite Retrieval seem like promising directions to go into.
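For concreteness, here's a minimal sketch of the sliding-window idea - just an illustration of the direction, not the actual research: each query token attends only to the last `window` positions, which bounds how much of the (image-token-heavy) context the model has to attend over at once.

    import torch

    def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
        # True = attention allowed: causal, and limited to the last `window` keys.
        i = torch.arange(seq_len).unsqueeze(1)  # query positions
        j = torch.arange(seq_len).unsqueeze(0)  # key positions
        return (j <= i) & (j > i - window)

    mask = sliding_window_mask(seq_len=8192, window=1024)
    # Passed as attn_mask to torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)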
Also - and this is speculation - I think the jump in multimodal capabilities we're seeing from models is only going to continue, so long context for images probably won't be a huge blocker as models improve.
Ex: reading contracts or legal documents. That's usually a 50-page document you can't effectively cherry-pick from, since different clauses or sections are referenced multiple times across the full document.
In these scenarios, it's almost always better to pass the full document into the LLM rather than running RAG. And if you're passing the full document it's better as text rather than images.
What am I missing?
Flash 2.5, Sonnet 3.7, etc. always provided me with very satisfactory image analysis. And, I might be making this up, but to me it feels like some models provide better responses when I give them the text as an image, instead of feeding "just" the text.
You need to apply things like quantization, single-vector conversions (using fixed dimensional encodings), and better indexing to ensure that multimodal RAG works at scale.
That is exactly what we're doing at Morphik :)
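For readers who haven't met those terms, here's a rough sketch of the first two ideas - binary quantization of the patch embeddings and a MUVERA-style fixed dimensional encoding via random partitioning. Shapes and bucket counts are made up, and this is not Morphik's actual code.

    import numpy as np

    rng = np.random.default_rng(0)

    def binarize(vectors: np.ndarray) -> np.ndarray:
        # 1 bit per dimension (the sign), packed into uint8: ~32x smaller than float32.
        return np.packbits(vectors > 0, axis=-1)

    def fixed_dim_encoding(vectors: np.ndarray, planes: np.ndarray) -> np.ndarray:
        # Assign each patch vector to a bucket by its signs against random
        # hyperplanes, average the vectors in each bucket, and concatenate.
        bits = (vectors @ planes.T) > 0                                # (n_patches, n_planes)
        buckets = bits.astype(int) @ (1 << np.arange(planes.shape[0]))  # bucket id per patch
        fde = np.zeros((2 ** planes.shape[0], vectors.shape[-1]), dtype=np.float32)
        for b in range(2 ** planes.shape[0]):
            members = vectors[buckets == b]
            if len(members):
                fde[b] = members.mean(axis=0)
        return fde.ravel()  # one flat vector, usable in any ordinary ANN index

    doc_patches = rng.standard_normal((1024, 128)).astype(np.float32)  # stand-in for ColPali output
    packed = binarize(doc_patches)                                     # cheap to store and scan
    single_vec = fixed_dim_encoding(doc_patches, planes=rng.standard_normal((4, 128)))

The single vector goes into the fast first-stage index; the original multivectors only come back out for re-ranking the top candidates.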
1. Have the LLM see the image and produce a text version using a kind of semantic markup (even hallucinated markup)
2. Use that text for most of the RAG
3. If the focus (of analysis or conversation) converges on one image, include that image in the context in addition to the text
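As a hedged sketch of step 1, assuming an OpenAI-style vision endpoint - the prompt, the model name, and the `vector_store` mentioned in the comments are placeholders, not any particular product's API:

    import base64
    from openai import OpenAI

    client = OpenAI()

    def describe_page(image_bytes: bytes) -> str:
        # Step 1: ask a vision model for a semantic-markup rendition of the page.
        b64 = base64.b64encode(image_bytes).decode()
        resp = client.chat.completions.create(
            model="gpt-4o",  # any vision-capable model
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Transcribe this page as structured markup: headings, "
                             "tables, and a description of every chart or figure."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        return resp.choices[0].message.content

    # Step 2: embed and index the markup like any other text chunk, keeping a pointer
    # back to the page image, e.g. vector_store.add(text=markup, metadata={"image_path": path}).
    # Step 3: at answer time, if retrieval keeps converging on one page, attach that
    # page's image to the prompt alongside its markup.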
If I use a simple prompt with GPT 4o on the Palantir slide from the article I get this: https://gist.github.com/ianb/7a380a66c033c638c2cd1163ea7b2e9... – seems pretty good!
There are cases where documents contain text with letters that look the same in many fonts. For example, 0 and O look identical in many fonts. So if you have a doc/xls/PDF/html, you lose information by converting it into an image.
For cases like serial numbers, not even humans can distinguish 0 vs O (or l vs I) by looking at them.
If OCR has a problem with varying fonts and text, there is zero reason to believe that using embeddings instead is immune to the same issue.
For that reason, IMO rendering a PDF page as an image is a very reasonable way to extract information out of it.
For the other formats you mentioned, I agree that it is probably better to parse the document instead.
Yeah, but when they do, it makes a difference.
Also, speaking from experience, most invoices do contain actual text.
1 vs I or 0 vs O are valid issues, but in practice - and there's probably selection bias here - we've seen documents with a ton of diagrams and charts (that are much simpler to deal with as images).
- LLMs are typically pre-trained on 4k text tokens and then extrapolated out to longer context windows (it's easy to go from 4000 text tokens to 4001). This is not possible with images due to how they're tokenized. As a result, you're out of distribution - hallucinations become a huge problem once you're dealing with more than a couple of images.
- Pdf's at 1536 × 2048 use 3 to 5X more tokens than the raw text (ie higher inference costs and slower responses); see the rough arithmetic after this list. Going lower results in blurry images.
- Images are inherently a much heavier representation in raw size too, you're adding latency to every request to just download all the needed images.
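To make the token-cost bullet concrete, a back-of-the-envelope estimate. The ~170 tokens per 512x512 tile is an assumption (providers price image tokens differently), and the text side assumes a moderately dense page:

    import math

    def image_tokens(width: int, height: int, tokens_per_tile: int = 170) -> int:
        # Hypothetical tile-based vision tokenizer; real tokenizers vary by model.
        tiles = math.ceil(width / 512) * math.ceil(height / 512)
        return tiles * tokens_per_tile

    page_as_image = image_tokens(1536, 2048)  # 3 x 4 = 12 tiles -> ~2040 tokens
    page_as_text = int(500 * 1.3)             # ~500 words/page at ~1.3 tokens/word -> ~650
    print(page_as_image / page_as_text)       # ~3x; sparser pages push the ratio higher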
Their very small benchmark is obviously going to outperform basic text chunking on finance docs heavy with charts and tables. I would be far more interested in seeing an OCR step added with Gemini (which can annotate images) and then comparing results.
An end to end image approach makes sense in certain cases (like patents, architecture diagrams, etc) but it's a last resort.
Definitely trade-offs to be made here, we found this to be the most effective in most cases.
An interesting property of the gemma3 family is that increasing the input image size does not increase processing memory requirements, because a second-stage encoder compresses it into a fixed number of tokens. Very neat in practice.
"There's multiple fundamental problems people need to be aware of. - LLM's are typically pre-trained on text tokens and then extrapolated out to longer context windows (it's easy to go from 4000 text tokens to 4001). This is not possible with images due to how they're tokenized. As a result, you're out of distribution - hallucinations become a huge problem once you're dealing with more than a couple of images. - A PNG at 512x 2048 is 3.5k more tokens than the raw text (so higher inference costs and slower responses). Going lower results in blurry images. - Images are inherently a much heavier representation in raw size too, you're adding latency to every request to just download all the needed images.
Their very small benchmark is obviously going to outperform basic text chunking on finance docs heavy with charts and tables. I would be far more interested in seeing an OCR step added with Gemini (which can annotate images) and then comparing results.
An end to end image approach makes sense in certain cases (like patents, architecture diagrams, etc) but it's a last resort."
It mostly gets it right but notice it changes "Pdf's at 1536 × 2048 use 3 to 5X more tokens" to "A PNG at 512x 2048 is 3.5k more tokens".
Ignoring the trade-offs introduced, the MUVERA paper presented a 90% drop in latency, with evidence in the form of a research paper. Yet you are reporting "99%" drops in latency. Big claims require big evidence.
Patents are difficult as they can include anything from abstract diagrams and chemical formulas to mathematical equations, so it tends to be really tricky to prepare the data in a way that can later be used by an LLM.
The simplest approach I found was to “take a picture” of each page of the document and ask an LLM to generate a JSON explaining the content (plus some other metadata such as page number, number of visual elements, and so on).
If any complicated image is present, simply ask the model to describe it. Once that is done, you have a JSON file that can be embedded into your vector store of choice.
I can’t speak to the price-to-performance ratio, but this approach seems easier and more efficient than what the author is proposing.
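For illustration, a minimal sketch of that per-page loop. `pdf2image` does the "take a picture" step, and `describe_page_as_json()` is a hypothetical stand-in for the vision-model call; the field names are illustrative, not the commenter's actual schema.

    import json
    from pdf2image import convert_from_path  # pip install pdf2image (needs poppler)

    def describe_page_as_json(page_image) -> dict:
        # Placeholder for the vision-model call (any provider), prompted to return
        # page_text plus a list of visual_elements with plain-language descriptions.
        return {"page_text": "", "visual_elements": []}

    def index_document(path: str, vector_store) -> None:
        for page_no, page_image in enumerate(convert_from_path(path, dpi=200), start=1):
            record = describe_page_as_json(page_image)
            record.update({
                "source": path,
                "page_number": page_no,
                "n_visual_elements": len(record.get("visual_elements", [])),
            })
            # Serialize the record and embed it like any other text chunk.
            vector_store.add(text=json.dumps(record), metadata=record)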
But it also illustrates to me that the opportunities with LLMs right now are primarily about reclassifying or reprocessing existing sources of value like patent documents. In the 90-00s many successful SW businesses were building databases to replace traditional filing.
Creating fundamentally new collections of value which require upfront investment seems to still be challenging for our economy.
Has anyone done any work to evaluate how good LLM parsing is compared to traditional OCR? I've only got anecdotal evidence saying LLMs are better. However, whenever I've tested it out, there was always an unacceptable level of hallucinations.
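One way to make that comparison less anecdotal is to score each extractor's output against a hand-checked transcript with character error rate. A plain edit-distance sketch (no special libraries; `ocr_text`, `llm_text`, and `truth` are whatever you have on hand):

    def edit_distance(a: str, b: str) -> int:
        # Standard Levenshtein distance via dynamic programming.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def cer(hypothesis: str, reference: str) -> float:
        return edit_distance(hypothesis, reference) / max(len(reference), 1)

    # Compare cer(ocr_text, truth) vs cer(llm_text, truth). Note CER won't flag an
    # LLM that hallucinates a plausible-looking number, so spot-check those separately.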
I’m so sick of this.
In our current setup, we have the multivectors stored as .npy in S3 Express storage. We use Turbopuffer for the vector search + filtering part. Pre-warming the namespace, and pre-fetching the most common vectors from S3 means that the search latency is almost indistinguishable from regular vector search.
ColPali with binary vectors worked fine, but to be honest there have been so many specific improvements to single vectors that switching to MUVERA gave us a huge boost.
Regular multivector ColPali also suffers from a similar issue: Chamfer distance is just hard to compute at scale. PLAID is a good solution if your corpus is constant. If it isn't, using regular multivector ColPali as a re-ranking step is a good bet.
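For reference, the re-ranking step itself is small. A sketch of ColBERT/ColPali-style MaxSim scoring - the sum-of-max form of the Chamfer similarity mentioned above - over multivectors loaded back from storage:

    import numpy as np

    def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
        # query_vecs: (n_query_tokens, d), doc_vecs: (n_patches, d), both L2-normalized.
        sims = query_vecs @ doc_vecs.T        # (n_query_tokens, n_patches)
        return float(sims.max(axis=1).sum())  # best patch per query token, summed

    def rerank(query_vecs, candidates):
        # candidates: list of (doc_id, multivector) pairs, e.g. .npy files fetched from S3.
        scored = [(doc_id, maxsim_score(query_vecs, dv)) for doc_id, dv in candidates]
        return sorted(scored, key=lambda s: s[1], reverse=True)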
LLMs are great, but use the right tool for the job.
In the context of RAG, the objective is to send information to the model, so LLMs are the right tool for the job.
After all, the documents/emails/presentations will cover the most common use cases. But we have databases that can answer far more of the questions the RAG might be asked than what lives in documents.
pilooch•7h ago
It's open source and available here: https://github.com/jolibrain/colette
It's not our primary business, so it's just lying there and we don't advertise it much, but it works, somehow, and with some tweaks it gets really efficient.
The true genius, though, is that the whole thing can be made fully differentiable, unlocking the ability to finetune the viz RAG on targeted datasets.
The layout model can also be customized for fine grained document understanding.
Adityav369•7h ago
Often the blocker becomes high-quality eval sets (which I guess is always the blocker).
wryun•5h ago
I agree it's better to have the full licence at top level, but is there a legal reason why this would be inadequate?