frontpage.

Start all of your commands with a comma

https://rhodesmill.org/brandon/2009/commands-with-comma/
143•theblazehen•2d ago•42 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
668•klaussilveira•14h ago•202 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
949•xnx•19h ago•551 comments

How we made geo joins 400× faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
122•matheusalmeida•2d ago•33 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
53•videotopia•4d ago•2 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
229•isitcontent•14h ago•25 comments

Jeffrey Snover: "Welcome to the Room"

https://www.jsnover.com/blog/2026/02/01/welcome-to-the-room/
16•kaonwarb•3d ago•19 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
28•jesperordrup•4h ago•16 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
223•dmpetrov•14h ago•117 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
330•vecti•16h ago•143 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
494•todsacerdoti•22h ago•243 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
381•ostacke•20h ago•95 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
359•aktau•20h ago•181 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
288•eljojo•17h ago•169 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
412•lstoll•20h ago•278 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
19•bikenaga•3d ago•4 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
63•kmm•5d ago•6 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
90•quibono•4d ago•21 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
256•i5heu•17h ago•196 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
32•romes•4d ago•3 comments

What Is Ruliology?

https://writings.stephenwolfram.com/2026/01/what-is-ruliology/
44•helloplanets•4d ago•42 comments

Where did all the starships go?

https://www.datawrapper.de/blog/science-fiction-decline
12•speckx•3d ago•5 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
59•gfortaine•12h ago•25 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
33•gmays•9h ago•12 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1066•cdrnsf•23h ago•446 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
150•vmatsiiako•19h ago•67 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
288•surprisetalk•3d ago•43 comments

Why I Joined OpenAI

https://www.brendangregg.com/blog/2026-02-07/why-i-joined-openai.html
149•SerCe•10h ago•138 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
183•limoce•3d ago•98 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
73•phreda4•13h ago•14 comments

Don't bother parsing: Just use images for RAG

https://www.morphik.ai/blog/stop-parsing-docs
328•Adityav369•6mo ago

Comments

pilooch•6mo ago
Some colleagues and I implemented exactly this six months ago for a French gov agency.

It's open source and available here: https://github.com/jolibrain/colette

It's not our primary business so it's just lying there and we don't advertise it much, but it works, with some tweaks to get it really efficient.

The true genius though is that the whole thing can be made fully differentiable, unlocking the ability to finetune the viz rag on targeted datasets.

The layout model can also be customized for fine grained document understanding.

Adityav369•6mo ago
Yeah the fine tuning is definitely the best part.

Often, the blocker becomes high quality eval sets (which I guess always is the blocker).

ted_dunning•6mo ago
You don't have a license in your repository top-level. That means that nobody who takes licensing at all seriously can use your stuff, even just for reference.
JSR_FDED•6mo ago
Great, thanks for sharing your code. Could you please add a license so I and others can understand if we're able to use it?
wryun•6mo ago
They do have: https://github.com/jolibrain/colette/blob/main/pyproject.tom...

I agree it's better to have the full licence at top level, but is there a legal reason why this would be inadequate?

pilooch•6mo ago
Good catch, will add it tomorrow. License is Apache2.
deadbabe•6mo ago
Standard practice now is to just have an LLM read the whole repo and write a new original version in a different language. It’s code laundering.
tobyhinloopen•6mo ago
This is something I've done as well - I wanted to scan all invoices that came into my mail, so I just exported ALL ATTACHMENTS from my mailbox and used a script to upload them one by one, forcing a tool call to extract "is invoice: yes / no" and a bunch of fields: invoice lines, company name, date, invoice number, etc.

It had a surprisingly high hit rate. It took over 3 hours of LLM calls, but who cares - it was completely hands-off. I then compared the invoices to my bank statements (aka I asked an LLM to do it) and it just missed a few invoices that weren't included as attachments (like those "click to download" mails). It did a pretty poor job matching invoices to bank statements (like "oh this invoice is a few dollars off but I'm sure it's this statement"), so I'm afraid I still need an accountant for a while.

"What did it cost"? I don't know. I used a cheap-ish model, Claude 3.7 I think.

taberiand•6mo ago
In your use case, for the simple data matching it gets wrong, I think it would be better to have the LLM write code that processes the input files (the raw text it produced from the images, plus the bank statements), rather than have the LLM try to match up the data in the files itself.
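
A minimal sketch of the kind of deterministic matching code an LLM could write once, assuming both sides have already been reduced to dicts with `amount` and `date` (as `datetime.date`) fields; the tolerances are illustrative:

    def match_invoices(invoices, statements, amount_tol=0.01, days_tol=14):
        """Match each invoice to at most one statement line by amount and date tolerance."""
        matches, unmatched = [], []
        remaining = list(statements)
        for inv in invoices:
            hit = next(
                (s for s in remaining
                 if abs(s["amount"] - inv["amount"]) <= amount_tol
                 and abs((s["date"] - inv["date"]).days) <= days_tol),
                None,
            )
            if hit:
                matches.append((inv, hit))
                remaining.remove(hit)
            else:
                unmatched.append(inv)
        return matches, unmatched
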
abc03•6mo ago
Related question: what is today's best solution for invoices?
ArnavAgrawal03•6mo ago
This would depend on the exact use case. Feeding in the invoice directly to the model is - in my opinion - the best way to approach this. If you need to search over them, then directly embedding them as images is definitely a strong approach. Here's something we wrote explaining the process: https://www.morphik.ai/docs/concepts/colpali
themanmaran•6mo ago
Hey we've done a lot of research on this side [1] (OCR vs direct image + general LLM benchmarking).

The biggest problem with direct image extraction is multipage documents. We found that single page extraction (OCR=>LLM vs Image=>LLM) slightly favored the direct image extraction. But anything beyond 5 images had a sharp fall off in accuracy compared to OCR first.

Which makes sense, long context recall over text is already a hard problem, but that's what LLMs are optimized for. Long context recall over images is still pretty bad.

[1] https://getomni.ai/blog/ocr-benchmark

ArnavAgrawal03•6mo ago
That's an interesting point. We've found that for most use cases, over 5 pages of context is overkill. Having a small LLM conversion layer on top of images also ends up working pretty well (i.e. instead of direct OCR, passing batches of 5 images - if you really need that many - to smaller vision models and having them extract the most important points from the document).

We're currently researching surgery on the cache or attention maps for LLMs to have larger batches of images work better. Seems like Sliding window or Infinite Retrieval might be promising directions to go into.

Also - and this is speculation - I think that the jump in multimodal capabilities that we're seeing from models is only going to increase, meaning long-context for images is probably not going to be a huge blocker as models improve.

themanmaran•6mo ago
This just depends a lot on how well you can parse down the context prior to passing to an LLM.

Ex: Reading contracts or legal documents. Usually a 50 page document that you can't very effectively cherry pick from. Since different clauses or sections will be referenced multiple times across the full document.

In these scenarios, it's almost always better to pass the full document into the LLM rather than running RAG. And if you're passing the full document it's better as text rather than images.

jasonthorsness•6mo ago
It makes sense that a lossy transformation (OCR which removes structure) would be worse than perceptually lossless (because even if the PDF file has additional information, you only see the rendered visual). But it's cool and a little surprising that the multi-modal models are getting this good at interpreting images!
emanuer•6mo ago
Could someone please help me understand how a multi-modal RAG does not already solve this issue?[1]

What am I missing?

Flash 2.5, Sonnet 3.7, etc. always provided me with very satisfactory image analysis. And, I might be making this up, but to me it feels like some models provide better responses when I give them the text as an image, instead of feeding "just" the text.

[1] https://www.youtube.com/watch?v=p7yRLIj9IyQ

ArnavAgrawal03•6mo ago
Multimodal RAG is exactly what we argue for. In their original state, though, multivectors (that form the basis for multi-modal RAG) are very unwieldy - computing the similarity scores is very expensive and so scaling them up in this state is hard.

You need to apply things like quantization, single-vector conversions (using fixed dimensional encodings), and better indexing to ensure that multimodal RAG works at scale.

That is exactly what we're doing at Morphik :)
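
As a rough illustration of the kind of tricks being referred to (not Morphik's actual pipeline): binary quantization of the ColPali-style multivectors for cheaper storage, plus a crude mean-pooled single vector as a stand-in for the fixed dimensional encodings, which a standard ANN index can search before exact re-ranking.

    import numpy as np

    def binarize(multivec: np.ndarray) -> np.ndarray:
        """Quantize a (n_tokens, dim) float multivector to bits: far cheaper to store."""
        return (multivec > 0).astype(np.uint8)

    def pool_to_single_vector(multivec: np.ndarray) -> np.ndarray:
        """Collapse to one fixed-size, normalized vector so an ordinary ANN index can
        do the cheap first-stage search; exact multivector scoring only runs on the
        shortlist it returns."""
        v = multivec.mean(axis=0)
        return v / (np.linalg.norm(v) + 1e-9)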

barrenko•6mo ago
And the Gemini(s) aren't already doing this at GoogleCorp?
urbandw311er•6mo ago
Something just feels a bit off about this piece. It seems to labour the point about how “beautiful” or “perfect” their solution is a few times too many, to the point where it starts to feel more like marketing than any sort of useful technical observation.
programjames•6mo ago
I disagree. It feels like something you would say when you finally come across the "obviously right" solution, that's easier to implement and simpler to describe. As Kolmogorov said, the simplest solution is exponentially more correct than the others.
bravesoul2•6mo ago
It is marketing of course. Regardless of what it says it's a company blog. That sets constraints on the sort of stuff they say vs. a regular blog. Not picking on this company as it is the same for all such blogs.
ianbicking•6mo ago
Using modern tools I would naturally be inclined to:

1. Have the LLM see the image and produce a text version using a kind of semantic markup (even hallucinated markup)

2. Use that text for most of the RAG

3. If the focus (of analysis or conversation) converges on one image, include that image in the context in addition to the text

If I use a simple prompt with GPT 4o on the Palantir slide from the article I get this: https://gist.github.com/ianb/7a380a66c033c638c2cd1163ea7b2e9... – seems pretty good!
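
A rough sketch of steps 1 and 2 above, with hypothetical `describe_image` and `embed` helpers standing in for the model calls (step 3 is just attaching the stored image back into the context once the conversation converges on a page):

    def describe_image(image_bytes: bytes) -> str:
        """Placeholder: ask a vision model for a semantic-markup rendering of the page."""
        raise NotImplementedError

    def embed(text: str) -> list[float]:
        """Placeholder: any text embedding model."""
        raise NotImplementedError

    def build_index(pages: dict[str, bytes]) -> dict[str, dict]:
        index = {}
        for page_id, image in pages.items():
            markup = describe_image(image)   # step 1: text version, markup and all
            index[page_id] = {
                "markup": markup,
                "vector": embed(markup),     # step 2: RAG runs over the text
                "image": image,              # kept around for step 3
            }
        return index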

ashishb•6mo ago
I speak from experience that this is a bad idea.

There are cases where documents contain text with letters that look the same in many fonts. For example, 0 and O look identical in many fonts. So if you have a doc/xls/PDF/html, you lose information by converting it into an image.

For cases like serial numbers, not even humans can distinguish 0 vs O (or l vs I) by looking at them.

weego•6mo ago
This is within the context of using it as an alternative to OCR, which would suffer the same issues, with more duct tape and string infrastructure and cost.
ashishb•6mo ago
You can win any race if you can cherry-pick your competitors.
llm_nerd•6mo ago
Strangely the linked marketing text repeatedly comments regarding OCR errors (I counted at least 4 separate instances), which is extremely weird because such a visual RAG suffers precisely the same problem. It is such a weird thing to repeatedly harp on.

If the OCR has a problem understanding varying fonts and text, there is zero reason using embeddings instead is immune to this.

vlovich123•6mo ago
I'm confused. Wouldn't the LLM be able to read the text more correctly than traditional OCR, by virtue of weighing what the glyphs look like against what, from training, they are likely to be? I would think it would be prone to making fewer typographic interpretation errors than a more traditional mechanical algorithm.
llm_nerd•6mo ago
Modern OCR is using machine learning technologies, including ViT and precisely the same models and technologies used in the linked solution. I mean, if their comparison was with OCR from 2002, sure, but they're comparing against modern OCR solutions that generate text representations of documents, using the very latest machine learning innovations and massive models (along with textual transformer-based contextual inferrals), with their own solution which uses precisely the same stack. It's a weird thing for them to continually harp on.

Their solution is precisely as subject to ambiguities of text that the comparative OCR solutions are.

zffr•6mo ago
PDFs don’t always contain actual text. Sometimes they just contain instructions to draw the letters.

For that reason, IMO rendering a PDF page as an image is a very reasonable way to extract information out of it.

For the other formats you mentioned, I agree that it is probably better to parse the document instead.
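
A small sketch of that rendering step using PyMuPDF (one reasonable choice; pdf2image over Poppler is a common alternative):

    import fitz  # PyMuPDF

    def pdf_pages_to_png(path: str, dpi: int = 150) -> list[str]:
        """Rasterize every page, so the model sees exactly what a reader would,
        regardless of whether the PDF carries real text or just drawing instructions."""
        filenames = []
        with fitz.open(path) as doc:
            for i, page in enumerate(doc):
                pix = page.get_pixmap(dpi=dpi)
                name = f"{path}.page{i:03d}.png"
                pix.save(name)
                filenames.append(name)
        return filenames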

ArnavAgrawal03•6mo ago
Completely agree with this. This is what we've observed in production too. Embedding images makes the RAG a lot more robust to the "inner workings" of a document.
ashishb•6mo ago
> PDFs don’t always contain actual text. Sometimes they just contain instructions to draw the letters.

Yeah, but when they do, it makes a difference.

Also, speaking from experience, most invoices do contain actual text.

barrenko•6mo ago
The more I learn about PDF, the more I am: what?
fc417fc802•6mo ago
It makes sense. If you "print" to pdf it makes far more sense to keep the vector representation around. Rasterizing it would simultaneously bloat the file size and lower the quality level when transformed.
ArnavAgrawal03•6mo ago
For HTML, in a lot of cases, using the tags to chunk things better works. However, I've found that when I'm trying to design a page, showing models the actual image of the page leads to way better debugging than just sending the code back.

1 vs I or 0 vs O are valid issues, but in practice - and there's probably selection bias here - we've seen documents with a ton of diagrams and charts (that are much simpler to deal with as images).

serjester•6mo ago
There's multiple fundamental problems people need to be aware of.

- LLM's are typically pre-trained on 4k text tokens and then extrapolated out to longer context windows (it's easy to go from 4000 text tokens to 4001). This is not possible with images due to how they're tokenized. As a result, you're out of distribution - hallucinations become a huge problem once you're dealing with more than a couple of images.

- Pdf's at 1536 × 2048 use 3 to 5X more tokens than the raw text (ie higher inference costs and slower responses). Going lower results in blurry images.

- Images are inherently a much heavier representation in raw size too, you're adding latency to every request to just download all the needed images.

Their very small benchmark is obviously going to outperform basic text chunking on finance docs heavy with charts and tables. I would be far more interested in seeing an OCR step added with Gemini (which can annotate images) and then comparing results.

An end to end image approach makes sense in certain cases (like patents, architecture diagrams, etc) but it's a last resort.
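
A back-of-envelope check on the token math above, under the assumption of roughly 28x28-pixel effective patches after any pooling (real image tokenizers differ by model, tiling, and downsampling, so treat the numbers as illustrative only):

    patch = 28                                            # assumed effective patch size
    width, height = 1536, 2048
    image_tokens = (width // patch) * (height // patch)   # 54 * 73 = 3,942 tokens/page
    text_tokens = 800                                     # a dense text page often lands ~600-1,200
    print(image_tokens, round(image_tokens / text_tokens, 1))  # ~3,942 tokens, ~4.9x the text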

ArnavAgrawal03•6mo ago
You can add OCR with Gemini, and presumably that would lead to better results than the OCR model we compared against. However, it's important to note that then you're guaranteeing that the entire corpus of documents you're processing will go through a large VLM. That can be prohibitively expensive and slow.

Definitely trade-offs to be made here, we found this to be the most effective in most cases.

serjester•6mo ago
VLM’s capable of parsing images with high fidelity are 10 - 50X cheaper than the frontier models. Any savings from not parsing, are quickly going to be wiped out if someone has any actual traffic. Not to mention the massive hits to long context accuracy and latency.
pilooch•6mo ago
True, but modern models alleviate these issues with tricks such as Gemma 3's pan & scan and training at multiple resolutions.

An interesting property of the Gemma 3 family is that increasing the input image size does not increase processing memory requirements, because a second-stage encoder compresses it into a fixed number of tokens. Very neat in practice.

CGamesPlay•6mo ago
This makes sense, but is there something to shaking up the RAG pipeline? Perhaps you could take each RAG result and then do a model processing step that asks it to extract relevant information from the image directly pertaining to the user query, once per result, and then aggregate those (text) results as the input to your final generation. That would sidestep the token limit for multiple images, and allow parallelizing the image understanding step.
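
A sketch of that two-stage idea: a per-image extraction pass (with a hypothetical `extract_relevant` vision call) run in parallel, then a text-only aggregation for the final generation.

    from concurrent.futures import ThreadPoolExecutor

    def extract_relevant(query: str, image_bytes: bytes) -> str:
        """Placeholder: 'pull out only what in this page image answers <query>'."""
        raise NotImplementedError

    def answer_from_images(query: str, images: list[bytes], generate) -> str:
        with ThreadPoolExecutor(max_workers=8) as pool:
            notes = list(pool.map(lambda img: extract_relevant(query, img), images))
        context = "\n\n".join(notes)     # text only: sidesteps multi-image context limits
        return generate(query, context)  # final generation over the aggregated notes
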
tom_m•6mo ago
That's what their document parse product is for. I think people feed things to an LLM sometimes and sure it might work but it could also be the wrong tool for the job. Not everything needs to run through the LLM.
hdjrudni•6mo ago
LLMs are exactly the tool to use when other parsing methods fail due to poor formatting. AI is for the fuzzy cases.
joegibbs•6mo ago
I think it would be good to combine traditional OCR with an LLM to fix up mistakes and add diagram representations - LLMs have the problem of just inventing plausible-sounding text if they can't read it, which is worse than just garbling the result. For instance, GPT-4.1 worked perfectly with a screenshot of your comment at 1296 × 179, but if I zoom out to 50% and give it a 650 × 84 screenshot instead, the result is:

"There's multiple fundamental problems people need to be aware of. - LLM's are typically pre-trained on text tokens and then extrapolated out to longer context windows (it's easy to go from 4000 text tokens to 4001). This is not possible with images due to how they're tokenized. As a result, you're out of distribution - hallucinations become a huge problem once you're dealing with more than a couple of images. - A PNG at 512x 2048 is 3.5k more tokens than the raw text (so higher inference costs and slower responses). Going lower results in blurry images. - Images are inherently a much heavier representation in raw size too, you're adding latency to every request to just download all the needed images.

Their very small benchmark is obviously going to outperform basic text chunking on finance docs heavy with charts and tables. I would be far more interested in seeing an OCR step added with Gemini (which can annotate images) and then comparing results.

An end to end image approach makes sense in certain cases (like patents, architecture diagrams, etc) but it's a last resort."

It mostly gets it right but notice it changes "Pdf's at 1536 × 2048 use 3 to 5X more tokens" to "A PNG at 512x 2048 is 3.5k more tokens".

woctordho•6mo ago
Context window extrapolation should work with hierarchical/multi-scale tokenization of images, such as Haar wavelets
jamesblonde•6mo ago
"The results transformed our system, and our query latency went from 3-4s to 30ms."

Ignoring the trade-offs introduced, the MUVERA paper presented a 90% drop in latency, with evidence in the form of a research paper. Yet you are reporting a "99%" drop in latency. Big claims require big evidence.

thor-rodrigues•6mo ago
I spent a good amount of time last year working on a system to analyse patent documents.

Patents are difficult, as they can include anything from abstract diagrams and chemical formulas to mathematical equations, so it tends to be really tricky to prepare the data in a way that can later be used by an LLM.

The simplest approach I found was to “take a picture” of each page of the document, and ask for an LLM to generate a JSON explaining the content (plus some other metadata such as page number, number of visual elements, and so on)

If any complicated image is present, simply ask for the model to describe it. Once that is done, you have a JSON file that can be embedded into your vector store of choice.

I can't speak to the price-to-performance ratio, but this approach seems easier and more efficient than what the author is proposing.
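
A sketch of that page-to-JSON flow; `describe_page`, `embed`, and `store` are hypothetical stand-ins for the vision model, embedding model, and vector store, and the metadata fields mirror the ones mentioned above.

    import json

    def describe_page(image_bytes: bytes) -> dict:
        """Placeholder: ask the model to explain the page and describe any figures,
        returning e.g. {"content": ..., "num_visual_elements": ...}."""
        raise NotImplementedError

    def index_document(pages: list[bytes], embed, store) -> None:
        for page_number, image in enumerate(pages, start=1):
            desc = describe_page(image)
            record = {
                "page_number": page_number,
                "num_visual_elements": desc.get("num_visual_elements", 0),
                "content": desc.get("content", ""),
            }
            store.add(vector=embed(record["content"]), payload=json.dumps(record))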

cheschire•6mo ago
how often has the model hallucinated the image though?
monkeyelite•6mo ago
This is a great example of how to use LLMs thanks.

But it also illustrates to me that the opportunities with LLMs right now are primarily about reclassifying or reprocessing existing sources of value like patent documents. In the 90-00s many successful SW businesses were building databases to replace traditional filing.

Creating fundamentally new collections of value which require upfront investment seems to still be challenging for our economy.

Adityav369•6mo ago
You can ask the model to describe the image, but that is inherently lossy. What if it is a chart and the model gets most x, y pairs, but the user asks about a missing "x" or "y" value? Presenting the image at inference is effective since you're guaranteeing that the LLM is able to answer exactly the user's question. The only blocker here becomes how good retrieval is, and that's a smaller problem to solve. This approach allows us to only solve for passing in relevant context; the rest is taken care of by the LLM. Otherwise the problem space expands to correct OCR, parsing, and getting all possible descriptions of images from the model.
bravesoul2•6mo ago
Looks like they cracked it? But I found both OCR and reading the whole page (various OpenAI models) have been unusable for scanning a magazine, say, and working out which heading goes with what text.
ArnavAgrawal03•6mo ago
Would love to try our hand at it! We have a couple magazine use cases, but the harder it is, the more fun it is :)
meander_water•6mo ago
> You might still need to convert a document to text or a structured format, that’s essential for syncing information into structured databases or data lakes. In those cases, OCR works (with its quirks), but in my experience passing the original document to an LLM is better

Has anyone done any work to evaluate how good LLM parsing is compared to traditional OCR? I've only got anecdotal evidence saying LLMs are better. However, whenever I've tested it out, there was always an unacceptable level of hallucinations.

commanderkeen08•6mo ago
> The ColPali model doesn't just "look" at documents. It understands them in a fundamentally different way than traditional approaches.

I’m so sick of this.

zzleeper•6mo ago
In what sense?
commanderkeen08•6mo ago
“It’s not just X, it’s Y” is the calling card of ChatGPT right now.
etk934•6mo ago
Can you report the relative storage requirements for multivector ColPali vs multivector ColPali with binary vectors vs MUVERA vs a single vector per page? Can your system scale to millions of vectors?
ArnavAgrawal03•6mo ago
Yes! We have a use case in production with over a million pages. MUVERA is good for this, since it is basically akin to regular vector search + re-ranking.

In our current setup, we have the multivectors stored as .npy in S3 Express storage. We use Turbopuffer for the vector search + filtering part. Pre-warming the namespace, and pre-fetching the most common vectors from S3 means that the search latency is almost indistinguishable from regular vector search.

ColPali with binary vectors worked fine, but to be honest there have been so many specific improvements to single vectors that switching to MUVERA gave us a huge boost.

Regular multivector ColPali also suffers from a similar issue. Chamfer distance is just hard to compute at scale. PLAID is a good solution if your corpus is constant. If it isn't, using the regular multivector ColPali as a re-ranking step is a good bet.
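
For reference, the late-interaction (MaxSim/Chamfer-style) score that makes exact multivector scoring expensive at scale, but cheap as a re-ranking step over an ANN shortlist, sketched in NumPy:

    import numpy as np

    def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
        """query_vecs: (n_query_tokens, d), doc_vecs: (n_doc_tokens, d).
        For each query token, take its best-matching doc token, then sum."""
        return float((query_vecs @ doc_vecs.T).max(axis=1).sum())

    def rerank(query_vecs: np.ndarray, candidates: list[tuple[str, np.ndarray]]):
        """candidates: (doc_id, doc_multivector) pairs from the first-stage search."""
        return sorted(candidates, key=lambda c: maxsim(query_vecs, c[1]), reverse=True)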

coyotespike•6mo ago
Wow, this is tempting me to use Morphik to add memory to in terminal AI agents for personal use even. Looks powerful and easy.
ArnavAgrawal03•6mo ago
Would love feedback :)
hdjrudni•6mo ago
I was trying to copy a schedule into Gemini to ask it some questions about it. I struggled with copying and pasting it for several minutes, just wouldn't come out right even though it was already in HTML. Finally gave up, screenshotted it, and then put black boxes over the parts I wanted Gemini to ignore (irrelevant info) and pasted that image in. It worked very well.
tom_m•6mo ago
Is the text flattened? If not, you don't need to run PDFs through OCR; the text can be extracted, even with JavaScript in the web browser. You only need OCR for handwritten or flattened text. Google's document parsing can help as well. You could also run significantly cheaper tools on the PDF first. Just sending everything to the LLM is more costly. What about massive PDFs? They sometimes won't fit in the context window, or will cost a lot.

LLMs are great, but use the right tool for the job.
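
A cheap first pass along those lines, sketched with pypdf: if the PDF has a usable text layer, take it directly, and reserve OCR or vision models for pages that don't.

    from pypdf import PdfReader

    def extract_text_if_present(path: str, min_chars: int = 50) -> str | None:
        """Return the embedded text layer, or None if the PDF looks scanned/flattened."""
        reader = PdfReader(path)
        text = "\n".join((page.extract_text() or "") for page in reader.pages)
        return text if len(text.strip()) >= min_chars else None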

ArnavAgrawal03•6mo ago
Our argument in general is that even in the non-flattened cases, we see complex diagrams pop up in documents that won't work with a text-based approach.

In the context of RAG, the objective is to send information to the model, so LLMs are the right tool for the job.

K0balt•6mo ago
Can multimodal llms read the pdf file format to extract text components as well as graphical ones? Because that would seem to me to be the best way to go.
imperfect_light•6mo ago
The emphasis on PDFs for RAG seems like something out of the 1990s. Are there any good frameworks for using RAG if your company doesn't go around creating documents left and right?

After all, the documents/emails/presentations will cover the most common use cases. But we have databases that hold answers to far more of the questions the RAG might be asked than what lives in documents.

petesergeant•6mo ago
That's because PDFs are the hard part. If you're starting with small pieces of text, RAG becomes much much easier.
imperfect_light•6mo ago
My question is less about PDFs and more about the notion that all the facts needed for the RAG are in documents. In my experience just a fraction of the questions that might be useful exist in a document somewhere. There must be a variation of RAGs that are pulling not from documents, but from databases using some semantic model.
petesergeant•6mo ago
Sure, but the process for this is laughably easy: you render each record as text with just a minimal amount of surrounding text to place it into context, and submit that to whatever your embedding-maker is to get the embedding. You could potentially store the embedding in the same DB row if you have a DB that's happy with vector searches.
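
A tiny sketch of that record-to-embedding step; `embed` is a hypothetical embedding call and the field names are made up for illustration (the vector could then live in, say, a pgvector column next to the row).

    def embed(text: str) -> list[float]:
        """Placeholder: any text embedding model."""
        raise NotImplementedError

    def row_to_text(row: dict) -> str:
        # Just enough surrounding wording that the embedding is meaningful on its own.
        return f"Customer order {row['order_id']}: {row['item']} x{row['qty']}, status {row['status']}"

    def embed_row(row: dict) -> tuple[str, list[float]]:
        text = row_to_text(row)
        return text, embed(text)  # store both alongside the row for vector search
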
ekianjo•6mo ago
I did a bit of work in that space. It's not that simple. Models that work with images are not perfect either and often have problems finding the right information. So you trade parsing issues for corner cases that are much more difficult to debug. At the end of the day, whatever works better should be assessed with your test/validation set.
anshumankmr•6mo ago
The problem is that transcription errors will mess things up for sure. With text, you just do not have to worry about transcription errors. Sure, it's a bit tricky handling tables, and chunking is a problem as well, but unless my document is more images than text, I would prefer handling it the "old-fashioned" way.
CaptainFever•6mo ago
Interesting article, but this is also an ad for a SaaS.
budududuroiu•6mo ago
I get that ColPali is straightforward and powerful, but document processing still has many advantages:

- lexical retrieval (based on BM25, TF-IDF), which is better at capturing specific terms
- full-text search
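
A minimal sketch of that lexical side using the rank_bm25 package; in practice you'd pair these scores with the image/vector retriever in a hybrid setup.

    from rank_bm25 import BM25Okapi

    docs = [
        "invoice total due 2024-05-01 ACME GmbH",
        "quarterly revenue chart by region",
        "contract clause 7.2 termination for convenience",
    ]
    bm25 = BM25Okapi([d.lower().split() for d in docs])

    query = "clause 7.2 termination".lower().split()
    scores = bm25.get_scores(query)                  # exact-term matching per document
    best = max(range(len(docs)), key=lambda i: scores[i])
    print(docs[best], scores[best])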

constantinum•6mo ago
LLMs are not yet there for complex and diverse document parsing use cases, especially at an enterprise scale (processing millions of pages).

Some of the reasons are:

Complex layouts, nested tables, tables spanning multiple pages, checkboxes, radio buttons, off-oriented scans, controlling LLM costs, checking for hallucinations, human-in-the-loop integration, and privacy.

More on the issues: https://unstract.com/blog/why-llms-struggle-with-unstructure...

freezed8•6mo ago
This blog post makes some good points about using vision models for retrieval, but I do want to call out a few problems:

1. The blog conflates indexing/retrieval with document parsing. Document parsing itself is the task of converting a document into a structured text representation, whether it's markdown/JSON (or in the case of extraction, an output that conforms to a schema). It has many uses, one of which is RAG, but many of which are not necessarily RAG related.

ColPali is great for retrieval, but you can't use ColPali (at least natively) for pure document parsing tasks. There's a lot of separate benchmarks for just evaluating doc parsing while the author mostly talks about visual retrieval benchmarks.

2. This whole idea of "You can DIY document parsing by screenshotting a page" is not new at all, lots of people have been talking about it! It's certainly fine as a baseline and does work better than standard OCR in many cases.

a. But from our experience there's still a long tail of accuracy issues.
b. It's missing metadata like confidence scores/bounding boxes etc. out of the box.
c. Honestly this is underrated, but creating a good screenshotting pipeline itself is non-trivial.

3. In general for retrieval, it's helpful to have both text and image representations. Image tokens are obviously much more powerful. Text tokens are way cheaper to store and let you do things like retrieve entire documents (instead of just chunks) and input them into the LLM.

(disclaimer: I am ceo of llamaindex, and we have worked on both document parsing and retrieval with LlamaCloud, but I hope my point stands in a general sense)