> Once confirmed, we corrected the extracted grade immediately.
> Where the extracted grade was accurate, we provided feedback and guidance to the reporting program or school about its interpretation and the extraction methodology.
I still dislike the term "hallucinations". It comes across like the model did something wrong. It did not; factually wrong outputs happen by design.
The issue in the system described in this post is OCR inaccuracy: it's convenient to use LLMs for OCR of PDFs because PDFs do not have standard layouts, and just using the text strings extracted from the PDF's code results in incorrect paragraph/sentence sequencing.
The way they *should* have used RAG is to ensure that subsentence strings extracted via LLM appear in the PDF at all, but it appears they were just trusting the output without automated validation of the OCR.
I could be totally wrong here.
Though I imagine scenarios where the PDF is just an image (e.g. a scan of a form), and thus the validation would not work.
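Where the PDF does have a text layer, a minimal version of that validation could be a plain substring check against the extracted text, something like the sketch below (pypdf and every helper name here are assumptions, not whatever Thalamus actually runs):

```python
# Sketch, not Thalamus's code: check that snippets an LLM claims to have
# extracted actually occur in the PDF's own text layer. Only meaningful
# when the PDF has a text layer (i.e. it is not just a scanned image).
import re
from pypdf import PdfReader

def normalize(s: str) -> str:
    # Collapse whitespace and lowercase so layout differences don't matter.
    return re.sub(r"\s+", " ", s).lower().strip()

def validate_extraction(pdf_path: str, snippets: list[str]) -> dict[str, bool]:
    reader = PdfReader(pdf_path)
    full_text = normalize(" ".join(page.extract_text() or "" for page in reader.pages))
    # Any snippet that never appears verbatim in the source gets flagged for review.
    return {s: normalize(s) in full_text for s in snippets}

# Hypothetical usage:
# validate_extraction("transcript.pdf", ["Internal Medicine", "Honors"])
```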
I do agree with the parent post. Calling them hallucinations is not an accurate way of describing what is happening and using such terms to personify these machines is a mistake.
This isn't to say the outputs aren't useful; we see that they can be very useful... when used well.
... but (a) we do, and (b) there's all kinds of dimensions of factuality not encoded in the training data that can only be unreliably inferred (in the sense that there is no reason to believe the algorithm has encoded a way to synthesize true output from the input at all).
[1] https://en.wikipedia.org/wiki/Hyatt_Regency_walkway_collapse
TL;DR: being wrong, even very wrong != hallucination
can you hear yourself? you are providing excuses for a computer system that produces erroneous output.
He is not saying it's OK for this system to provide wrong answers; he is saying it's normal for information from LLMs to be unreliable, and thus the issue is not coming from the LLM but from the way it is being used.
Eh, I don't think that's a productive thing to say. There's an immense business pressure to deploy LLMs in such decision-making contexts, from customer support, to HR, to content policing, to real policing. Further, because LLMs are improving quickly, there is a temptation to assume that maybe the worst is behind us and that models don't make too many mistakes anymore.
This applies to HN folks too: every other person here is building something in this space. So publicizing failures like this is important, and it's important to keep doing it over and over again so that you can't just say "oh, that was a 3o problem, our current models don't do that".
While I do see the issue with the word hallucination humanizing the models, I have yet to come up with, or see, a word that explains the problem so well to non-technical people. And quite frankly those are the people who need to understand that this problem still very much exists and is likely never going away.
Technically yeah the model is doing exactly what it is supposed to do and you could argue that all of its output is "hallucination". But for most people the idea of a hallucinated answer is easy enough to understand without diving into how the systems work, and just confusing them more.
Calling it a hallucination leads people to think that they just need to stop it from hallucinating.
In layman's terms, it'd be better to understand that LLMs are schizophrenic. Even though that's not really accurate either.
A better framing is that the models only understand reality through what they've read about it, and we then ask them for answers "in their own words", but that's a lot longer than "hallucination".
It's like the gag in The 40-Year-Old Virgin where he describes breasts as feeling like bags of sand.
If a model hallucinates it did do something wrong, something that we would ideally like to minimize.
The fact that it’s impossible to completely get rid of hallucinations is separate.
An electric car uses electricity, it’s a fundamental part of its design. But we’d still like to minimize electricity usage.
I know this sounds pedantic, but I think that the phenomenon itself is very human, so it's fascinating that we created something artificial that is a little bit like another human, and here it goes, producing similar issues. Next thing you know it will have emotions, and judgment.
They actually write it:
> For this cycle, we have refined our model architecture, expanded the catalog of medical schools and grading schemas, and upgraded to include the GPT-5o-mini model for increased accuracy and efficiency. Real-time validation has also been strengthened to provide programs with more reliable percentile and grade distribution data. Together, these enhancements make transcript normalization an even more powerful tool to support fair, consistent, and data-driven review in the transition to residency.
> Make a write-up about it
> Using AI
> Which then hallucinates more stuff
Funny stuff.
>Look inside
>GPT-4o-mini
If a tool is designed to extract the grades for easy access, do we really believe that the end users will then verify the grades manually to confirm the output? If they’re doing that, why use the tool at all?
Maybe the tool can extract what it believes is the grades section and show a screenshot for a human to interpret.
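A sketch of what that could look like, assuming the extraction step also reports which page it read the grade from (pdf2image here is just one way to render the page; none of this is the vendor's actual design):

```python
# Sketch: render the page the extractor claims the grade came from, so a human
# can see the source next to the extracted value before accepting it.
# Assumes pdf2image (and poppler) is installed; page_number is 1-indexed.
from pdf2image import convert_from_path

def render_grade_page(pdf_path: str, page_number: int, out_path: str) -> str:
    images = convert_from_path(pdf_path, dpi=200,
                               first_page=page_number, last_page=page_number)
    images[0].save(out_path)  # format inferred from the file extension, e.g. .png
    return out_path

# A review UI would then show render_grade_page("transcript.pdf", 3, "page3.png")
# alongside the extracted grade and require explicit human confirmation.
```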
Because the people purchasing the tool aren't the ones who will actually use it. The former get a "Deployed AI tooling to X to increase productivity by X%" on their resume. The latter get left to deal with the mess.
So you really have to double check when researching information that really matters.
Source: spouse matched in 2018. It was one of the most stressful periods of our lives.
GPT-5o-whatever ain’t a thing.
The irony is sweeeeet
It is very tempting to do it, obviously, given the cost difference, but it's not worth it. On the other hand, people talk about LLMs with a broad brush, and, while it would still need testing, I would be surprised to hear that GPT-5-pro with thinking had an issue like this.
This is why it is important that users (us) don't fall into the anthropomorphism trap and call programs what they are and describe what they really do. Especially important since the general populace seems to be deluded by OpenAI's and Anthropic's aggressive lies and believes that LLMs can think.
That's not a real model name: there's GPT-5-mini and GPT-4o-mini but no GPT-5o-mini.
UPDATE: Here's where the GPT-5o-mini came from: https://www.thalamusgme.com/blogs/methodology-for-creation-a... - via this comment: https://news.ycombinator.com/item?id=45581030
That said, I've been disappointed by OCR performance from the GPT-5 series. I caught it hallucinating some of the content for a pretty straight-forward newspaper scan a few weeks ago: https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-cod...
Gemini 2.5 is much more reliable for extracting text from images in my experience.
Transcript data is usually found in some sort of table, but they're some of the hardest tables for OCR or LLMs to interpret. There's all kinds of edge cases with tables split across pages, nested cells, side-by-side columns, etc. The tabular layout breaks every off-the-shelf OCR engine we've run across (and we've benchmarked all of them). To make it worse, there's no consistency at all (every school in the country basically has their own format).
What we've seen help in these cases are:
1. VLM based review and correction of OCR errors for tables. OCR is still critical for determinism, but VLMs really excel at visually interpreting the long tail.
2. Using both HTML and Markdown as an LLM input format. For some of the edge cases, markdown cannot represent certain structures (e.g. a table cell nested within a table cell). HTML is a much better representation for this, and models are trained on a lot of HTML data.
The data ambiguity is a whole set of problems on its own (e.g. how do you normalize what a "semester" is across all the different ways it can be written). Eval sets + automated prompt engineering can get you pretty far though.
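To illustrate the eval-set side of that (all function names and the label format below are hypothetical, not Extend's tooling):

```python
# Tiny eval harness sketch: run an extraction pipeline over labelled transcripts
# and report field-level accuracy, so prompt changes can be compared objectively.
def evaluate(extract_grades, labelled_cases: list[dict]) -> float:
    correct = total = 0
    for case in labelled_cases:
        predicted = extract_grades(case["pdf_path"])   # e.g. {"Internal Medicine": "Honors"}
        for course, expected in case["expected"].items():
            total += 1
            correct += int(predicted.get(course) == expected)
    return correct / total if total else 0.0

# print(f"field accuracy: {evaluate(my_pipeline, ground_truth_cases):.1%}")
```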
Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.ai/).
Did you do some kind of zod schema, or compare the error rate of how different models perform for this task? Did you bother setting up any kind of JSON output at all? Did you add a second validation step with a different model and then compare whether their numbers are the same?
It looks like no; they just deferred the whole thing to authority. Technically there's no difference between them saying that GPT-5-mini or Llama-2-7B did this.
Literally every single LLM will make errors and hallucinate. It's your job to put all the scaffolding around it to make sure it doesn't, or that it does so a lot less than a skilled human would.
So then have you measured the error rate or maybe tried to put some kind of error catching mechanism just like any professional software would do?
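For what it's worth, a minimal Python sketch of that kind of scaffolding, with pydantic standing in for a zod schema (call_model and the model names are placeholders, not anything Thalamus has described):

```python
# Sketch: force model output through a schema and only accept a grade when two
# independent models agree; anything else goes to human review.
import json
from pydantic import BaseModel, ValidationError

class ClerkshipGrade(BaseModel):
    course: str
    grade: str

def extract_with_agreement(call_model, prompt: str) -> list[ClerkshipGrade] | None:
    outputs = []
    for model_name in ("model-a", "model-b"):    # two different models; names are placeholders
        raw = call_model(model_name, prompt)     # expected: JSON array of {"course": ..., "grade": ...}
        try:
            outputs.append([ClerkshipGrade.model_validate(item) for item in json.loads(raw)])
        except (ValidationError, json.JSONDecodeError):
            return None                          # malformed output -> route to human review
    # Only trust the extraction when both models produce identical structured data.
    return outputs[0] if outputs[0] == outputs[1] else None
```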
This makes it a slightly different issue from "hallucination" as seen in text-based models. The model (which I think we can assume is GPT-5-mini in this case) is being fed scanned images of PDFs and is incorrectly reading the data from them.
Is this still a hallucination? I've been unable to identify a robust definition of that term, so it's not clearly wrong to call a model misinterpreting a document a "hallucination" even though it feels to me like a different category of mistake to an LLM inventing the title of a non-existent paper or lawsuit.
The open question is how much better they need to get before they can be deployed for situations like this that require a VERY high level of reliability.
Is it? To me it sounds like they do OCR first, then extract from the result with LLM:
"Cortex uses automated extraction (optical character recognition (OCR) and natural language processing (NLP)) to parse clerkship grades from medical school transcripts."
Is the cost of AI worth it if all you're doing is something like 'linting' the extraction? How do you guarantee that people really, truly, are doing the same work as before and not just blindly clicking 'looks good'? What is the value of the AI telling you something when you cannot tell if it is lying?
Is there some legal context in which this phrase has a specific meaning, perhaps?
I tell it a date, like March 2024 as the start, and October 2025 as the current month.
It still thinks that is 7 months somehow... and this is Anthropic's latest model..
1. Minimize the number of PDF pages per context/call. Don't dump a giant document set into one request. Break them into the smallest coherent chunks.
2. In a clean context, re-send the page and the extracted target content and ask the model to proofread/double-check the extracted data.
3. Repeat the extraction and/or the proofreading steps with a different model and compare the results.
4. Iterate until the proofreading passes without altering the data, or flag proofreading failures for stronger models or human intervention (steps 2-4 are sketched below).
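A rough sketch of that loop (every function, model name, and threshold here is a placeholder):

```python
# Steps 2-4 as a loop: proofread the candidate extraction in clean contexts,
# cross-check with a second model, and escalate if it never converges.
def extract_with_proofreading(page_image, extract, proofread,
                              models=("model-a", "model-b"), max_rounds=3):
    data = extract(models[0], page_image)
    for _ in range(max_rounds):
        # Each proofread call sees only the page and the candidate data (clean context).
        reviews = [proofread(m, page_image, data) for m in models]
        if all(review == data for review in reviews):
            return data, "ok"             # every proofreader left the data unchanged
        data = reviews[0]                 # adopt the correction and verify again
    return data, "needs_human_review"     # never converged -> stronger model or human
```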
This is not how this works. You know people will not do this. In fact the whole value proposition hinges on people not doing this. If the information needs to be verified by a human, then it takes more time than just going through the document.
If your product cannot be trusted, then it cannot be used to make important decisions. Pushing the responsibility to not use your product onto the user is absurd and does not make your actions any less negligent.
Then it proceeds into corporate CYA talk: that people should not actually trust this service.
But presumably people trust it, to the extent that it can have some effect, or why would they use it?
> In all reported instances, program directors identified and corrected any discrepancies using the official documents.
Would these instances have been reported, had people not discovered them, by happening to notice a discrepancy?
> Each recruitment season brings its own set of questions, adjustments, and improvements — and this is expected as the community adapts to tools and processes.
Each recruitment season brings its own set of negligent mess-ups, by the company that was entrusted not to mess up? "The community" should "expect" further mess-ups from the company, every season? And "the community" just has to "adapt" to the company's mess-ups?
Many tech startups think little of responsibility, and will just fly by night if their gambles don't pay off. And many techbros live by resume-driven-development and half-baked hype waves.
But can this company, in a medical-adjacent space, meddling in the lifetime aspirations and careers of doctors, get away with that same mindset?
> [...] as the use of Cortex expands.
Is this just optimism? The Thalamus brand name even looks a bit like Theranos.
medicalthrow•4h ago
Regardless, given that there have been a number of posts looking into usage of LLMs for numerical extraction, I thought this story would be useful as a cautionary tale.
EDIT: I put "GPT-5o-mini" in quotes since that was in their methodology...yes, I know the model doesn't exist
philipallstar•4h ago
It's astonishing that places like this will do almost anything rather than create a simple API to ingest data that could easily be pushed automatically.
simonw•3h ago
If all you can get are PDFs, attempting to automatically extract information from those PDFs is a reasonable decision to make. The challenge is doing it well enough to avoid these kinds of show-stopper problems.
aprilthird2021•4h ago
Mind-boggling idea to do this, because OCR and pulling info out of PDFs have been done better and for longer by so many more mature methods than having an LLM do it.
mattnewton•3h ago
I don’t think this idea is totally cursed, I think the implementation is. Instead of using it to shortcut filling in grades that the applicant could spot check, like a resume scraper, they are just taking the first pass from the LLM as gospel.
simonw•3h ago
If all the PDFs are the same format you can use plenty of existing techniques. If you have no control at all over that format you're in for a much harder time, and vision LLMs look perilously close to being a great solution.
Just not the GPT-5 series! My experiments so far put Gemini 2.5 at the top of the pack, to the point where I'd almost trust it for some tasks - but definitely not for something as critical as extracting medical grades that influence people's ongoing careers!
slacktivism123•2h ago
Got it. The non-experts are holding it wrong!
The laymen are told "just use the app" or "just use the website". No need to worry about API keys or routers or wrapper scripts that way!
Sure.
Yet the laymen are expected to maintain a mental model of the failure modes and intended applications of Grok 4 vs Grok 4 Fast vs Gemini 2.5 Pro vs GPT-4.1 Mini vs GPT-5 vs Claude Sonnet 4.5...
It's a moving target. The laymen read the marketing puffery around each new model release and think the newest model is even more capable.
"This model sounds awesome. OpenAI does it again! Surely it can OCR my invoice PDFs this time!"
I mean, look at it:
And on and on it goes...
afro88•1h ago
The solution architect, leads, product managers and engineers that were behind this feature are now laymen who shouldn't do their due diligence on a system to be used for an extremely important task? They shouldn't test this system across a wide range of input PDFs for accuracy and accept nothing below 100%?
simonw•1h ago
We aren't talking about non-experts here. Go read https://www.thalamusgme.com/blogs/methodology-for-creation-a...
They're clearly competent developers (despite mis-identifying GPT-5-mini as GPT-5o-mini) - but they also don't appear to have evaluated the alternative models, presumably because of this bit:
"This solution was selected given Thalamus utilizes Microsoft Azure for cloud hosting and has an enterprise agreement with them, as well as with OpenAI, which improves overall data and model security"
I agree with your general point though. I've been a pretty consistent voice in saying that this stuff is extremely difficult to use.
ajcp•1h ago
I have come to the same conclusion, having built a workflow that has seen 10 million+ non-standardized PDFs (freight bills of lading), with running evaluations as well as evaluation against the initial "ground-truth" dataset of 1,000 PDFs.
Humans: ~65% accurate
Gemini 1.5: ~72% accurate
Gemini 2.0: ~88% accurate
Gemini 2.5: ~92%* accurate
*Funny enough we were getting a consistent 2% improvement with 2.5 over 2.0 (90% versus 88%) until as a lark we decided to just copy the same prompt 10x. Squeezed 2% more out of that one :D
walkabout•2h ago
Like, this company could have done the same projects we've been doing but probably gotten them done faster (and certainly with better performance and lower operational costs) any time in the last 15 years or so. We're doing them now because "we gotta do 'AI'!" so there's funding for it, but they could have just spent less money doing it with OpenCV or whatever years and years ago.
daemonologist•3h ago
The particular challenge here I think is that the PDFs are coming in any flavor and format (including scans of paper) and so you can't know where the grades are going to be or what they'll look like ahead of time. For this I can't think of any mature solutions.
williamdclt•3h ago
> that information is buried in PDFs sent by schools (often not standardized).
I don't think OCR will help you there.
An LLM can help, but _trusting_ it is irresponsible. Use it to help a human quickly find the grade in the PDF, don't expect it to always get it right.
fxwin•1h ago
Edit: Does sound like it - "Cortex uses automated extraction (optical character recognition (OCR) and natural language processing (NLP)) to parse clerkship grades from medical school transcripts."
simonw•1h ago
But... that document also says:
"For machine-readable transcripts, text was directly parsed and normalized without modification. For non-machine-readable transcripts, advanced Optical Character Recognition (OCR) powered by a Large Language Model (LLM) was applied to convert unstructured image-based data into text"
Which makes it sounds like they were using vision-LLMs for that OCR step.
Using a separate OCR step before the LLMs is a lot harder when you are dealing with weird table layouts in the documents, which traditional OCR has usually had trouble with. Current vision LLMs are notably good at that kind of data extraction.
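For reference, the vision-LLM OCR step being described is roughly this shape; a sketch using the OpenAI Python SDK, where the model name, prompt, and HTML output choice are my assumptions rather than anything from their write-up:

```python
# Sketch: send a rendered transcript page to a multimodal model and ask for the
# tables back as HTML (which preserves nested cells better than Markdown).
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ocr_page(png_path: str) -> str:
    with open(png_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-5-mini",  # illustrative; use whatever vision model you've evaluated
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe every table on this page as HTML. Do not invent values."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```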
alexpotato•3h ago
A couple jobs ago at a hedge fund, I owned the system that would take financial data from counterparties, process it and send it to internal teams for reconciliation etc.
The spectrum went from "receive updates via SWIFT (as in financial) protocol" to "small oil trading shop sending us PDFs that are different every month". As you can imagine, the time and effort to process those PDFs occasionally exceeded the revenue from the entire transaction with that counterparty.
As others have pointed out: yes, the overall thrust of the industry is to get to something standardized but 100% adoption will probably never happen.
I write more about the FTP side of things in the Twitter thread below: https://x.com/alexpotato/status/1809579426687983657
jaggederest•1h ago
I'm interested in what the conditions were that didn't let you reject those kinds of transactions, or blacklist them for the future.
We hear about companies firing/banning unprofitable customers sometimes, surprised it doesn't happen more often honestly.
doctorpangloss•1h ago
Don’t tell me the grades should be gathered accurately. Obviously. Tell me something bigger.