> Once confirmed, we corrected the extracted grade immediately.
> Where the extracted grade was accurate, we provided feedback and guidance to the reporting program or school about its interpretation and the extraction methodology.
I still dislike the term "hallucinations". It comes across like the model did something wrong. It did not; factually wrong outputs happen by design.
The issue in the system described in this post is OCR inaccuracy: it's convenient to use LLMs for OCR of PDFs because PDFs do not have standard layouts, and just using the text strings extracted from the PDF's code results in incorrect paragraph/sentence sequencing.
The way they *should* have used RAG is to ensure that subsentence strings extracted via LLM appear in the PDF at all, but it appears they were just trusting the output without automated validation of the OCR.
I could be totally wrong here.
Though I imagine scenarios where the PDF is just an image (e.g. a scan of a form), and thus the validation would not work.
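Where the PDF does have a text layer, a minimal version of that validation could be a plain substring check against the extracted text, something like the sketch below (pypdf and every helper name here are assumptions, not whatever Thalamus actually runs):

```python
# Sketch, not Thalamus's code: check that snippets an LLM claims to have
# extracted actually occur in the PDF's own text layer. Only meaningful
# when the PDF has a text layer (i.e. it is not just a scanned image).
import re
from pypdf import PdfReader

def normalize(s: str) -> str:
    # Collapse whitespace and lowercase so layout differences don't matter.
    return re.sub(r"\s+", " ", s).lower().strip()

def validate_extraction(pdf_path: str, snippets: list[str]) -> dict[str, bool]:
    reader = PdfReader(pdf_path)
    full_text = normalize(" ".join(page.extract_text() or "" for page in reader.pages))
    # Any snippet that never appears verbatim in the source gets flagged for review.
    return {s: normalize(s) in full_text for s in snippets}

# Hypothetical usage:
# validate_extraction("transcript.pdf", ["Internal Medicine", "Honors"])
```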
I do agree with the parent post. Calling them hallucinations is not an accurate way of describing what is happening and using such terms to personify these machines is a mistake.
This isn't to say the outputs aren't useful; we see that they can be very useful... when used well.
... but (a) we do, and (b) there's all kinds of dimensions of factuality not encoded in the training data that can only be unreliably inferred (in the sense that there is no reason to believe the algorithm has encoded a way to synthesize true output from the input at all).
[1] https://en.wikipedia.org/wiki/Hyatt_Regency_walkway_collapse
TL;DR: being wrong, even very wrong != hallucination
can you hear yourself? you are providing excuses for a computer system that produces erroneous output.
He is not saying it's OK for this system to provide wrong answers; he is saying it's normal for information from LLMs to be unreliable, and thus the issue is not coming from the LLM but from the way it is being used.
Eh, I don't think that's a productive thing to say. There's an immense business pressure to deploy LLMs in such decision-making contexts, from customer support, to HR, to content policing, to real policing. Further, because LLMs are improving quickly, there is a temptation to assume that maybe the worst is behind us and that models don't make too many mistakes anymore.
This applies to HN folks too: every other person here is building something in this space. So publicizing failures like this is important, and it's important to keep doing it over and over again so that you can't just say "oh, that was a 3o problem, our current models don't do that".
While I do see the issue with the word hallucination humanizing the models, I have yet to come up with, or see, a word that explains the problem so well to non-technical people. And quite frankly those are the people who need to understand that this problem still very much exists and is likely never going away.
Technically yeah the model is doing exactly what it is supposed to do and you could argue that all of its output is "hallucination". But for most people the idea of a hallucinated answer is easy enough to understand without diving into how the systems work, and just confusing them more.
Calling it a hallucination leads people to think that they just need to stop it from hallucinating.
In layman's terms, it'd be better to understand that LLMs are schizophrenic. Even though that's not really accurate either.
A better framing is that the models only understand reality through what they've read about it, and we then ask them for answers "in their own words", but that's a lot longer than "hallucination".
It's like the gag in The 40-Year-Old Virgin where he describes breasts as feeling like bags of sand.
If a model hallucinates it did do something wrong, something that we would ideally like to minimize.
The fact that it’s impossible to completely get rid of hallucinations is separate.
An electric car uses electricity, it’s a fundamental part of its design. But we’d still like to minimize electricity usage.
I know this sounds pedantic, but I think that the phenomenon itself is very human, so it's fascinating that we created something artificial that is a little bit like another human, and here it goes, producing similar issues. Next thing you know it will have emotions, and judgment.
They actually write it:
> For this cycle, we have refined our model architecture, expanded the catalog of medical schools and grading schemas, and upgraded to include the GPT-5o-mini model for increased accuracy and efficiency. Real-time validation has also been strengthened to provide programs with more reliable percentile and grade distribution data. Together, these enhancements make transcript normalization an even more powerful tool to support fair, consistent, and data-driven review in the transition to residency.
> Make a write-up about it
> Using AI
> Which then hallucinates more stuff
Funny stuff.
>Look inside
>GPT-4o-mini
If a tool is designed to extract the grades for easy access, do we really believe that the end users will then verify the grades manually to confirm the output? If they’re doing that, why use the tool at all?
Maybe the tool can extract what it believes is the grades section and show a screenshot for a human to interpret.
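A sketch of what that could look like, assuming the extraction step also reports which page it read the grade from (pdf2image here is just one way to render the page; none of this is the vendor's actual design):

```python
# Sketch: render the page the extractor claims the grade came from, so a human
# can see the source next to the extracted value before accepting it.
# Assumes pdf2image (and poppler) is installed; page_number is 1-indexed.
from pdf2image import convert_from_path

def render_grade_page(pdf_path: str, page_number: int, out_path: str) -> str:
    images = convert_from_path(pdf_path, dpi=200,
                               first_page=page_number, last_page=page_number)
    images[0].save(out_path)  # format inferred from the file extension, e.g. .png
    return out_path

# A review UI would then show render_grade_page("transcript.pdf", 3, "page3.png")
# alongside the extracted grade and require explicit human confirmation.
```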
Because the people purchasing the tool aren't the ones who will actually use it. The former get a "Deployed AI tooling to X to increase productivity by X%" on their resume. The latter get left to deal with the mess.
So you really have to double check when researching information that really matters.
Source: spouse matched in 2018. It was one of the most stressful periods of our lives.
GPT-5o-whatever ain’t a thing.
The irony is sweeeeet
It is very tempting to do it, obviously, given the cost difference, but it's not worth it. On the other hand, people talk about LLMs with a broad brush, and, while it would still need testing, I would be surprised to hear that GPT-5-pro with thinking had an issue like this.
This is why it is important that users (us) don't fall into the anthropomorphism trap and call programs what they are and describe what they really do. Especially important since the general populace seems to be deluded by OpenAI's and Anthropic's aggressive lies and believes that LLMs can think.
That's not a real model name: there's GPT-5-mini and GPT-4o-mini but no GPT-5o-mini.
UPDATE: Here's where the GPT-5o-mini came from: https://www.thalamusgme.com/blogs/methodology-for-creation-a... - via this comment: https://news.ycombinator.com/item?id=45581030
That said, I've been disappointed by OCR performance from the GPT-5 series. I caught it hallucinating some of the content for a pretty straight-forward newspaper scan a few weeks ago: https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-cod...
Gemini 2.5 is much more reliable for extracting text from images in my experience.
Transcript data is usually found in some sort of table, but they're some of the hardest tables for OCR or LLMs to interpret. There's all kinds of edge cases with tables split across pages, nested cells, side-by-side columns, etc. The tabular layout breaks every off-the-shelf OCR engine we've run across (and we've benchmarked all of them). To make it worse, there's no consistency at all (every school in the country basically has their own format).
What we've seen help in these cases are:
1. VLM based review and correction of OCR errors for tables. OCR is still critical for determinism, but VLMs really excel at visually interpreting the long tail.
2. Using both HTML and Markdown as an LLM input format. For some of the edge cases, markdown cannot represent certain structures (e.g. a table cell nested within a table cell). HTML is a much better representation for this, and models are trained on a lot of HTML data.
The data ambiguity is a whole set of problems on its own (e.g. how do you normalize what a "semester" is across all the different ways it can be written). Eval sets + automated prompt engineering can get you pretty far though.
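To illustrate the eval-set side of that (all function names and the label format below are hypothetical, not Extend's tooling):

```python
# Tiny eval harness sketch: run an extraction pipeline over labelled transcripts
# and report field-level accuracy, so prompt changes can be compared objectively.
def evaluate(extract_grades, labelled_cases: list[dict]) -> float:
    correct = total = 0
    for case in labelled_cases:
        predicted = extract_grades(case["pdf_path"])   # e.g. {"Internal Medicine": "Honors"}
        for course, expected in case["expected"].items():
            total += 1
            correct += int(predicted.get(course) == expected)
    return correct / total if total else 0.0

# print(f"field accuracy: {evaluate(my_pipeline, ground_truth_cases):.1%}")
```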
Disclaimer: I started a LLM doc processing company to help companies solve problems in this space (https://extend.ai/).
Did you do some kind of zod schema, or compare the error rate of how different models perform for this task? Did you bother setting up any kind of JSON output at all? Did you add a second validation step with a different model and then compare whether their numbers are the same?
It looks like no; they just deferred the whole thing to authority. Technically there's no difference between them saying that GPT-5-mini or Llama-2-7B did this.
Literally every single LLM will make errors and hallucinate. It's your job to put all the scaffolding around it to make sure it doesn't, or that it does so a lot less than a skilled human would.
So then have you measured the error rate or maybe tried to put some kind of error catching mechanism just like any professional software would do?
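For what it's worth, a minimal Python sketch of that kind of scaffolding, with pydantic standing in for a zod schema (call_model and the model names are placeholders, not anything Thalamus has described):

```python
# Sketch: force model output through a schema and only accept a grade when two
# independent models agree; anything else goes to human review.
import json
from pydantic import BaseModel, ValidationError

class ClerkshipGrade(BaseModel):
    course: str
    grade: str

def extract_with_agreement(call_model, prompt: str) -> list[ClerkshipGrade] | None:
    outputs = []
    for model_name in ("model-a", "model-b"):    # two different models; names are placeholders
        raw = call_model(model_name, prompt)     # expected: JSON array of {"course": ..., "grade": ...}
        try:
            outputs.append([ClerkshipGrade.model_validate(item) for item in json.loads(raw)])
        except (ValidationError, json.JSONDecodeError):
            return None                          # malformed output -> route to human review
    # Only trust the extraction when both models produce identical structured data.
    return outputs[0] if outputs[0] == outputs[1] else None
```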
This makes it a slightly different issue from "hallucination" as seen in text-based models. The model (which I think we can assume is GPT-5-mini in this case) is being fed scanned images of PDFs and is incorrectly reading the data from them.
Is this still a hallucination? I've been unable to identify a robust definition of that term, so it's not clearly wrong to call a model misinterpreting a document a "hallucination" even though it feels to me like a different category of mistake to an LLM inventing the title of a non-existent paper or lawsuit.
The open question is how much better they need to get before they can be deployed for situations like this that require a VERY high level of reliability.
Is it? To me it sounds like they do OCR first, then extract from the result with LLM:
"Cortex uses automated extraction (optical character recognition (OCR) and natural language processing (NLP)) to parse clerkship grades from medical school transcripts."
Is the cost of AI worth it if all you're doing is something like 'linting' the extraction? How do you guarantee that people really, truly, are doing the same work as before and not just blindly clicking 'looks good'? What is the value of the AI telling you something when you cannot tell if it is lying?
Is there some legal context in which this phrase has a specific meaning, perhaps?
I tell it a date, like March 2024 as the start, and October 2025 as the current month.
It still thinks that is 7 months somehow... and this is Anthropic's latest model..
1. Minimize the number of PDF pages per context/call. Don't dump a giant document set into one request. Break them into the smallest coherent chunks.
2. In a clean context, re-send the page and the extracted target content and ask the model to proofread/double-check the extracted data.
3. Repeat the extraction and/or the proofreading steps with a different model and compare the results.
4. Iterate until the proofreading passes without altering the data, or flag proofreading failures for stronger models or human intervention (steps 2-4 are sketched below).
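A rough sketch of that loop (every function, model name, and threshold here is a placeholder):

```python
# Steps 2-4 as a loop: proofread the candidate extraction in clean contexts,
# cross-check with a second model, and escalate if it never converges.
def extract_with_proofreading(page_image, extract, proofread,
                              models=("model-a", "model-b"), max_rounds=3):
    data = extract(models[0], page_image)
    for _ in range(max_rounds):
        # Each proofread call sees only the page and the candidate data (clean context).
        reviews = [proofread(m, page_image, data) for m in models]
        if all(review == data for review in reviews):
            return data, "ok"             # every proofreader left the data unchanged
        data = reviews[0]                 # adopt the correction and verify again
    return data, "needs_human_review"     # never converged -> stronger model or human
```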
This is not how this works. You know people will not do this. In fact the whole value proposition hinges on people not doing this. If the information needs to be verified by a human, then it takes more time than just going through the document.
If your product cannot be trusted, then it cannot be used to make important decisions. Pushing the responsibility to not use your product onto the user is absurd and does not make your actions any less negligent.
Then it proceeds into corporate CYA talk: that people should not actually trust this service.
But presumably people trust it, to the extent that it can have some effect, or why would they use it?
> In all reported instances, program directors identified and corrected any discrepancies using the official documents.
Would these instances have been reported, had people not discovered them, by happening to notice a discrepancy?
> Each recruitment season brings its own set of questions, adjustments, and improvements — and this is expected as the community adapts to tools and processes.
Each recruitment season brings its own set of negligent mess-ups, by the company that was entrusted not to mess up? "The community" should "expect" further mess-ups from the company, every season? And "the community" just has to "adapt" to the company's mess-ups?
Many tech startups think little of responsibility, and will just fly by night if their gambles don't pay off. And many techbros live by resume-driven-development and half-baked hype waves.
But can this company, in a medical-adjacent space, meddling in the lifetime aspirations and careers of doctors, get away with that same mindset?
> [...] as the use of Cortex expands.
Is this just optimism? The Thalamus brand name even looks a bit like Theranos.
medicalthrow•4h ago
Regardless, given that there have been a number of posts looking into usage of LLMs for numerical extraction, I thought this story would be useful as a cautionary tale.
EDIT: I put "GPT-5o-mini" in quotes since that was in their methodology...yes, I know the model doesn't exist
philipallstar•4h ago
It's astonishing that places like this will do almost anything rather than create a simple API to ingest data that could easily be pushed automatically.
simonw•3h ago
If all you can get are PDFs, attempting to automatically extract information from those PDFs is a reasonable decision to make. The challenge is doing it well enough to avoid these kinds of show-stopper problems.
aprilthird2021•4h ago
Mind-boggling idea to do this, because OCR and pulling info out of PDFs have been done better and for longer by so many more mature methods than having an LLM do it.
mattnewton•3h ago
I don’t think this idea is totally cursed, I think the implementation is. Instead of using it to shortcut filling in grades that the applicant could spot check, like a resume scraper, they are just taking the first pass from the LLM as gospel.
simonw•3h ago
If all the PDFs are the same format you can use plenty of existing techniques. If you have no control at all over that format you're in for a much harder time, and vision LLMs look perilously close to being a great solution.
Just not the GPT-5 series! My experiments so far put Gemini 2.5 at the top of the pack, to the point where I'd almost trust it for some tasks - but definitely not for something as critical as extracting medical grades that influence people's ongoing careers!
slacktivism123•2h ago
Got it. The non-experts are holding it wrong!
The laymen are told "just use the app" or "just use the website". No need to worry about API keys or routers or wrapper scripts that way!
Sure.
Yet the laymen are expected to maintain a mental model of the failure modes and intended applications of Grok 4 vs Grok 4 Fast vs Gemini 2.5 Pro vs GPT-4.1 Mini vs GPT-5 vs Claude Sonnet 4.5...
It's a moving target. The laymen read the marketing puffery around each new model release and think the newest model is even more capable.
"This model sounds awesome. OpenAI does it again! Surely it can OCR my invoice PDFs this time!"
I mean, look at it:
And on and on it goes...
afro88•1h ago
The solution architect, leads, product managers and engineers that were behind this feature are now laymen who shouldn't do their due diligence on a system to be used for an extremely important task? They shouldn't test this system across a wide range of input PDFs for accuracy and accept nothing below 100%?
simonw•1h ago
We aren't talking about non-experts here. Go read https://www.thalamusgme.com/blogs/methodology-for-creation-a...
They're clearly competent developers (despite mis-identifying GPT-5-mini as GPT-5o-mini) - but they also don't appear to have evaluated the alternative models, presumably because of this bit:
"This solution was selected given Thalamus utilizes Microsoft Azure for cloud hosting and has an enterprise agreement with them, as well as with OpenAI, which improves overall data and model security"
I agree with your general point though. I've been a pretty consistent voice in saying that this stuff is extremely difficult to use.
ajcp•1h ago
I have come to the same conclusion, having built a workflow that has seen 10 million+ non-standardized PDFs (freight bills of lading), with running evaluations as well as evaluation against the initial "ground-truth" dataset of 1,000 PDFs.
Humans: ~65% accurate
Gemini 1.5: ~72% accurate
Gemini 2.0: ~88% accurate
Gemini 2.5: ~92%* accurate
*Funny enough we were getting a consistent 2% improvement with 2.5 over 2.0 (90% versus 88%) until as a lark we decided to just copy the same prompt 10x. Squeezed 2% more out of that one :D
walkabout•2h ago
Like, this company could have done the same projects we've been doing but probably gotten them done faster (and certainly with better performance and lower operational costs) any time in the last 15 years or so. We're doing them now because "we gotta do 'AI'!" so there's funding for it, but they could have just spent less money doing it with OpenCV or whatever years and years ago.
daemonologist•3h ago
The particular challenge here I think is that the PDFs are coming in any flavor and format (including scans of paper) and so you can't know where the grades are going to be or what they'll look like ahead of time. For this I can't think of any mature solutions.
williamdclt•3h ago
> that information is buried in PDFs sent by schools (often not standardized).
I don't think OCR will help you there.
An LLM can help, but _trusting_ it is irresponsible. Use it to help a human quickly find the grade in the PDF, don't expect it to always get it right.
fxwin•1h ago
Edit: Does sound like it - "Cortex uses automated extraction (optical character recognition (OCR) and natural language processing (NLP)) to parse clerkship grades from medical school transcripts."
simonw•1h ago
But... that document also says:
"For machine-readable transcripts, text was directly parsed and normalized without modification. For non-machine-readable transcripts, advanced Optical Character Recognition (OCR) powered by a Large Language Model (LLM) was applied to convert unstructured image-based data into text"
Which makes it sounds like they were using vision-LLMs for that OCR step.
Using a separate OCR step before the LLMs is a lot harder when you are dealing with weird table layouts in the documents, which traditional OCR has usually had trouble with. Current vision LLMs are notably good at that kind of data extraction.
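For reference, the vision-LLM OCR step being described is roughly this shape; a sketch using the OpenAI Python SDK, where the model name, prompt, and HTML output choice are my assumptions rather than anything from their write-up:

```python
# Sketch: send a rendered transcript page to a multimodal model and ask for the
# tables back as HTML (which preserves nested cells better than Markdown).
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ocr_page(png_path: str) -> str:
    with open(png_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-5-mini",  # illustrative; use whatever vision model you've evaluated
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe every table on this page as HTML. Do not invent values."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```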
alexpotato•3h ago
A couple jobs ago at a hedge fund, I owned the system that would take financial data from counterparties, process it and send it to internal teams for reconciliation etc.
The spectrum went from "receive updates via SWIFT (as in financial) protocol" to "small oil trading shop sending us PDFs that are different every month". As you can imagine, the time and effort to process those PDFs occasionally exceeded the revenue from the entire transaction with that counterparty.
As others have pointed out: yes, the overall thrust of the industry is to get to something standardized but 100% adoption will probably never happen.
I write more about the FTP side of things in the Twitter thread below: https://x.com/alexpotato/status/1809579426687983657
jaggederest•1h ago
I'm interested in what the conditions were that didn't let you reject those kinds of transactions, or blacklist them for the future.
We hear about companies firing/banning unprofitable customers sometimes, surprised it doesn't happen more often honestly.
doctorpangloss•1h ago
Don’t tell me the grades should be gathered accurately. Obviously. Tell me something bigger.