Superhuman performance of an LLM on the reasoning tasks of a physician

https://arxiv.org/abs/2412.10849
34•amichail•21h ago

Comments

spwa4•19h ago
I don't understand what the aim is here. LLMs have disadvantages compared to human doctors that make them a really, really bad option.

1) they can't take measurements themselves.

2) they don't adapt on the job. Illnesses do. In other words, if there is a contagious health emergency, an LLM would see the patients ... and ignore the emergency.

3) they are very bad at figuring out if a patient is lying to them (which is a required skill: combined with 2, people would figure out how to get the LLM to prescribe them morphine and ...)

4) they are generally socially problematic. A big part of being a doctor is gently convincing a patient their slightly painful toe does not in fact justify a diagnosis of bone cancer ... WITHOUT doing tests (that would be unethical, as there's zero chance of those tests yielding positive results)

5) they will not adapt to people. LLMs will not adapt, people will. This means patients will exploit LLMs to achieve a whole bunch of aims (like getting drugs, getting days off, getting free hospital stays, ...) and it doesn't matter how good LLMs are. An adaptive system vs a non-adaptive system ... it's a matter of time.

6) they are not themselves patients. This is a fundamental problem: it will be very hard for an LLM to collect new information about "the human condition" and new problems it may generate. There are many examples of this, from patients drinking radium solution (it lights up in the dark, so surely it must give extra energy, right? Even sexual energy, right?) to rivers or ponds that turn out to have serious diseases lurking around. Meaning a doctor needs to be able to make the decision to go after problems in society when society finds a new, catastrophically dumb, way to hurt itself.

Now you might say "but they would still be good in the developing world, wouldn't they?". Yes, but as the tuberculosis vaccine efforts sadly showed: the developing world is developing partially because they invest nothing whatsoever in (poor) people's health. Nothing. Zero. Rien. Which means making health services cheaper (e.g. providing a cheap tuberculosis vaccine) ... has the problem that it does not increase the value of zero. They won't pay for healthcare ... and they won't pay for cheaper healthcare. And while Bill Gates and the US government do pay for a bit of this, they're not sustainable solutions. If, however, you train a local with basic medical skills, there's a lot they can do for free, which actually helps.

timschmidt•18h ago
3 and 4 can be highly problematic behaviors in doctors. Patients who have real medical issues are often ignored, scolded, or otherwise denied treatment because of a doctor's perception.
derbOac•18h ago
5 is also something that happens sometimes with physicians and healthcare generally anyway, and my guess is it could be trained into LLMs.

1 is often (usually?) not done today by physicians per se anyway.

2 is kind of a strawman about LLMs.

6 is maybe the most challenging critique, but it is also kind of an empirical one, in the sense that if LLMs routinely outperform physicians in decision making (at least under certain circumstances) it will be hard to make the case that it matters.

I have my biases but in general I think at least in the US there needs to be a serious rethinking about how medical decisions can be made and how care can be provided.

I'm skeptical about this paper — the real test will be something like widespread preregistered replication across a wide variety of care settings — but that would happen anyway before it would be adopted. If it works it works and if it doesn't it won't.

My guess is under the best of circumstances it won't get rid of humans, it will just change what they're doing and maybe who is doing that.

spwa4•7h ago
First, I think you misunderstand points 3 and 4. The point there is that LLMs are incapable of them (because they don't learn on the job, and that's not the only problem: they also aren't human and thus lack perspective, hallucination would have catastrophic outcomes here, and they lack empathy). But without learning on the job, they simply cannot recognize large-scale problems. It's not that they sometimes miss them. We DO NOT know everything, and analyzing what is actually happening, including radically changing approach on the job, is a necessity, not something optional.

Empathy, by the way, is something that happens between 2 people. It is not "inside" one person, and it goes in both directions. So that can't be fixed. Empathy towards you only works if you honestly believe that the other person is genuinely worried about you (and, of course, empowered, able and willing to do something about your situation). If you start out with an LLM, you start out 100% convinced (and correct in that, btw) that they aren't worried about you. So it won't work. That even works in reverse. Patients try to deceive doctors ... but within reason, because there's empathy the other way too. When patients have to deceive LLMs instead for a week off from work, there will be zero empathy, zero shame and zero limitations on behavior from the patient side. This is not solvable. Hell, it'll be a problem of trust. A doctor is trying to help you ... probably ... with maybe 10% helping the company. An LLM is 100% trying to help the company ... do you take the medicine (or accept nothing's wrong)? Is it what's best for you? Let's face it: the whole point of LLM medicine is that it's not what's best for you.

Also ... so your critique is that even doctors sometimes cater to interests that are not the patients' best interest? Ok. True.

But you do realize that if you apply that as a critique of LLMs, which are corporate controlled, it's going to be 1000x worse? So I don't understand the critique. Yes, doctors aren't perfect and are flawed in many ways. That is not a good reason to introduce something 1000x worse. The whole point of LLM medicine is stealing the profit!

timschmidt•5h ago
I understand fully, and disagree strongly.

> [LLMs] don't learn on the job

Funny, the ones I work with get updated with new information all the time. No reason it couldn't happen after every single encounter.

> Empathy, by the way, is something that happens between 2 people.

People will anthropomorphize anything and that includes empathizing with a rock. A rock cannot empathize back. QED.

> If you start out with an LLM, you start out 100% convinced (and correct in that, btw) that they aren't worried about you. So it won't work.

See above. Your point is also undermined by the sheer volume of people currently using LLMs as therapists.

> Patients try to deceive doctors ... but within reason, because there's empathy the other way too. When patients have to deceive LLMs instead for a week off from work, there will be zero empathy, zero shame and zero limitations on behavior that from the patient side.

You seem to think people have limitations on their behavior now. I assure you they do not. Talk to a doctor or social worker about it. It wears them out. LLMs don't get worn out and are much better at calmly and kindly interacting for as long as someone needs.

> A doctor is trying to help you ... probably ... with maybe 10% helping the company. An LLM is 100% trying to help the company

Some doctors are. Some are more worried about their next golf game or Maserati. You are fully pessimistic about LLMs but incredibly naive about people.

> Let's face it: the whole point of LLM medicine is that it's not what's best for you.

Hard disagree. The last time I was in the hospital for a life-threatening condition (pancreatitis caused by gallstones), I got to talk to a doctor for 5 minutes. I educated myself about my condition thanks to Wikipedia and medical journals. An LLM would have been incredibly helpful.

> do realize that if you apply that as a critique of LLMs, that are corporate controlled, it's going to be 1000x worse?

Good thing there are academic, open source, open weight, jailbroken, fine tuned, local LLMs I can run myself, and I can even cross-check between multiple competing models and models from different countries with entirely different economic and political systems, at nearly no cost, for additional assurance.
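
Cross-checking locally is only a few lines, too; a sketch assuming the ollama Python client is installed and both of the named open-weight models have already been pulled:

    # Sketch: ask two local open-weight models the same question and compare.
    # Assumes the `ollama` Python client; the model names are examples.
    import ollama

    question = "What are common causes of acute pancreatitis?"
    for model in ("llama3.1", "mistral"):
        reply = ollama.chat(model=model,
                            messages=[{"role": "user", "content": question}])
        print(f"--- {model} ---")
        print(reply["message"]["content"])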

> So I don't understand the critique. Yes doctors aren't perfect, flawed in many ways.

And incredibly expensive, unavailable without appointment sometimes months or years in advance, incredibly limited when it comes to any specialization, sometimes distracted, exhausted, grumpy, dismissive, disagreeable, and subject to all manner of human failings.

I'm not sure what Black Mirror episode you're drawing your outlook on life from, but it bears no resemblance to the reality I live in every day.

I don't think doctors will be wholesale replaced by LLMs.

I am already seeing LLMs lowering barriers to acquiring first-line medical advice and second opinions.

Will dystopic things happen? Sure, they're already happening every day without any help from LLMs. That'll continue into the foreseeable future. But having an LLM to ask for second opinions might have saved my Dad's life, and would have allowed me to better educate myself about my own life threatening condition. Good luck prying them out of my hands.

msgodel•5h ago
>Funny, the ones I work with get updated with new information all the time. No reason it couldn't happen after every single encounter.

What? Do you not know how any of this works at all?

Updating LLM weights means calculating the gradients and backpropagating. And that's only for unsupervised learning; for the reinforcement learning you want, the gradients need to be accumulated until you have some way to score the episode. For usefully large language models there are serious logistical problems doing this at all; it's one of the most computationally intensive tasks you can currently use computers for. It certainly can't be done for every interaction.
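
For a sense of scale, here is a minimal sketch of a single supervised weight update on a small causal LM, assuming PyTorch and Hugging Face transformers ("gpt2" and the example text are stand-ins). Even this one step is a full forward and backward pass over every parameter, and production models are orders of magnitude larger:

    # Minimal sketch of one supervised fine-tuning step on a causal LM.
    # Assumes PyTorch + Hugging Face transformers; model and text are stand-ins.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    batch = tok("Patient presents with acute epigastric pain.", return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])  # forward pass computes the loss

    out.loss.backward()   # backpropagation: a gradient for every parameter
    optimizer.step()      # the actual weight update
    optimizer.zero_grad()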

timschmidt•5h ago
Yeah, I understand. I've also been involved in computer science for 30 years. A lot has changed over that time. 20 doublings. A 1,000,000 fold increase in capability. A lot will change in the next 30 years. Expand your horizon.

Also, it doesn't take a full re-training to add someone's medical history and even recent events like pandemics to the system prompt.
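
As a rough sketch of that kind of context injection (every patient detail and advisory below is a made-up placeholder), the "adaptation" is just string assembly before the model is ever called:

    # Sketch: per-encounter context injected into the system prompt.
    # All patient details and advisories are hypothetical placeholders.
    def build_system_prompt(patient_history, public_health_advisories):
        lines = ["You are a clinical decision-support assistant."]
        lines.append("Current public health advisories:")
        lines.extend("- " + a for a in public_health_advisories)
        lines.append("Relevant patient history:")
        lines.extend("- " + h for h in patient_history)
        lines.append("Flag any findings consistent with the advisories above.")
        return "\n".join(lines)

    system_prompt = build_system_prompt(
        patient_history=["type 2 diabetes", "penicillin allergy"],
        public_health_advisories=["regional measles outbreak reported this week"],
    )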

CityOfThrowaway•18h ago
#2 and #3 are both just engineering problems at this point.

The foundation models don't adapt quickly, but you can definitely build systems to inject context that changes behaviors.

And if you build that system intentionally and correctly, then it's handled for all patients. With human doctors, each individual doctor has to be fed context and change their behavior based on the information, which is stochastic to say the least.
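
As a toy illustration of "handled for all patients": one centrally maintained context string, prepended by a wrapper, changes behavior for every encounter at once (the model call and the advisory text are placeholders):

    # Sketch: one shared, centrally updated context applied to every encounter.
    # `ask_model` stands in for whatever LLM call the system actually uses.
    CLINIC_WIDE_CONTEXT = "Advisory (hypothetical): screen febrile patients for measles."

    def triage(ask_model, encounter_notes):
        prompt = CLINIC_WIDE_CONTEXT + "\n\nEncounter:\n" + encounter_notes
        return ask_model(prompt)

    # Updating CLINIC_WIDE_CONTEXT once changes behavior for every subsequent
    # patient, with no per-clinician briefing required.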

howlin•18h ago
A lot of these problems are already managed by nurses or clinic assistants. It's pretty rare to get a lot of face to face time with an actual M.D. Certainly this is true the more you look at poorer communities.
doug_durham•18h ago
They aren't talking about replacing doctors. It's only about LLMs' ability to do diagnosis, which is a part of being a doctor.
inopinatus•18h ago
They chose misleadingly hyperbolic language for their title and abstract. The "discussion" section is then similarly loose with meaning and overwrought claims.
fnordpiglet•17h ago
I feel like this ignores how LLMs work.

1) Of course not; they would be fed information. But as we build multimodal models that can achieve more and more world integration, there's no reason why not.

2) They're very adaptive; by their abductive nature they adapt extraordinarily well to new situations. Perhaps too much, hence the challenge with hallucinations.

3) This isn't necessarily true, as can be seen from modern alignment in SOTA models becoming more and more difficult to evade. When prompted and aligned with training on drug-seeking behavior, why would you assume they're bad at detecting it?

4) Again, I don't see why this is true. A general purpose LLM might be, but one that's been aligned properly should do fine.

5) Why do you think LLMs are not adaptive? They adapt through reinforcement and alignment. As a larger corpus of interactions becomes available, they adapt and align towards the training goals. There is extensive research and experience in alignment to date, and models are often continuously adapted. You don't need to retrain the entire base model; you can just retrain a LoRA or embeddings. You can even adapt to specific situations by dynamically pulling in a LoRA or embedding set for the situation (see the sketch after this list).

6) They have human-like responses to human situations because they're trained on a corpus of human language. For a highly specialized model you can ensure specific types of human experience and behavior are well represented and reinforced. You can align the behavior to be what you need.
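
A minimal sketch of the LoRA route from point 5, assuming the Hugging Face transformers and peft libraries and a small stand-in base model; only the tiny adapter matrices get trained, the base weights stay frozen:

    # Sketch: attach a small LoRA adapter instead of retraining the base model.
    # Assumes Hugging Face `transformers` + `peft`; "gpt2" is a stand-in model.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("gpt2")
    lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["c_attn"])  # GPT-2's attention projection
    model = get_peft_model(base, lora)
    model.print_trainable_parameters()  # only the adapter weights are trainable

    # Fine-tune `model` on new interactions, then save just the small adapter:
    # model.save_pretrained("clinic-adapter-example")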

All this said, I don't think anyone here is proposing to take humans entirely out of the loop. But there are many situations where ML models or even heuristics outperform human experts in their own field. There's no reason to believe LLMs, especially when augmented with diagnostic expert system agents, couldn't generally outperform a doctor in diagnosis. This doesn't mean the human doctor is irrelevant, but that their skills are enhanced and patient outcomes improve with the help of such systems.

Regardless though I feel these criticisms of the approach reflect a naïveté about the ways these models work and what they’re capable of.

timschmidt•5h ago
Well said.
gpt5•18h ago
There are two very interesting results here:

1. ChatGPT o1 significantly outperformed any combination of doctor + resources (median score of 86% vs 34%-42% for doctors). Hence superhuman results (at least compared against average physicians).

2. ChatGPT + Doctor performs worse than just ChatGPT alone.

This means the situation is getting similar to chess, where adding Magnus Carlsen as a helper to Stockfish (a strong open source chess engine) could only make Stockfish worse.

thbb123•18h ago
Algorithm aversion and automation biases have been thoroughly studied over the past 70 years of human factors for industrial security. All in all, the thought processes of humans are not always compatible with the evidence on which automation works.

Check out Fitts, HABA-MABA for more results.

inopinatus•18h ago
The situation is more akin to a much earlier situation in chess, from 1997, in that Deep Blue could only beat Kasparov with a dedicated team of IBM engineers and GM consultants revising the code between matches, and it still needed a human to interact with the actual chessboard.

We remain a very long way from “ChatGPT will see you now”.

In the meantime, in the real world, I suspect the infamous "Dr Google" is being supplanted by "Dr LLM". It will be difficult to ethically study whether even this leads to generally better patient outcomes.

_________

edit: clarity

htrp•17h ago
> I suspect the infamous "Dr Google" is being supplanted by "Dr LLM".

Absolutely.

inopinatus•18h ago
Thoroughly proves that with cherry-picked examples and careful prompt engineering, you too can ask for more funding for your next paper.
bdbenton5255•18h ago
As a pure dictionary of knowledge for gathering symptoms and performing diagnoses, it should be obvious that LLMs can do this more efficiently.

As for everything else, as pointed out, these programs are insufficient. As with programmers and other white collar professions it seems ideal to integrate these tools into the workplace rather than try and replace the human completely.

Businesspeople probably dream of huge profits from replacing their workforce with AI models, and the marketers and proprietors of AI are likely to overpromise what their products can do, as is the SV tradition: promise the moon in order to extract maximum funding.

adt•18h ago
Tale as old as time.

https://docs.google.com/spreadsheets/d/1kc262HZSMAWI6FVsh0zJ...

Lazarus_Long•17h ago
In general, my smoke test for this kind of thing is whether the company (or whatever) gladly accepts full liability for the AI usage.

Cases like:

- The AI replaces a salesperson, but the sales are not binding or final, in case the client gets a bargain at $0 from the chatbot.

- It replaces drivers, but it disengages 1 second before hitting a tree to blame the human.

- Support wants you to press cancel so the reports say "client cancel" and not "self drive is doing laps around a patch of grass".

- AI is better than doctors at diagnosis, but in any case of misdiagnosis the blame is shifted to the doctor because "AI is just a tool".

- AI is better at coding than old meat devs, but when the unmaintainable security hole goes to production, the downtime and breaches cannot be blamed on the AI company producing the code; it was the old meat devs' fault.

AI companies want to have the cake and eat it too. Until I see them eating the liability, I know, and I know they know, it's not ready for the things they say it is.

odyssey7•17h ago
Most doctors have insurance for covering their mistakes. We might expect an AI medical startup to pay analogous premiums when it’s paid analogous fees.
treetalker•17h ago
Exactly: skin in the game, and to underscore the point, make any debt non-dischargeable in bankruptcy.
ncgl•16h ago
Jesus they're calling us meat devs?
pragmatic•16h ago
Reminds me of the assassin droid in KOTOR2 that called everyone meatbags.

We're getting there!

OutOfHere•16h ago
That's completely missing the point. The LLM scored substantially higher than the clinicians. Statistically, this means the clinicians will have many more misdiagnoses.

The point is that clinicians don't really get sued for misdiagnoses most of the time anyway. With AI, all one has to do is open a new chat, tell the AI that its last diagnosis isn't really helping, and it will eagerly give an updated assessment. Compared to a clinician, the AI dramatically lowers the bar for iteratively working with it to address an issue.

As for drug prescriptions, they are to be processed through an interactions checker anyway.

inopinatus•16h ago
If you tell an LLM that its last effort was bad, it won't give you a better outcome. It will get worse at whatever you asked for.

The reason is simple. They are trained as plausibility engines. It's more plausible that a bad diagnostician gives you a worse outcome than a good one, and you have literally just prompted it that it's bad at diagnosis.

Sure, you might get another text completion. Will it be correct, actionable, reliable, safe? Even a stopped clock. Good luck rolling those dice with your health.

In summary, do not iterate with prompts for declining competence.

OutOfHere•15h ago
No, that's a gross frequentist assessment. In reality, the Bayesian assessment is contingent on the first response not helping, and is therefore more likely to be correct, not less. The second response is a conditional response that benefits from new information provided by the user. Accordingly, it's very possible that the LLM will suggest further diagnostic tests to sort out the situation. The same technique also works for code reviews, with stunning effect.
inopinatus•15h ago
This recommendation isn't about prompts that include notes of "what didn't work". I'm talking about prompts that directly inform the model, "you are modelling an idiot".

The former is reasonable to include when iterating. The latter is a recipe for outcome degradation. GP above gave the latter form. That activates attention from parts of the model guiding towards confabulation and loss of faithfulness.

The model doesn't know what is true, only what is plausible to emit. The hypothesis that plausibility converges with scale towards truth and faithfulness remains very far from proven. Bear in mind that the training data includes large swathes of arbitrary text from the Internet, real life, and fiction, which includes plenty of examples of people being wrong, stupid, incompetent, repetitive, whimsical, phony, capricious, manipulative, disingenuous, argumentative, and mendacious. In the right context these are plausible human-like textual interactions, and the only things really holding it back from completion in such directions are careful training and the system prompt. Worst case scenario, perhaps the corpus included parliamentary proceedings from around the world. "Suppose you were an idiot. And suppose you were a member of Congress. But I repeat myself." - Mark Twain
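
To make the distinction concrete, a small illustration (both follow-up prompts are invented):

    # Illustrative follow-up prompts; both are invented examples.
    # Iterating with specifics about what didn't work:
    follow_up_ok = ("The previously suggested amoxicillin course did not resolve "
                    "the fever after 5 days. What should be considered next?")
    # Framing the model as incompetent, which invites degraded completions:
    follow_up_bad = "Your last diagnosis was useless. Try again."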

_alternator_•15h ago
The obvious next step is not that LLMs replace doctors; it's that LLMs become part of the 'standard of care', a component of the triage process. You go to the emergency room, and an LLM assessment becomes routine, if not required. This study shows that doing that would significantly increase accurate diagnoses, for a start. Everyone wins.
