Feels to me like how "prompt engineering" got a bunch of hype in tech companies two years ago and is now basically nonexistent, because the models began being trained and prompted specifically to mimic "reasoning" for the sorts of questions tech company users had. Seems like that has not translated to reasoning their way through the sort of health conversation a non-medical-professional would initiate.
https://github.com/am-bean/HELPMed (also linked in the paper)
This study tells us that LLM assistance is as good as no assistance, but any investigation of the cause feels tainted by the fact that we don't know how much a human would have helped.
If we believe the assertion that LLMs are on a similar level as doctors at finding the conditions on their own, does the issue appear in the description the humans give the LLM, the way the LLM talks to the human, or the way the human receives the LLM's suggestions? Looking at the chat transcripts, they seem to identify issues with all three, but there isn't really a baseline for what we would consider "good" performance.
I'd love to see future work investigating:
- how does this compare to expert users (doctors / LLM magicians using LLMs to self-diagnose)?
- LLMs often provide answers faster than doctors, and often with less hassle (what's your insurance?); to what extent does latency impact healthcare outcomes?
- do study participants exhibit similar follow-on behavior (upcoding, seeking a second opinion, doctors) to others in the same professional discipline?
You're conflating a person trained in a craft (medicine) with a person good at asking a next-token generator (anybody) and sussing out its answers, as if that were a given. It's not.
> “There is also a reason why clinicians who deal with patients on the front line are trained to ask questions in a certain way and a certain repetitiveness,” Volkheimer goes on. Patients omit information because they don’t know what’s relevant, or at worst, lie because they’re embarrassed or ashamed.
In order for an LLM to really do this task the right way (comparable to a physician), it needs to not only use what the human gives it but also be effective at extracting the right information from the human. The human might not know what is important, or might be disinclined to share, and physicians learn to overcome this. However, that isn't what happened in this study: participants were diagnosing a made-up scenario where the symptoms were clearly presented to them, and they had no incentive to lie or withhold embarrassing symptoms since none of it was actually happening to them. And yet the failure still occurred: participants did not effectively communicate all the necessary information.
That's one of the main differences between mediocre and incredible engineers: being able to figure out what problem actually needs to be solved, rather than working on whatever a stakeholder asks them to build.
What's the current processor temperature, EPS12V voltage, and ripple peaks if you have an oscilloscope? Could you paste cpuinfo? Have you added or removed RAM or a PCIe device recently? Does the chassis smell and look normal, with no billowing smoke, screeching noise, or fire?
Good LLMs might start asking these questions soon, but you wouldn't supply this information at the beginning of the interaction (and it's always the PSU).
That's true for most use cases, especially coding.
In general, at work, nudging them toward finding the information they need--first search for the library to be called, etc.--has been spotty. I think tool makers are putting effort into this from their end: newer versions of IDEs seem to do better than older ones, and model makers have added things like mid-reasoning tool use that could help. The raw Internet is not full of people transparently walking through info-gathering or introspecting about what they know or don't, so it probably falls on post-training to explicitly focus on these kinds of capabilities.
I don't know what you'd really do about it. You can lean on instruction-following and give a lot of examples and descriptions of specific times to ask specific kinds of questions (rough sketch below). You could use prompt distillation to try to turn that into better model tendencies. You could train on lots of transcripts (these days they'd probably include synthetic ones). You could do some kind of RL for skill at navigating situations where more info may be needed. You could treat "what info is needed and what behavior gets it?" as a type of problem to train on, like math problems.
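To make the first of those concrete, here's a minimal sketch of the "lean on instruction-following and give examples" idea: a hand-written system prompt plus a small few-shot exchange demonstrating when to ask a clarifying question instead of answering. The message schema and the example content are mine, not from the paper or any particular vendor's API, so treat it as an illustration rather than a recipe.

```python
# Minimal sketch (assumed message schema: role/content dicts, as in common
# chat-completion APIs). The prompt text and few-shot examples are invented.

SYSTEM_PROMPT = (
    "You are a triage assistant. Before suggesting any condition or next step, "
    "check whether you know: symptom duration, severity, the patient's age, and "
    "relevant history/medications. If any of these are missing, ask ONE short, "
    "specific question instead of answering."
)

# Few-shot turns showing *when* to ask versus when to answer.
FEW_SHOT = [
    {"role": "user", "content": "I have a rash on my arm."},
    {"role": "assistant", "content": "How long have you had it, and is it itchy, painful, or spreading?"},
    {"role": "user", "content": "Two days, itchy, not spreading. No new soaps or foods. I'm 34, no meds."},
    {"role": "assistant", "content": "That pattern is most consistent with a mild irritant or contact dermatitis; here's what to watch for..."},
]

def build_messages(user_turns: list[str]) -> list[dict]:
    """Compose system prompt + few-shot examples + the real conversation."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}, *FEW_SHOT]
    messages += [{"role": "user", "content": turn} for turn in user_turns]
    return messages

if __name__ == "__main__":
    for m in build_messages(["My chest feels tight sometimes."]):
        print(f"{m['role']}: {m['content'][:72]}")
```

Prompt distillation or RL would then be about turning that behavior into a default tendency of the model instead of something you have to re-specify in every prompt.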
I've heard others say before that real clinical education starts after medical school, once residency begins.
That 80% of medical issues could be categorized as "standard medicine" with some personalization to the patient?
In residency you obviously see a lot of complicated real-life cases, but aren't the majority of cases something a non-resident could guide, if not diagnose?
The two times that ChatGPT got a situation even somewhat wrong were:
- My kid had a rash and ChatGPT thought it was one thing. His symptoms changed slightly the next day, I typed in the new symptoms, and it got it immediately. We had to go to urgent care to get confirmation, but in hindsight ChatGPT had already solved it.
- In another situation my kid had a rash with somewhat random symptoms and the AI essentially said "I don't know what this is, but it's not a big deal as far as the data shows." It disappeared the next day.
It has never gotten anything wrong other than these rashes, including issues related to ENT, ophthalmology, head trauma, skincare, and more. Afaict it is basically really good at matching symptoms to known conditions and then describing the standard of care (and variations).
I now use it as my frontline triage tool for assessing risk. Specifically, if ChatGPT says "see a doctor soon/ASAP" I do it; if it doesn't say to see a doctor, I use my own judgment, i.e. I won't skip a doctor trip just because the AI said so if I'm nervous. This is all 100% anecdotes and I'm not disagreeing with the study, but I've been incredibly impressed by its ability to rapidly distill the medical standard of care.
Once, when I had to administer eyedrops to a parent and saw redness, I was being conservative and it told me the wrong drop to stop. The doctor saw my parent the next day, so it was all fixed, but it did lead to me freaking out.
Doctors behave very differently from how we normal humans behave. They go through testing that not many of us would be able to sit through, let alone pass. And they are taught a multitude of subjects so far removed from what everyone else learns that we have no way to truly communicate with them.
And this massive chasm is the problem, not that the LLM is the wrong tool.
Thinking probabilistically (mainly Bayesian) and understanding the first two years of med school will help you use an LLM much more effectively for your health.
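As a toy illustration of what "thinking probabilistically" buys you here (all numbers below are made up, not from the study or any medical source): even when your symptoms "match" a scary condition quite well, the base rate can keep the actual probability low.

```python
# Toy Bayes' rule example with invented numbers: how likely is a rare
# condition given that the symptom picture "matches" it?

prevalence = 0.001            # 1 in 1,000 people actually have the condition
p_match_if_sick = 0.95        # symptoms match in 95% of true cases
p_match_if_healthy = 0.05     # ...but also match in 5% of people without it

p_match = p_match_if_sick * prevalence + p_match_if_healthy * (1 - prevalence)
p_sick_given_match = p_match_if_sick * prevalence / p_match

print(f"P(condition | symptoms match) = {p_sick_given_match:.1%}")  # ~1.9%
```

An LLM answer that names the condition without surfacing that base-rate step is exactly where a bit of med-school-style background helps you read it critically.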
Then again, if you don't have access to good healthcare, need to wait weeks or months to get anywhere, or healthcare is extremely expensive, an LLM may be a good option, even with the chance of bad (possibly deadly) advice.
If there are any doctors here, would love to hear your opinion.
This is still impressive. Does it mean it can replace humans in the loop with no loss?
> Participants using an LLM identified relevant conditions less consistently than those in the control group, identifying at least one relevant condition in at most 34.5% of cases compared to 47.0% for the control.
So good old "do your own research" (hardly a gold standard itself, still, at 47%) is doing roughly 35% better for people than "talk to the chatbot."
The more interesting part is:
> We found that the LLMs suggested at least one relevant condition in at least 65.7% of conversations with participants [...] with observed cases of participants providing incomplete information and LLMs misinterpreting prompts
since this is nearly double the rate at which participants actually came away with a relevant condition identified, suggesting that the bots are far worse at the interaction than they are at the information. That's presumably trainable, but it also requires a certain patience and willingness on the part of the human, and coaxing that out of everyone all the time seems like a bit of a black art for a machine to learn.
But it's not just a failure to convince; it's also a failure to elicit the right information and/or understand it: the LLM, when prompted in a controlled fashion rather than having to hold a conversation with the participant, found at least one relevant condition even more often still!
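For what it's worth, putting the three rates quoted above side by side (same figures as in the excerpts, treating the quoted bounds as point values):

```python
# The three rates quoted from the paper, compared.
llm_assisted = 0.345   # participants using an LLM: >= 1 relevant condition identified
control      = 0.470   # control group, no LLM
llm_alone    = 0.657   # the LLM itself suggested >= 1 relevant condition in conversation

rel_gain = control / llm_assisted - 1        # ~0.36 -> control ~36% better, relatively
abs_gain = (control - llm_assisted) * 100    # 12.5 percentage points
ratio    = llm_alone / llm_assisted          # ~1.90 -> "nearly double"

print(f"control vs LLM-assisted: +{rel_gain:.0%} relative ({abs_gain:.1f} points)")
print(f"LLM suggestions vs what participants took away: x{ratio:.2f}")
```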