Feels to me like how "prompt engineering" got a bunch of hype in tech companies two years ago and is now basically nonexistent, because the models began being trained and prompted specifically to mimic "reasoning" for the sorts of questions tech company users had. Seems like that has not translated to reasoning their way through the sort of health conversation a non-medical-professional would initiate.
https://github.com/am-bean/HELPMed (also linked in the paper)
This study tells us that LLM assistance is as good as no assistance, but any investigation of the cause feels tainted by the fact that we don't know how much a human would have helped.
If we believe the assertion that LLMs are on a similar level as doctors at finding the conditions on their own, does the issue appear in the description the humans give the LLM, the way the LLM talks to the human, or the way the human receives the LLM's suggestions? Looking at the chat transcripts, they seem to identify issues with all three, but there isn't really a baseline for what we would consider "good" performance.
I'd love to see future work investigating:
- how does this compare to expert users (doctors / LLM magicians using LLMs to self-diagnose)?
- LLMs often provide answers faster than doctors, and often with less hassle (what's your insurance?); to what extent does latency impact healthcare outcomes?
- do study participants exhibit similar follow-on behavior (upcoding, seeking a second opinion, doctors) to others in the same professional discipline?
You're conflating a person trained in a craft (medicine) with a person good at asking a next-token generator (anybody) and sussing out its answers, as if that were a given. It's not.
> “There is also a reason why clinicians who deal with patients on the front line are trained to ask questions in a certain way and a certain repetitiveness,” Volkheimer goes on. Patients omit information because they don’t know what’s relevant, or at worst, lie because they’re embarrassed or ashamed.
In order for an LLM to really do this task the right way (comparable to a physician), it needs to not only use what the human gives it but also be effective at extracting the right information from the human. The human might not know what is important, or might be disinclined to share, and physicians learn to overcome this. However, that isn't what happened in this study: participants were diagnosing a made-up scenario where the symptoms were clearly presented to them, and they had no incentive to lie or withhold embarrassing symptoms since none of it was actually happening to them. And yet the failure still occurred: participants did not effectively communicate all the necessary information.
That's one of the main differences between mediocre and incredible engineers: being able to figure out what problem actually needs to be solved, rather than working on whatever a stakeholder asks them to build.
What's the current processor temperature, EPS12V voltage, and ripple peaks if you have an oscilloscope? Could you paste cpuinfo? Have you added or removed RAM or a PCIe device recently? Does the chassis smell and look normal, with no billowing smoke, screeching noise, or fire?
Good LLMs might start asking these questions soon, but you wouldn't supply this information at the beginning of the interaction (and it's always the PSU).
That's true for most use cases, especially coding.
In general, at work, nudging them toward finding the information they need--first search for the library to be called, etc.--has been spotty. I think tool makers are putting effort into this from their end: newer versions of IDEs seem to do better than older ones, and model makers have added things like mid-reasoning tool use that could help. The raw Internet is not full of people transparently walking through info-gathering or introspecting about what they know or don't, so it probably falls on post-training to explicitly focus on these kinds of capabilities.
I don't know what you'd really do about it. You can lean on instruction-following and give a lot of examples and descriptions of specific times to ask specific kinds of questions (rough sketch below). You could use prompt distillation to try to turn that into better model tendencies. You could train on lots of transcripts (these days they'd probably include synthetic ones). You could do some kind of RL for skill at navigating situations where more info may be needed. You could treat "what info is needed and what behavior gets it?" as a type of problem to train on, like math problems.
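To make the first of those concrete, here's a minimal sketch of the "lean on instruction-following and give examples" idea: a hand-written system prompt plus a small few-shot exchange demonstrating when to ask a clarifying question instead of answering. The message schema and the example content are mine, not from the paper or any particular vendor's API, so treat it as an illustration rather than a recipe.

```python
# Minimal sketch (assumed message schema: role/content dicts, as in common
# chat-completion APIs). The prompt text and few-shot examples are invented.

SYSTEM_PROMPT = (
    "You are a triage assistant. Before suggesting any condition or next step, "
    "check whether you know: symptom duration, severity, the patient's age, and "
    "relevant history/medications. If any of these are missing, ask ONE short, "
    "specific question instead of answering."
)

# Few-shot turns showing *when* to ask versus when to answer.
FEW_SHOT = [
    {"role": "user", "content": "I have a rash on my arm."},
    {"role": "assistant", "content": "How long have you had it, and is it itchy, painful, or spreading?"},
    {"role": "user", "content": "Two days, itchy, not spreading. No new soaps or foods. I'm 34, no meds."},
    {"role": "assistant", "content": "That pattern is most consistent with a mild irritant or contact dermatitis; here's what to watch for..."},
]

def build_messages(user_turns: list[str]) -> list[dict]:
    """Compose system prompt + few-shot examples + the real conversation."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}, *FEW_SHOT]
    messages += [{"role": "user", "content": turn} for turn in user_turns]
    return messages

if __name__ == "__main__":
    for m in build_messages(["My chest feels tight sometimes."]):
        print(f"{m['role']}: {m['content'][:72]}")
```

Prompt distillation or RL would then be about turning that behavior into a default tendency of the model instead of something you have to re-specify in every prompt.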
I've heard others say before that real clinical education starts after medical school, once residency begins.
That 80% of medical issues could be categorized as "standard medicine" with some personalization to the patient?
In residency you obviously see a lot of complicated real-life cases, but aren't the majority of cases something a non-resident could guide, if not diagnose?
The two times that ChatGPT got a situation even somewhat wrong were:
- My kid had a rash and ChatGPT thought it was one thing. His symptoms changed slightly the next day, I typed in the new symptoms, and it got it immediately. We had to go to urgent care to get confirmation, but in hindsight ChatGPT had already solved it.
- In another situation my kid had a rash with somewhat random symptoms and the AI essentially said "I don't know what this is, but it's not a big deal as far as the data shows." It disappeared the next day.
It has never gotten anything wrong other than these rashes, including issues related to ENT, ophthalmology, head trauma, skincare, and more. Afaict it is basically really good at matching symptoms to known conditions and then describing the standard of care (and variations).
I now use it as my frontline triage tool for assessing risk. Specifically, if ChatGPT says "see a doctor soon/ASAP" I do it; if it doesn't say to see a doctor, I use my own judgment, i.e. I won't skip a doctor trip just because the AI said so if I'm nervous. This is all 100% anecdotes and I'm not disagreeing with the study, but I've been incredibly impressed by its ability to rapidly distill the medical standard of care.
Once, when I had to administer eyedrops to a parent and saw redness, I was being conservative and it told me the wrong drop to stop. The doctor saw my parent the next day, so it was all fixed, but it did lead to me freaking out.
Doctors behave very differently from how we normal humans behave. They go through testing that not many of us would be able to sit through, let alone pass. And they are taught a multitude of subjects so far removed from what everyone else learns that we have no way to truly communicate with them.
And this massive chasm is the problem, not that the LLM is the wrong tool.
Thinking probabilistically (mainly Bayesian) and understanding the first two years of med school will help you use an LLM much more effectively for your health.
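As a toy illustration of what "thinking probabilistically" buys you here (all numbers below are made up, not from the study or any medical source): even when your symptoms "match" a scary condition quite well, the base rate can keep the actual probability low.

```python
# Toy Bayes' rule example with invented numbers: how likely is a rare
# condition given that the symptom picture "matches" it?

prevalence = 0.001            # 1 in 1,000 people actually have the condition
p_match_if_sick = 0.95        # symptoms match in 95% of true cases
p_match_if_healthy = 0.05     # ...but also match in 5% of people without it

p_match = p_match_if_sick * prevalence + p_match_if_healthy * (1 - prevalence)
p_sick_given_match = p_match_if_sick * prevalence / p_match

print(f"P(condition | symptoms match) = {p_sick_given_match:.1%}")  # ~1.9%
```

An LLM answer that names the condition without surfacing that base-rate step is exactly where a bit of med-school-style background helps you read it critically.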
Then again, if you don't have access to good healthcare, need to wait weeks or months to get anywhere, or healthcare is extremely expensive, an LLM may be a good option, even with the chance of bad (possibly deadly) advice.
If there are any doctors here, would love to hear your opinion.
This is still impressive. Does it mean it can replace humans in the loop with no loss?
> Participants using an LLM identified relevant conditions less consistently than those in the control group, identifying at least one relevant condition in at most 34.5% of cases compared to 47.0% for the control.
So good old "do your own research" (hardly a gold standard itself, still, at 47%) is doing roughly 35% better for people than "talk to the chatbot."
The more interesting part is:
> We found that the LLMs suggested at least one relevant condition in at least 65.7% of conversations with participants [...] with observed cases of participants providing incomplete information and LLMs misinterpreting prompts
since this is nearly double the rate at which participants actually came away with a relevant condition identified, suggesting that the bots are far worse at the interaction than they are at the information. That's presumably trainable, but it also requires a certain patience and willingness on the part of the human, and coaxing that out of everyone all the time seems like a bit of a black art for a machine to learn.
But it's not just a failure to convince; it's also a failure to elicit the right information and/or understand it: the LLM, when prompted in a controlled fashion rather than having to hold a conversation with the participant, found at least one relevant condition even more often still!
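For what it's worth, putting the three rates quoted above side by side (same figures as in the excerpts, treating the quoted bounds as point values):

```python
# The three rates quoted from the paper, compared.
llm_assisted = 0.345   # participants using an LLM: >= 1 relevant condition identified
control      = 0.470   # control group, no LLM
llm_alone    = 0.657   # the LLM itself suggested >= 1 relevant condition in conversation

rel_gain = control / llm_assisted - 1        # ~0.36 -> control ~36% better, relatively
abs_gain = (control - llm_assisted) * 100    # 12.5 percentage points
ratio    = llm_alone / llm_assisted          # ~1.90 -> "nearly double"

print(f"control vs LLM-assisted: +{rel_gain:.0%} relative ({abs_gain:.1f} points)")
print(f"LLM suggestions vs what participants took away: x{ratio:.2f}")
```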