"As an AI agent, a possible diagnosis is [xxx]. Ask your doctor about / look into [yyy™] for a possible solution!"
https://www.axios.com/2024/12/03/openai-ads-chatgpt
> OpenAI CFO Sarah Friar told the Financial Times that OpenAI is weighing the inclusion of ads in its products but wants to be "thoughtful about when and where we implement them."
It hallucinated serious cancer, along with all the associated details you’d normally find on a lab report. It had an answer to every question I had pre-asked about the report.
The report said the opposite: no cancer detected.
4o, o4? I'm certain it wasn't 3.5
Edit: while logged in
And as a non-normie, I obviously didn't take its analysis seriously, and compared it to Grok and Gemini 2.5. The latter was the best.
Sigh. This is a point in favor of not allowing free access to ChatGPT at all, given that people are getting mad at GPT-4o-mini, which is complete garbage for anything remotely complex... and garbage for most other things, too.
Just give 5 free queries of 4o/o3 or whatever and call it good.
What is happening?
I gave it a pdf of an engine manual recently and asked some questions, which it answered reasonably. It even pulled a schematic out for me, though it was the wrong one (it gave me a schematic for the CDI ignition variant that we first talked about, rather than the DSAI one we settled on later.)
FWIW, this is what ChatGPT gave me when I uploaded a lab report; it read the results back to me:
Here’s an interpretation of the blood test results you provided:
White Blood Cell Count (WBC) Value: 9.2 (Normal: 3.8 - 10.8 Thousand/uL)
Within the normal range, indicating no immediate signs of infection or inflammation.
Red Blood Cell Count (RBC) Value: 5.29 (Normal: 3.8 - 5.1 Million/uL)
Elevated (High). This could indicate dehydration, stress, or other conditions like polycythemia.
Hemoglobin Value: 10.9 (Normal: 11.7 - 15.5 g/dL)
Low. Suggests anemia, possibly due to iron deficiency or chronic disease.
Hematocrit Value: 36.0% (Normal: 35.0 - 45.0%)
Within the normal range, but on the lower side, supporting the presence of mild anemia.
Mean Corpuscular Volume (MCV) Value: 68.1 (Normal: 80.0 - 100.0 fL)
Low. Indicates microcytic anemia, often due to iron deficiency.
Mean Corpuscular Hemoglobin (MCH) Value: 20.6 (Normal: 27.0 - 33.0 pg)
Low. Suggests hypochromia (paler red blood cells), often associated with iron-deficiency anemia.
Mean Corpuscular Hemoglobin Concentration (MCHC) Value: 30.3 (Normal: 32.0 - 36.0 g/dL)
Slightly low. May reflect reduced hemoglobin in red blood cells, consistent with iron-deficiency anemia.
Red Cell Distribution Width (RDW) Value: 15.8% (Normal: 11.0 - 15.0%)
Elevated. Indicates variability in red blood cell size, common in iron-deficiency anemia.
Platelet Count Value: 397 (Normal: 140 - 400 Thousand/uL)
Normal, but on the higher side of the range.
Other Key Metrics Absolute Neutrophils, Lymphocytes, Monocytes, Eosinophils, Basophils: All within normal ranges, suggesting no significant abnormalities in immune cell populations.
Anisocytosis (1+): Indicates variation in red blood cell size.
Hypochromasia (1+): Suggests pale red blood cells, supporting iron-deficiency anemia.
Summary These results suggest iron-deficiency anemia, characterized by low hemoglobin, low MCV, low MCH, and high RDW. The elevated red blood cell count may reflect compensatory mechanisms or mild dehydration.
Your provider recommends a heart-healthy, low-fat diet, which is great for overall health. However, addressing the anemia might also involve increasing dietary iron or taking iron supplements.
The diagnosis is wrong, btw, I don't have iron deficiency. The anemia is caused by a genetic condition called thalassemia, which has been verified by genetic tests. You can use the Mentzer Index to differentiate the two on a simple CBC - https://www.mdcalc.com/calc/10534/mentzer-index
My numbers return a "probable diagnosis."
I was wondering if ChatGPT would catch it; nope, it didn't. It did say it was a possibility once I suggested it, though.
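For reference, the Mentzer Index is just MCV divided by the RBC count; here is a minimal sketch using the numbers from the report above (a screening heuristic only, not medical advice):

    # Mentzer Index screen: MCV / RBC. An index below ~13 suggests thalassemia
    # trait, above ~13 suggests iron-deficiency anemia.
    def mentzer_index(mcv_fl, rbc_million_per_ul):
        return mcv_fl / rbc_million_per_ul

    index = mentzer_index(68.1, 5.29)  # MCV 68.1 fL, RBC 5.29 million/uL (values from the report above)
    print(f"Mentzer Index: {index:.1f}")  # ~12.9 -> thalassemia more likely than iron deficiency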
I don't ever really use the term "thoroughly debunked" when referring to nutrition science; as you noted, a better term is that the claim is not supported by the evidence. I've seen enough things debunked and then rebunked to know that nutrition science is not really that accurate.
There was nothing in the report that the LLM could have picked up on to suggest otherwise.
I've found o3 & deep research to be very effective in guiding my health plan. One interesting anecdote - I got hit in the chest (right over the heart) quite hard a month or so ago. I prompted o3 with my ensuing symptoms and heart rate / oxygenation data from my Apple watch, and it already knew my health history from previous conversations. It gave very good advice and properly diagnosed me with a costochondral sprain. It gave me a timeline to expect (which ended up being 100% accurate) and treatments / ointments to help.
IMO - it's a good idea to have a detailed prompt ready to go with your health history, height/weight, medications and supplements, etc. If anything happens to you, you've got it handy to give to o3 to help with a diagnosis.
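To make that concrete, here is one hypothetical sketch of such a reusable prompt; the field names and example values are placeholders I made up, not from any particular product:

    # Hypothetical "health context" template to keep on hand. All fields and
    # example values below are illustrative placeholders - fill in your own.
    HEALTH_CONTEXT = """\
    Background for diagnostic discussion (not a substitute for medical care):
    - Age / sex: {age}, {sex}
    - Height / weight: {height_cm} cm, {weight_kg} kg
    - Medical history: {history}
    - Medications and supplements: {meds}
    - Wearable data (heart rate, SpO2, etc.): {wearables}

    Current symptoms and timeline:
    {symptoms}
    """

    prompt = HEALTH_CONTEXT.format(
        age=40, sex="male", height_cm=178, weight_kg=75,
        history="none significant", meds="vitamin D",
        wearables="resting HR 58 bpm, SpO2 98%",
        symptoms="chest-wall pain after an impact, worse on deep breaths",
    )
    print(prompt)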
The other stuff is good to have but ultimately a model that focuses on diagnosing medical conditions is going to be the most useful. Look - we aren't going to replace doctors anytime soon but it is good to have a second opinion from an LLM purely for diagnosis. I would hope it captures patterns that weren't observed before. This is exactly the sort of game that AI can beat a human at - large scale pattern recognition.
Finally I typed in my entire history into o3-deep-research and let it rip for a while. It came back with a theory for the injury that matched that one doctor, diagrams of muscle groups and even illustrations of proposed exercises. I'm not out of the woods yet, but I am cautiously optimistic for the first time in a long time.
Yes, they propose exercises.
No, they don't work.
For certain (common) conditions, PT seems to have it nailed - the exercises really help. For the others, it's just snake oil. Not backed by much research. The current state of the art is just not good when it comes to chronic pain.
So while I don't know if an LLM can be better than a battery of human experts, I do know that those human experts do not perform well. I'm guessing with the OP's case, that battery of human experts does not lead to a consensus - you just end up with 10 different treatments/diagnoses (and occasionally, one is a lot more common than the other, but it's still wrong).
No doctor or physio has ever been able to fix my chronic issues, and I've always had to figure them out myself through lots of self-study and experimentation.
I would not expect most physicians to have a deep fund of literature-backed knowledge to draw from regarding exercise. Telling someone to do an exercise probably doesn't compensate well.
That said, I'm also pretty negative about the availability of rigorous literature regarding much of nutrition, dentistry, podiatry, physical therapy, etc... you know, the things that affect the health of most human beings that have ever lived.
Because there is so much variability in individual injuries and physiology it's extremely difficult to do rigorous studies comparing different treatments. Like even something common like a rotator cuff tear isn't one single thing that can always be treated the same way. Patients and practitioners will often have to follow a trial-and-error process until they figure out what works in a particular case. Experienced providers who see a lot of cases eventually develop a lot of tacit knowledge about this in a way that's difficult to codify or explain.
I think you should take a step back and re-assess your internal heuristics.
In general a lot of those injuries will eventually heal on their own. So it's easy to fool yourself into believing that a particular treatment was effective even when the real cure was time.
So what use case does this test setup reflect? Is there a relevant commercial use case here?
For general medical Q&A I can't see how a specialized system would be better than base o3 with web search and a good prompt. If anything RAG and guardrail prompts would degrade performance.
With healthcare prices increasing at breakneck speed, I am sure AI will take on more and more of a role in diagnosing and treating people's common illnesses, and hopefully (doubt it) some of those savings will be transferred to the patients.
P.S. In contrast to the US system, in my home city (Rangoon, Burma/Myanmar), I have multiple clinics near my home and a couple of pharmacies within two bus stops' distance. I can either go buy most of the medications I need from the pharmacy (without a prescription) and take them on my own (why am I not allowed to take that risk?), OR I can go see a doctor at one of these clinics to confirm my diagnosis, pay him/her $10-$20 for the visit, and then head down to the pharmacy to buy the medication. Of course, some medications, such as those containing opioids, will only be sold to me with a doctor's prescription, but a good number of other meds are available as long as I can afford them.
We have a massive, massive shortage of doctors.
The industry is doing everything they can to make it worse by the day, so I won't hold my breath that we'll get the slightest bit of respite.
It'd obviously be ideal if everyone could see a doctor for an affordable price any time they wanted.
We don't live in the ideal world.
This would be a HUGE win for most people.
https://capa-acam.ca/pa-profession/pa-facts
https://www.srh-university.de/de/folder/news/2025/04-25/erst...
https://www.bigregister.nl/over-het-big-register/cijfers/ver...
In Germany and all other countries, PAs are unable to treat patients without direct oversight, and their numbers are in the single-digit percentages compared to NPs.
It's clear you have no experience in this area, so I wonder why the need to comment at all?
The data clearly shows that PA numbers continue increasing in many countries, so obviously they don't consider it a failed experiment and you're just lying to push some kind of personal agenda. It's clear you have no experience in this area, so I wonder why the need to comment at all?
Because you're paying for the expertise of someone who studied for more than a decade which you won't get from a random web search.
An AI system with today's technology should be less trustworthy for medical diagnosis than a web search. At least with a web search you might stumble upon a site with content from experts, assuming you trust yourself to be able to discern expert advice from bot-generated and spam content. Even if a doctor is doing the searching instead of me, I would pay them only for their knowledge to make that discernment for me. Why you think an AI could do better than a human at that is beyond me.
Your question reminds me of that famous Henry Ford GE invoice story:
> Making chalk mark on generator: $1.
> Knowing where to make mark: $9,999.
> Why you think an AI could do better than a human at that is beyond me.
Why do you think an AI couldn't do better than a human, when we have ample evidence of computers/AI exceeding humans in many areas?
I was specifically referring to the ability of discerning between accurate content and nonsense. SOTA LLMs today produce nonsensical output themselves, partly due to their training data being from poor quality sources. Cleaning up and validating training data for accuracy is an unsolved and perhaps unsolvable problem. We can't expect AI to do this for us, since this requires judgment from expert humans. And for specific applications such as healthcare, accuracy is not something you can ignore by placing a disclaimer.
This is the problem with reasoning from first principles. This statement is easily proven false by giving it a try, whether it "should" be true or not.
The trouble is you are not educated enough to tell what is simple and what isn't. A cough could be just a cough or it could be something more serious; only a "real" examination will reveal that. And sometimes even that's not enough - you need an examination by a specialist.
I'll tell you a story. Once upon a time I got pain in my balls. I went to a doctor and he felt around and he said he didn't feel anything. I went to another doctor and he felt something, but he had no idea what it was. He said could be a cyst, could be a swollen vein, could be an infection - he didn't even know if it was on the testicle or on the tube thingy.
Then I went to a Urologist. You can tell this man has felt up a lot of balls. He felt me up and said, "yup, that's a tumor" almost immediately. He was right, of course, and he ended up being the one to remove it too. Since I caught the cancer pretty early the chemotherapy wasn't too intense.
Point is, expertise matters when things aren't straightforward. That's when experience and perspective get to shine.
And even there, I bet ChatGPT would have told you to go see a doctor, since it can't feel your balls. And after your first appointment, if you had told it that you still thought something was wrong, it would probably have told you to go see a urologist.
This insanity needs to be regulated yesterday.
[1] https://psnet.ahrq.gov/primer/duty-hours-and-patient-safety
[2] https://www.fmcsa.dot.gov/sites/fmcsa.dot.gov/files/docs/Dri...
They were also found not to improve patient outcomes (possibly due to increased number of handoffs, which are highly error prone).
Because while LLMs obviously have massive limitations, so do humans, and it's not entirely clear to me that some synthesis of the two can't produce much better results than either on its own.
In theory, I agree with you. The world "some" is doing a lot of heavy lifting there though. I only hope that whatever definition of some emerges, it's not a horribly flawed one.
We see this today with AI-generated content on the web, and a flood of sloppily put together software produced by people who swear that AI is making them more productive. There's little interest in judging the output, and a lot of interest in lazy cash grabs. There are no guardrails in place in the healthcare industry AFAIA to prevent the same happening there, which is a scary thought.
The best we can do is test a human's mastery of a subject to estimate how well they actually know and understand that topic. Which is exactly what OpenAI is doing here.
What I care about is the results. If the "grade" is 10%, then I don't want to rely on it, whether it's a human or an AI. If it's 95%, then I feel fine about relying on it. Especially since I suspect that very soon, most doctors would not score as well on a benchmark like this as the SOTA models.
It’s a pity they don’t support the Greek language, keeping in mind that almost all medical terminology has Greek origins.
Anyhow, this is a step in the right direction, and it will surely aid many people looking for medical assistance via ChatGPT.
Fewer people using them.
I think the actually-relevant issue here is that until last month there wasn't API access for Grok 3, so no one could test or benchmark it, and you couldn't integrate it into tools that you might want to use it with. They only allowed Grok 2 in their API, and Grok 2 was a pretty bad model.
Also, only one out of the ten models benchmarked has open weights, so I'm not sure what GP is arguing for.
Not talking about TFA or the benchmarks, but about the news coverage/user sentiment ...