HealthBench

https://openai.com/index/healthbench/
84•mfiguiere•3h ago

Comments

Zaheer•2h ago
Impressive how well Grok performs in these tests. Grok feels 'underrated' in terms of how much other models (gemini, llama, etc) are in the news.
tough•2h ago
you can't download grok's weights to run locally
simianwords•1h ago
how is that relevant here?
tough•1h ago
it helps explain why there's less people talking about them than gemini or llama?

less people using them.

Insanity•1h ago
I can guarantee you none of my friends (not in tech) use “downloading weights” as an input to select an LLM application.
simianwords•1h ago
isn't chatgpt the most used or most popular model?
tough•1h ago
Yes, OpenAI has a first-mover advantage, and Claude seems to be a close second with its closed models too. Open weights are not a requirement for success, but in an already crowded market (grok's prospect) their proposition competes neither with the top-tier closed models nor with the maybe less capable but more available, battle-tested open ones that are free to run locally.
reissbaker•1h ago
You can't download Gemini's weights either, so it's not relevant as a comparison against Gemini.

I think the actually-relevant issue here is that until last month there wasn't API access for Grok 3, so no one could test or benchmark it, and you couldn't integrate it into tools that you might want to use it with. They only allowed Grok 2 in their API, and Grok 2 was a pretty bad model.

tough•1h ago
lol sorry mixed them up w gemma3 which feels like the open lesser cousin to gemini 2.5/2.0 models
moralestapia•1h ago
It's not.

Also, only one out of the ten models benchmarked has open weights, so I'm not sure what GP is arguing for.

tough•1h ago
> in terms of how much other models (gemini, llama, etc) are in the news.

not talking about TFA or benchmarks but the news coverage/user sentiment ...

ramon156•2h ago
I don't want to be a conspiracy theorist, but could this be in preparation for Amazon's (to be) health branch?
srameshc•2h ago
Is the Med-PaLM model that Google has been working on meant to be considered for comparison? If I'm not mistaken, it isn't publicly available.

> https://sites.research.google/med-palm/

aix1•2h ago
Med-PaLM is old and has been superseded by (multiple generations of) Gemini.
GuinansEyebrows•2h ago
i have zero trust in openai's ability to do anything impartially. why should we leave the judgement of a private tool up to the makers of the tool especially when human lives are at stake?
simianwords•1h ago
I agree - we should exercise a bit of caution here. There is no way they would release a benchmark which makes their model look bad. But then again, we know that their models are among the best for other uses, so it's not a big leap to accept this benchmark.
beezlebroxxxxxx•1h ago
I can already see the pharma salesmen drooling at the idea of how various symptoms can be marketed to.

"As an AI agent, a possible diagnosis is [xxx]. Ask your doctor about / look into [yyy™] for a possible solution!"

ceejayoz•1h ago
And OpenAI is definitely thinking about this on their end:

https://www.axios.com/2024/12/03/openai-ads-chatgpt

> OpenAI CFO Sarah Friar told the Financial Times that OpenAI is weighing the inclusion of ads in its products but wants to be "thoughtful about when and where we implement them."

barnas2•58m ago
Ad spots inside chatgpt are going to be worth an obscene amount of money.
amarcheschi•1h ago
I think that the damage of "chatgpt misdiagnoses X as Y, person dies of Z" would be quite bad for PR
dcreater•1h ago
Isn't there an obvious conflict of interest when the model maker is also the creator of a benchmark? I think at the very least it should be from a separate business entity under the non profit or from the non profit holding entity itself
gwd•43m ago
I don't think it's necessarily bad to have the benchmark, but the graphs of Gemini and Claude doing worse than o3 did kind of leave a bad taste in my mouth. "Oh look, your models are worse than ours at this very important metric that we just made up! How terrible!"
pizzathyme•1h ago
Non-clinicians are using ChatGPT every day now to try to find assistance (right or wrong) to real-life medical problems. This is a great evaluation set that could prevent a lot of harm
unsupp0rted•1h ago
Recently I uploaded a lab report to chatGPT and asked it to summarize it.

It hallucinated serious cancer, along with all the associated details you’d normally find on a lab report. It had an answer to every question I had pre-asked about the report.

The report said the opposite: no cancer detected.

maliker•1h ago
Interesting. What LLM model? 4o, o3, 3.5? I had horrible performance with earlier models, but o3 has helped me with health stuff (hearing issues).
unsupp0rted•26m ago
Whichever the default free model is right now- I stopped paying for it when Gemini 2.5 came out in Google's AI lab.

4o, o4? I'm certain it wasn't 3.5

pants2•15m ago
If you're logged in, 4o; if you're not logged in, 4o-mini. Neither scores well on the benchmark!
askafriend•9m ago
This gets at the UX issue with AI right now. How's a normie supposed to know and understand this nuance?
maliker•10m ago
Might be worth trying again with Gemini 2.5. The reasoning models like that one are much better at health questions.
icelancer•4m ago
> Whichever the default free model is right now

Sigh. This is a point in favor of not allowing free access to ChatGPT at all given that people are getting mad at GPT-4o-mini which is complete garbage for anything remotely complex... and garbage for most other things, too.

Just give 5 free queries of 4o/o3 or whatever and call it good.

Gracana•1h ago
I wonder if it was unable to read your report, and just answered as if role-playing?

I gave it a pdf of an engine manual recently and asked some questions, which it answered reasonably. It even pulled a schematic out for me, though it was the wrong one (it gave me a schematic for the CDI ignition variant that we first talked about, rather than the DSAI one we settled on later.)

arcanemachiner•10m ago
No, cancer detected!
iNic•1h ago
I like that they include the "worst case score at k samples". This is a much more realistic view of what will happen, because someone will get that 1/100 response.
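As a rough illustration of the "worst case at k samples" idea: if a model has been sampled n times on one prompt and each response scored, the expected score of the worst response among k of those samples can be computed exactly. This is only a sketch under stated assumptions, not necessarily how HealthBench computes its chart; the function name `expected_worst_at_k` and the scores below are made up for illustration.

```python
from math import comb


def expected_worst_at_k(scores, k):
    """Expected minimum score over a uniformly random k-subset of `scores`.

    Hypothetical helper for illustration; not taken from HealthBench.
    """
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(scores)")
    ordered = sorted(scores)
    total = comb(n, k)  # number of distinct k-subsets
    # The i-th smallest score (0-based) is the minimum of a k-subset exactly
    # when the other k-1 members are drawn from the n-1-i scores after it.
    return sum(s * comb(n - 1 - i, k - 1) for i, s in enumerate(ordered)) / total


# Made-up per-response scores for a single prompt (assumption, not real data).
samples = [0.9, 0.85, 0.1, 0.8]
for k in (1, 2, 4):
    print(k, round(expected_worst_at_k(samples, k), 3))
```

With k=1 this is just the mean score; as k grows, the single bad response dominates, which is the "someone will get that 1/100 response" effect.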
pants2•1h ago
This appears to be a very thoughtful and helpful study. It's also impressive to see the improvement in performance in just the last year of model development - almost double.

I've found o3 & deep research to be very effective in guiding my health plan. One interesting anecdote - I got hit in the chest (right over the heart) quite hard a month or so ago. I prompted o3 with my ensuing symptoms and heart rate / oxygenation data from my Apple watch, and it already knew my health history from previous conversations. It gave very good advice and properly diagnosed me with a costochondral sprain. It gave me a timeline to expect (which ended up being 100% accurate) and treatments / ointments to help.

IMO - it's a good idea to have a detailed prompt ready to go with your health history, height/weight, medications and supplements, etc. If anything happens to you, you've got it handy to give to o3 to help with a diagnosis.

simianwords•1h ago
I would really rather have a benchmark focusing purely on diagnosis. Symptoms, patient history vs the real diagnosis. Maybe name this model House M.D. 1.0 or something.

The other stuff is good to have, but ultimately a model that focuses on diagnosing medical conditions is going to be the most useful. Look - we aren't going to replace doctors anytime soon, but it is good to have a second opinion from an LLM purely for diagnosis. I would hope it captures patterns that weren't observed before. This is exactly the sort of game that AI can beat a human at - large-scale pattern recognition.

mrcwinn•1h ago
Happy to see this. I've struggled with an injury for the past five years. I've been to multiple sports-focused physicians, had various scans. Responses from doctors have ranged from "everything seems fine, can't really figure this out" to [completely wrong hypothesis]. Tried acupuncture. Tried a chiropractor. I remember one doctor, though, had an interesting thought that seemed to make sense - but I've been so discouraged from so many false starts or misplaced hope, I didn't bother following up.

Finally I typed in my entire history into o3-deep-research and let it rip for a while. It came back with a theory for the injury that matched that one doctor, diagrams of muscle groups and even illustrations of proposed exercises. I'm not out of the woods yet, but I am cautiously optimistic for the first time in a long time.

Noumenon72•1h ago
I hope recent cuts to government science have managed to hit enough of the safetyists and industry captures who keep us from just trying out new healthcare approaches like this and learning. They'd like nothing better than to replace the help you got with "As a large language model, I am unable to offer medical advice."
candiddevmike•57m ago
Why would you trust an LLM over a battery of human experts? I find it hard to believe that the doctors never proposed exercises or some kind of physical therapy for you, at least in the US.
BeetleB•15m ago
I can't speak to the OP's condition, but having seen plenty of doctors and physical therapists in the US for over a decade:

Yes, they propose exercises.

No, they don't work.

For certain (common) conditions, PT seems to have it nailed - the exercises really help. For the others, it's just snake oil. Not backed by much research. The current state of the art is just not good when it comes to chronic pain.

So while I don't know if an LLM can be better than a battery of human experts, I do know that those human experts do not perform well. I'm guessing with the OP's case, that battery of human experts does not lead to a consensus - you just end up with 10 different treatments/diagnoses (and occasionally, one is a lot more common than the other, but it's still wrong).

kypro•49m ago
Why are all the label colours for the "Worst-case HealthBench score at k samples" chart the same colour and the same shape? Completely unreadable.
andy99•40m ago
My sense is that these benchmarks are not realistic in terms of the way the model is used. People building specialized AI systems are not, in my experience, letting users just chat with a base model; they would have some variant of RAG plus some guardrails plus other stuff (like routing to pre-written answers for common questions).

So what use case does this test setup reflect? Is there a relevant commercial use case here?

programmertote•20m ago
I have no doubt that a lot of garden-variety diagnoses and treatments can be done by an AI system that is fine-tuned and vetted to accomplish the task. I recently had to pay $93 for a virtual session with a physician to get a prescription for a cough syrup, which I already knew to take before talking to her because I had done some research/reading. Some may argue, "Doctors studied for years in med school and you shouldn't trust Google more than them", but knowing human fallibility, and knowing that a lot of doctors do look things up on places like https://www.wolterskluwer.com/en/solutions/uptodate to refresh/reaffirm their knowledge, I'd argue that if we are willing to take the risk, why shouldn't we be allowed to take that risk on our own? Why do I have to pay $93 (on top of the cough syrup that cost ~$44) just so that the doctor can see me on Zoom for less than 5 mins and submit an order for the med?

With healthcare prices increasing at breakneck speed, I am sure AI will take on a larger and larger role in diagnosing and treating people's common illnesses, and hopefully (doubt it) some of those savings will be passed on to the patients.

P.S. In contrast to the US system, in my home city (Rangoon, Burma/Myanmar), I have multiple clinics near my home and a couple of pharmacies within two bus stops' distance. I can either go buy most of the medications I need from the pharmacy (without a prescription) and take them on my own (why am I not allowed to take that risk?) OR I can go see a doctor at one of these clinics to confirm my diagnosis, pay him/her $10-$20 for the visit, and then head down to the pharmacy to buy the medication. Of course, some of the medications that include opioids will only be sold to me with the doctor's prescription, but a good number of other meds are available as long as I can afford them.

BeetleB•18m ago
Where are you that you need a prescription to get cough medicine? The only ones I know of that require prescription are the ones with controlled substances.
