Medical triage in our context means deciding whether symptoms require emergency care, urgent care, or can be managed with self-care. This matters because LLMs are increasingly becoming the “digital front door” for health concerns, replacing the instinct to just Google it.
Getting triage wrong can be dangerous (missed emergencies) or costly (unnecessary ER visits).
We’ve open-sourced TriageBench, a reproducible framework for evaluating LLM triage accuracy. It includes:
- A standard clinical dataset (Semigran vignettes)
- Paired McNemar’s test to detect model performance differences on small datasets
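For readers unfamiliar with the paired test mentioned above, here is a minimal sketch of an exact (binomial) McNemar's test on per-vignette correctness. It assumes both models are scored on the same vignettes. The function name and the toy data are illustrative only and are not taken from the TriageBench repository; the post does not say whether TriageBench uses the exact or the chi-square variant.

```python
from math import comb


def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar test for two models scored on the same cases.

    a_correct / b_correct: equal-length sequences of booleans, one per vignette.
    Returns (a_only, b_only, p_value), where a_only counts vignettes that only
    model A got right and b_only counts vignettes that only model B got right.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both models must be scored on the same vignettes")

    # Only discordant pairs carry information about a difference.
    a_only = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    b_only = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0  # the models never disagree

    # Under H0 each discordant pair favors either model with probability 1/2.
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy data (hypothetical, NOT the benchmark results): 45 vignettes.
    model_a = [True] * 40 + [False] * 5
    model_b = [True] * 32 + [False] * 8 + [True] * 1 + [False] * 4
    print(mcnemar_exact(model_a, model_b))
```

Because the test looks only at vignettes where the two models disagree, it can detect a difference on a small dataset that a comparison of two independent accuracy figures would miss.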
As a demonstration, we benchmarked our own model (MedAsk) against several OpenAI models:
- MedAsk: 87.6% accuracy
- o3: 75.6%
- GPT‑4.5: 68.9%
The main limitation is dataset size (45 vignettes). We're looking for collaborators to help expand this - the field needs larger, more diverse clinical datasets.
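To make the sample-size limitation concrete, here is a rough sketch of a 95% Wilson score interval for an accuracy measured on 45 cases. The count of 34 correct out of 45 is a hypothetical example (it happens to round to 75.6%), not a figure reported in the post, and the function name is illustrative.

```python
from math import sqrt


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical counts: same accuracy, 45 cases versus 450 cases.
    for correct, total in [(34, 45), (340, 450)]:
        low, high = wilson_interval(correct, total)
        print(f"{correct}/{total}: {low:.3f} to {high:.3f}")
```

With 45 cases the interval spans more than 20 percentage points, which is why a larger dataset would make single-model accuracy figures much easier to interpret.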
On the other hand, we can also diagnose the LLM itself: its activation values are its EEG and its gradients are its BOLD signal. If you can afford the compute cost, you can even calculate its true variational free energy, that is, the KL divergence.
"Don't just train your model, understand its mind."
klemenvod•5h ago
In addition to the dataset and McNemar’s test listed above, TriageBench includes full methodology and evaluation code.
GitHub: https://github.com/medaks/medask-benchmark
Blog post with full results: https://medask.tech/blogs/medical-ai-triage-accuracy-2025-me...