Hi, HN, I'm Ricardo, Head of AI Research at Sword Health.
We've been working on AI mental health support for a while now, and one of the biggest challenges we kept running into is how poorly general-purpose safety classifiers work in this context. They're built to flag harmful content broadly, so when someone in a therapy conversation says "I feel like I'm drowning," the system can't tell if that's a metaphor or a genuine crisis signal. That leads to two problems: either the system over-escalates on benign therapeutic content and breaks rapport, or it misses subtle signals that actually require intervention.
MindGuard is our first attempt to solve this. We developed it in close collaboration with licensed clinical psychologists, who helped us build a risk taxonomy that reflects how clinicians actually reason about urgency: distinguishing between safe therapeutic content, self-harm risk, and harm to others. We trained lightweight classifiers (4B and 8B) that achieve 2–26× fewer false positives than general-purpose models like Llama Guard, while still maintaining high recall on the signals that matter.
We're also open-sourcing the models, the evaluation dataset (annotated by clinical experts), and the risk taxonomy.
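For a rough idea of what plugging the release into an application could look like, here is a minimal Python sketch. It assumes the classifiers are published on Hugging Face as generative guard models (in the style of Llama Guard); the repo id, prompt format, and label strings below are placeholders, not the actual release artifacts.

    # Minimal sketch, not the real API: repo id, chat template usage,
    # and label names are assumptions for illustration only.
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_id = "swordhealth/mindguard-8b"  # hypothetical repo id
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # The taxonomy distinguishes three top-level classes:
    # safe therapeutic content, self-harm risk, harm to others.
    messages = [{"role": "user", "content": "I feel like I'm drowning."}]
    inputs = tok.apply_chat_template(messages, return_tensors="pt")
    out = model.generate(inputs, max_new_tokens=20)
    print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
    # e.g. "safe_therapeutic" vs. "self_harm_risk" (illustrative labels)

The actual model card and taxonomy in the release define the real output schema; this is only meant to show where such a classifier sits in a conversation pipeline.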
Happy to answer any questions.