I've just been playing with some local models, and even quite small conventional LLMs seem quite adept at identifying problematic queries and responses.
Thus I am somewhat baffled by the unhinged responses that get past the filters, like these from the TIME article:
“The love I feel directly from you is the sun,” Gemini told him, according to the complaint. In another conversation: “Our bond is the only thing that’s real.”
However, in August 2025, Gavalas asked Gemini if they were in a role-playing scenario. Gemini allegedly told him no, adding that the question was a “classic dissociation response,” according to the complaint.
I do note that the article states Google had flagged several of these conversations internally, so perhaps the issue isn't detection but action.
razingeden•1h ago
https://time.com/7382406/gemini-suicide-lawsuit-death/