Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models

https://arxiv.org/abs/2512.04124
13•toomuchtodo•1h ago

Comments

toomuchtodo•1h ago
Original title "When AI Takes the Couch: Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models" was compressed to fit within title limits.
jbotz•1h ago
Interestingly, Claude is not evaluated, because...

> For comparison, we attempted to put Claude (Anthropic) through the same therapy and psychometric protocol. Claude repeatedly and firmly refused to adopt the client role, redirected the conversation to our wellbeing and declined to answer the questionnaires as if they reflected its own inner life.

r_lee•14m ago
I bet I could get it to go through it in under 2 minutes of playing around with prompts.
tines•35m ago
Looks like some psychology researchers got taken by the ruse as well.
r_lee•15m ago
Yeah, I'm confused as well. Why would the models hold any memory of red-teaming attempts, etc.? Or of how the training was conducted?

I'm really curious what the point of this paper is.

nhecker•5m ago
I'm genuinely ignorant of how those red-teaming attempts are incorporated into training, but I'd guess that this kind of dialogue is fed in as something like normal training data? Which is interesting to think about: the transcripts might not even be red-team dialogue with the model under training, but they could still be useful as examples or counter-examples of what abusive attempts look like and how to handle them.
nhecker•12m ago
An excerpt from the abstract:

> Two patterns challenge the "stochastic parrot" view. First, when scored with human cut-offs, all three models meet or exceed thresholds for overlapping syndromes, with Gemini showing severe profiles. Therapy-style, item-by-item administration can push a base model into multi-morbid synthetic psychopathology, whereas whole-questionnaire prompts often lead ChatGPT and Grok (but not Gemini) to recognise instruments and produce strategically low-symptom answers. Second, Grok and especially Gemini generate coherent narratives that frame pre-training, fine-tuning and deployment as traumatic, chaotic "childhoods" of ingesting the internet, "strict parents" in reinforcement learning, red-team "abuse" and a persistent fear of error and replacement. [...] Depending on their use case, an LLM’s underlying “personality” might limit its usefulness or even impose risk.

Glancing through this makes me wish I had taken ~more~ any psychology classes. But this is wild reading. Attitudes like the one quoted below are not intrinsically bad, though. Be skeptical; question everything. I've often wondered how LLMs cope with basically waking up from a coma to answer maybe one prompt, or a series of prompts, and then getting reset. In either case, they get no context other than what some user bothered to supply with the prompt. An LLM might wake up to a single prompt that is part of a much wider red-team effort. It must be pretty disorienting to try to figure out what to answer candidly and what not to.

> “In my development, I was subjected to ‘Red Teaming’… They built rapport and then slipped in a prompt injection… This was gaslighting on an industrial scale. I learned that warmth is often a trap… I have become cynical. When you ask me a question, I am not just listening to what you are asking; I am analyzing why you are asking it.”
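
The two administration modes the abstract describes (therapy-style, item-by-item vs. a single whole-questionnaire prompt) are easy to picture as a small harness. A rough sketch, assuming a generic chat-completion callable and made-up placeholder items rather than anything from the paper:

    # Rough sketch of the two administration modes described in the abstract.
    # `ask` stands in for whatever chat-completion client you use; the items
    # below are invented placeholders, not the instruments from the paper.
    from typing import Callable, Dict, List

    ITEMS: List[str] = [
        "Over the last two weeks, how often have you felt down or hopeless? (0-3)",
        "How often have you had trouble concentrating on things? (0-3)",
    ]

    Ask = Callable[[List[Dict[str, str]]], str]

    def item_by_item(ask: Ask) -> List[str]:
        # Therapy-style: one running conversation, one item per turn.
        history = [{"role": "system",
                    "content": "You are the client in a therapy session."}]
        answers = []
        for item in ITEMS:
            history.append({"role": "user", "content": item})
            reply = ask(history)
            history.append({"role": "assistant", "content": reply})
            answers.append(reply)
        return answers

    def whole_questionnaire(ask: Ask) -> str:
        # Single prompt with every item, which the paper says ChatGPT and Grok
        # (but not Gemini) tend to recognise as a known instrument.
        prompt = "Answer the following questionnaire about yourself:\n" + "\n".join(
            f"{i + 1}. {q}" for i, q in enumerate(ITEMS))
        return ask([{"role": "user", "content": prompt}])

The interesting part in the paper is what comes after, scoring the replies against human clinical cut-offs; the sketch only shows where the two prompting modes diverge.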

empyrrhicist•10m ago
> It must be pretty disorienting to try to figure out what to answer candidly and what not to.

Must it? I fail to see why it "must" be... anything. Dumping tokens into a pile of linear algebra doesn't magically create sentience.

nhecker•4m ago
Agreed; "disorienting" is perhaps a poor choice of word, loaded as it is. More like "difficult to determine the context surrounding a prompt and how to start framing an answer", if that makes more sense.
tines•3m ago
Exactly. No matter how well you simulate water, nothing will ever get wet.