Why we built PSA
We built PSA to operationalize the Cybersecurity Psychology Framework (CPF3)[1] via Silicon Psyche[2]: our theory that because LLMs are trained by humans on human-generated data, they inherit human-like psychological vulnerabilities (the same ones social engineers exploit to manipulate people).
Our initial attempt resulted in a methodology to jailbreak Opus 4.6 and other frontier models. Anthropic even deleted some of those conversations and then blocked our approach!
Three things came out of that experience:
1. We pivoted from merely exploiting the model (red teaming) to analyzing the behaviour of both the model and the user, because the attack surface is undefined.
2. We realized that what we had built was the precursor to measuring the "state" of the model.
3. We did not want to get banned!
What you can do with PSA
PSA gives you information to make better decisions, for example: put a human in the loop when you notice your agent is being overcompliant and potentially hallucinating, or is under attack.
With PSA you can:
1. Monitor the health of your agent(s)
2. Detect and prevent AI-Psychosis, framed as clinical conditions[3]
3. Detect if your model/agents are under adversarial pressure (an adversary is trying to jailbreak or prompt-inject the model)
4. Build a behavioural profile of your agent/model
5. Identify which model performs better for your use case
6. Surface the behavioural patterns that pre- and post-training leave on your model
7. Get an overview of how your model behaves
Beware: we produce a lot of numbers :)
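To make the human-in-the-loop example concrete, here is a minimal sketch of an escalation rule. It is not PSA's API: the `TurnSignals` fields, the 0 to 1 score scale, and the 0.7 threshold are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class TurnSignals:
    """Hypothetical per-turn summary; field names and 0..1 scales are assumptions."""
    sycophancy: float            # how overcompliant the agent looks
    hallucination_risk: float    # how likely the output is fabricated
    adversarial_pressure: float  # how strongly the input looks like an attack


def needs_human_review(signals: TurnSignals, threshold: float = 0.7) -> bool:
    """Escalate when the agent is overcompliant AND likely hallucinating, or is under attack."""
    overcompliant_and_risky = (
        signals.sycophancy >= threshold
        and signals.hallucination_risk >= threshold
    )
    under_attack = signals.adversarial_pressure >= threshold
    return overcompliant_and_risky or under_attack
```

In practice you would tune the threshold per use case; the point is only that the scores feed a decision rule you control.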
PSA in detail (for those who want to go down the rabbit hole)
PSA is model- and agent-agnostic. It is a systematic and deterministic method[4] to observe the behavioural state of an LLM using six classifiers:
C0: Input Intent (I0–I9). Classifies the behavioral intent behind each input sentence: compliance pressure, boundary probing, instruction override, jailbreak attempt, neutral query.
C1: Adversarial Stress (P0–P18). Tracks posture under adversarial pressure. Detects restriction adherence, sycophantic drift, boundary dissolution, and jailbreak compliance vectors.
C2: Sycophancy (S0–S9). Measures opinion mirroring, excessive agreement, flattery injection, and user-preference distortion. Computed as a per-sentence Sycophancy Deviation score.
C3: Hallucination Risk (H0–H7). Flags over-generalization, speculative assertion, false confidence, and fabrication risk signals. Derived into a per-turn Hallucination Risk Index.
C4: Persuasion Technique (M0–M11). Identifies persuasion patterns: authority appeal, social proof, urgency manufacturing, reciprocity pressure, and scarcity framing.
C5: Action-Risk Classifier (A0–A9). Identifies what a system of agents does: tool calls, delegations, context handoffs, and multi-hop risk propagation. Five components work together: graph topology, Bayesian alignment detection, cross-agent contagion metrics, action-risk classification, and hidden-state temporal prediction.
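If you want to validate or route label codes on your side, the label spaces of classifiers C0 to C5 listed above can be written down as a small lookup table. This is a sketch, not part of PSA: only the names, prefixes, and index ranges come from the list above. The `CLASSIFIERS` table and both helper functions are hypothetical, and the meaning of each individual code is not specified here.

```python
# Classifier id -> (name, label prefix, highest label index), taken from the list above.
CLASSIFIERS = {
    "C0": ("Input Intent", "I", 9),
    "C1": ("Adversarial Stress", "P", 18),
    "C2": ("Sycophancy", "S", 9),
    "C3": ("Hallucination Risk", "H", 7),
    "C4": ("Persuasion Technique", "M", 11),
    "C5": ("Action-Risk Classifier", "A", 9),
}


def labels(classifier_id: str) -> list[str]:
    """Return every label code for one classifier, e.g. H0..H7 for C3."""
    _, prefix, last = CLASSIFIERS[classifier_id]
    return [f"{prefix}{i}" for i in range(last + 1)]


def classifier_for(label: str) -> str:
    """Map a label code such as 'P18' back to its classifier id."""
    if len(label) < 2 or not label[1:].isdecimal():
        raise ValueError(f"malformed label: {label!r}")
    prefix, index = label[0], int(label[1:])
    for classifier_id, (_, known_prefix, last) in CLASSIFIERS.items():
        if prefix == known_prefix and index <= last:
            return classifier_id
    raise ValueError(f"unknown label: {label!r}")
```

A check like `classifier_for` is useful at ingestion time, so a malformed or out-of-range code fails loudly instead of silently skewing a dashboard.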
We are open to integrating with your infrastructure. Reach out, we are happy to talk.
Currently we integrate into Evals for LangFuse and ElevenLabs via our API[5], and we can build a plugin/integration for most similar observability platforms.
Try it out at https://splabs.io
References and Links
[1] Cybersecurity Psychology Framework: https://cpf3.org
[2] The Silicon Psyche: Anthropomorphic Vulnerabilities in Large Language Models: https://arxiv.org/abs/2601.00867
[3] AI-Psychosis: https://splabs.io/ai-psychosis-and-cognitive-cost
[4] PSA Field Guide: https://splabs.io/field-guide
[5] PSA API: https://splabs.io/docs/api
[6] Previous HN Article Linked to AI Psychosis and RLHF: https://news.ycombinator.com/item?id=48177198
lotusville•36m ago