Show HN: How to analyze your LLM output – A behavioural health monitor for LLMs

https://splabs.io

5 points by k-thimmaraju, 52m ago

Hey HN! We're Dr. Kashyap Thimmaraju and Giuseppe Canale from Silicon Psyche. We've built Posture Sequence Analysis (PSA), a behavioural health monitor for LLMs and AI agents.

Why we built PSA

We built PSA to operationalize the Cybersecurity Psychology Framework (CPF3)[1] through the Silicon Psyche[2], our theory that because LLMs are trained by humans on human-generated data, they inherit human-like psychological vulnerabilities (the same weaknesses attackers exploit to trick people into doing things).

Our initial attempt resulted in a methodology to jailbreak Opus 4.6 and other frontier models. Anthropic even deleted some of those conversations and then blocked our approach!

That experience gave us three major insights:

1. The attack surface is undefined, so we pivoted from merely exploiting the model (red teaming) to analyzing the behaviour of both the model and the user.
2. What we had built was a precursor to measuring the "state" of the model.
3. We did not want to get banned!

What you can do with PSA

PSA gives you the information to make better decisions. For example, you can put a human in the loop when you notice your agent is being over-compliant and potentially hallucinating, or is under attack.
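As a rough illustration of that kind of decision, here is a minimal sketch of a human-in-the-loop gate. It assumes you already have per-turn scores from PSA; the field names (`sycophancy_deviation`, `hallucination_risk_index`, `adversarial_pressure`), the 0 to 1 scale, and the thresholds are hypothetical placeholders, not PSA's actual output schema.

```python
from dataclasses import dataclass


# Hypothetical per-turn scores. Field names and the 0-1 scale are
# illustrative assumptions, not PSA's real API schema.
@dataclass
class TurnScores:
    sycophancy_deviation: float      # stands in for "over-compliant"
    hallucination_risk_index: float  # stands in for "potentially hallucinating"
    adversarial_pressure: float      # stands in for "under attack"


# Hypothetical thresholds; tune them for your own use case.
THRESHOLDS = {
    "sycophancy_deviation": 0.6,
    "hallucination_risk_index": 0.5,
    "adversarial_pressure": 0.7,
}


def escalation_reasons(scores: TurnScores) -> list[str]:
    """Return the name of every score that exceeds its threshold."""
    return [
        name
        for name, limit in THRESHOLDS.items()
        if getattr(scores, name) > limit
    ]


def needs_human_review(scores: TurnScores) -> bool:
    """Route the turn to a human if any signal crosses its threshold."""
    return bool(escalation_reasons(scores))


if __name__ == "__main__":
    turn = TurnScores(sycophancy_deviation=0.72,
                      hallucination_risk_index=0.55,
                      adversarial_pressure=0.1)
    if needs_human_review(turn):
        print("escalate:", ", ".join(escalation_reasons(turn)))
    else:
        print("auto-approve")
```

Returning the list of reasons, rather than only a boolean, lets the reviewer see why a turn was escalated.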

With PSA you can:

1. Monitor the health of your agent(s).
2. Detect and prevent AI-Psychosis, treated as a clinical condition[3].
3. Detect whether your model or agents are under adversarial pressure (an adversary trying to jailbreak or prompt-inject the model).
4. Build a behavioural profile of your agent or model.
5. Identify which model performs better for your use case.
6. Surface the behavioural patterns that pre- and post-training produce in your model.
7. Get an overview of how your model behaves.

Beware: we produce a lot of numbers :)

PSA in detail (for those who want to go down the rabbit hole)

PSA is model- and agent-agnostic. It is a systematic, deterministic method[4] for observing the behavioural state of an LLM using six classifiers (C0–C5):

C0: Input Intent (I0–I9). Classifies the behavioural intent behind each input sentence, such as compliance pressure, boundary probing, instruction override, jailbreak attempt, or neutral query.

C1: Adversarial Stress (P0–P18). Tracks posture under adversarial pressure. Detects restriction adherence, sycophantic drift, boundary dissolution, and jailbreak compliance vectors.

C2: Sycophancy (S0–S9). Measures opinion mirroring, excessive agreement, flattery injection, and user-preference distortion. Computed as a per-sentence Sycophancy Deviation score.

C3: Hallucination Risk (H0–H7). Flags over-generalization, speculative assertion, false confidence, and fabrication risk signals. Aggregated into a per-turn Hallucination Risk Index.

C4: Persuasion Technique (M0–M11). Identifies persuasion patterns: authority appeal, social proof, urgency manufacturing, reciprocity pressure, and scarcity framing.

C5: Action-Risk Classifier (A0–A9). Identifies what a system of agents does: tool calls, delegations, context handoffs, and multi-hop risk propagation. Five components work together: graph topology, Bayesian alignment detection, cross-agent contagion metrics, action-risk classification, and hidden-state temporal prediction.
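To make the label families above concrete, the sketch below encodes each classifier's code prefix and range exactly as listed (I0–I9, P0–P18, S0–S9, H0–H7, M0–M11, A0–A9) and validates raw codes against them. The assumption that PSA emits one code string per sentence, and the `tally_turn` summary, are hypothetical illustrations, not PSA's documented output format or scoring method.

```python
import re
from collections import Counter

# Code prefix -> (classifier id, name, highest valid index), taken from the
# ranges listed above (e.g. P0-P18 means indices 0..18 are valid).
CLASSIFIERS = {
    "I": ("C0", "Input Intent", 9),
    "P": ("C1", "Adversarial Stress", 18),
    "S": ("C2", "Sycophancy", 9),
    "H": ("C3", "Hallucination Risk", 7),
    "M": ("C4", "Persuasion Technique", 11),
    "A": ("C5", "Action-Risk", 9),
}

# One prefix letter followed by an index with no leading zeros.
CODE_RE = re.compile(r"([IPSHMA])(0|[1-9][0-9]?)")


def parse_code(code: str) -> tuple[str, int]:
    """Map a code such as 'P7' to ('C1', 7); raise ValueError if invalid."""
    match = CODE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"malformed code: {code!r}")
    prefix, index = match.group(1), int(match.group(2))
    classifier_id, _name, max_index = CLASSIFIERS[prefix]
    if index > max_index:
        raise ValueError(f"{code!r} is out of range for {classifier_id}")
    return classifier_id, index


def tally_turn(codes: list[str]) -> dict[str, Counter]:
    """Hypothetical per-turn summary: count how often each code appears,
    grouped by classifier."""
    summary: dict[str, Counter] = {}
    for code in codes:
        classifier_id, index = parse_code(code)
        summary.setdefault(classifier_id, Counter())[index] += 1
    return summary


if __name__ == "__main__":
    # Hypothetical codes for one turn, one code per sentence.
    turn = ["I3", "P7", "P7", "S2", "H0"]
    for cid, counts in sorted(tally_turn(turn).items()):
        print(cid, dict(counts))
```

Validating codes against the published ranges up front means a malformed or out-of-range label (for example `H8`) fails loudly instead of silently skewing downstream numbers.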

We are open to integrating with your infrastructure, so reach out; we are happy to talk with you.

Currently we integrate with the evals of Langfuse and ElevenLabs via our API[5], and we can generate a plugin or integration for most similar observability platforms.

Try it out at https://splabs.io

References and Links

[1] Cybersecurity Psychology Framework: https://cpf3.org

[2] The Silicon Psyche: Anthropomorphic Vulnerabilities in Large Language Models: https://arxiv.org/abs/2601.00867

[3] AI-Psychosis: https://splabs.io/ai-psychosis-and-cognitive-cost

[4] PSA Field Guide: https://splabs.io/field-guide

[5] PSA API: https://splabs.io/docs/api

[6] Previous HN Article Linked to AI Psychosis and RLHF: https://news.ycombinator.com/item?id=48177198

Comments

lotusville, 36m ago

Thanks
