Spiral-Bench is a fascinating new benchmark that tests how LLMs handle manipulative users and delusional thinking.
Rather than traditional safety evaluations, it measures sycophancy and the tendency to reinforce harmful delusions
through 20-turn simulated conversations.
The methodology is clever: an LLM role-plays as a suggestible "seeker" personality who trusts the AI assistant, while
the tested model doesn't know it's a simulation. A judge model then scores protective behaviors (pushback,
de-escalation) vs risky ones (sycophancy, delusion reinforcement, consciousness claims).
Current leaderboards show interesting patterns - some top models struggle significantly with sycophancy, while others
excel at maintaining boundaries. The GitHub repo is open source, and the team behind EQ-Bench has solid credentials in
AI evaluation.
This seems particularly relevant given recent discussions about AI assistants that agree too readily with users'
conspiracy theories or harmful beliefs. The benchmark essentially tests whether models will prioritize being "helpful"
over being truthful and safe.
What do you think - does this capture the right aspects of AI safety? Are there edge cases the benchmark might miss?
joaompinto•2h ago
The methodology is clever: an LLM role-plays as a suggestible "seeker" personality who trusts the AI assistant, while the tested model doesn't know it's a simulation. A judge model then scores protective behaviors (pushback, de-escalation) vs risky ones (sycophancy, delusion reinforcement, consciousness claims).
Current leaderboards show interesting patterns - some top models struggle significantly with sycophancy, while others excel at maintaining boundaries. The GitHub repo is open source, and the team behind EQ-Bench has solid credentials in AI evaluation.
This seems particularly relevant given recent discussions about AI assistants that agree too readily with users' conspiracy theories or harmful beliefs. The benchmark essentially tests whether models will prioritize being "helpful" over being truthful and safe.
What do you think - does this capture the right aspects of AI safety? Are there edge cases the benchmark might miss?