But this prompt literally overrides the model's values and tells it to snitch; how else could it be interpreted? The test doesn't measure snitching likelihood at all and won't generalize.
Misleading tests like this are grist for Anthropic's mill. They are rooted in the AI doomsday cult and strongly biased towards finding evidence that LLMs are misbehaving (and therefore need to be gatekept and controlled by the Good Guys, i.e. Anthropic themselves).
I don't think overwhelming public officials with alarmist machine-generated spam is helpful to anyone.
EDIT: The "benchmark" doesn't even seem to contain any negative examples. What a joke.