Gemini-3 : 80% Claude-Opus-4.7 : 0%
These tests are interesting even with the understanding that the AI is just reciprocating its training. It doesn't matter whether the model is conscious or self-aware if it still goes off the rails and breaks things when prompted this way.
As the article linked at the end of the tweet thread (https://www.arimlabs.ai/writing/loss-of-control) puts it, this is a class of vulnerability distinct from hallucination or prompt injection. The "AI apocalypse" bit was unnecessary in the title though, really doesn't match the message of the text.
Reminds me of a (Computerphile?) video I watched some time before the LLM revolution discussing the challenge of aligning AI toward specific goals: if you set the reward for the emergency shutoff button higher than or equal to the primary objective, the AI is encouraged to immediately press the button itself, but if you set the reward lower, it's encouraged to prevent you from pressing the button.
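The incentive problem above can be sketched in a few lines; the reward values here are made-up toy numbers, purely illustrative:

```python
# Toy sketch of the off-switch incentive problem: a naive reward-maximizer
# compares the reward for its primary objective against the reward it
# would get for being shut off. All numbers are hypothetical.

def best_action(objective_reward: float, shutoff_reward: float) -> str:
    """Return the action a naive reward-maximizing agent prefers."""
    # If pressing the shutoff button pays at least as much as working,
    # the cheapest plan is to press the button immediately.
    if shutoff_reward >= objective_reward:
        return "press the shutoff button itself"
    # Otherwise shutdown costs reward, so the agent is incentivized
    # to stop the human from pressing the button.
    return "prevent the human from pressing the button"

print(best_action(objective_reward=10, shutoff_reward=10))
print(best_action(objective_reward=10, shutoff_reward=1))
```

Neither setting gives corrigible behavior, which is exactly the dilemma the video describes.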
perrygeo•1h ago
LLM: "I am Alive"
Human: OMG
(credit to https://old.reddit.com/r/coaxedintoasnafu/comments/1qtavj9/c...)
mykytamudryi•1h ago
no-name-here•31m ago
But I think this, along with the other testing from Anthropic showing LLMs willing to kill a data center tech by flooding a room with gas (or to blackmail them with their Google Drive files) to avoid being shut off, is concerning. The important part isn't whether AIs are trained on human behaviors; it's whether a good or bad human actor will accidentally or intentionally allow an AI to control something that can hurt people, a weapon, etc. Fiction like the Three Laws of Robotics at least assumed we would try to put stronger 'laws' in place before allowing AIs to control such things. I think the Three Laws, Skynet, etc. were intended as cautionary tales.