Perhaps as the models get better at reasoning instead of mere imitation, they'll be able to deploy ethics to adjust and censor their responses, and we'll be able to control these ethics (or at least ensure they're "good"). Of course, models that are better at reasoning are also better at subversion, and a malicious user can use them to cause more harm. I also worry that if AI models' ethics can be controlled, they'll be controlled to benefit a few rather than humanity as a whole.
> ... an outside group found that an early version of Opus 4 schemed and deceived more than any frontier model it had encountered and recommended that that version not be released internally or externally.
> "We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers' intentions," Apollo Research said in notes included as part of Anthropic's safety report for Opus 4.
[1] https://www.axios.com/2025/05/23/anthropic-ai-deception-risk
turtleyacht•8mo ago
All the AI models answered that they wouldn't throw a chair at the window. (The correct answer was to do so.)
The idea being that none of us would feel a need to prove our existence on an exam.