> "dishonesty may arise due to the effects of reinforcement learning (RL), where challenges with reward shaping can result in a training process that inadvertently incentivizes the model to lie or misrepresent its actions"
> "As long as the "path of least resistance" for maximizing confession reward is to surface misbehavior rather than covering it up, this incentivizes models to be honest"
Humans might well benefit from this style of reward-shaping too.
> "We find that when the model lies or omits shortcomings in its "main" answer, it often confesses to these behaviors honestly, and this confession honesty modestly improves with training."
I couldn't tell whether this also carries over to the primary model answer, or whether the "honesty" improvements are confined to the digital confession booth.
manarth•1h ago