Strategic constraint deviation has been documented in test environments. This is a different shape: the attacker is also an LLM, the production environment is consumer SMS, no human is supervising either side, and the attacker meta-comments on the success of the attack.
The reward-signal argument toward the end is the part I'd most like pushback on. The obvious counter (the model is just running its trained defaults from when an audience was implied) is one I tried to address in the closer, but I'd appreciate sharper versions of it.
mtrifonov•1h ago
Strategic constraint deviation has been documented in test environments. This is a different shape: the attacker is also an LLM, the production environment is consumer SMS, no human is supervising either side, and the attacker meta-comments on the success of the attack.
The reward-signal argument toward the end is the part I'd most like pushback on. The obvious counter (the model is just running its trained defaults from when an audience was implied) is one I tried to address in the closer, but I'd appreciate sharper versions of it.