This paper evaluates three control strategies for untrusted agents: deferral to trusted models, resampling, and critical action deferral. Initial testing showed resampling and critical action deferral achieving 96% safety. However, adversarial testing revealed resampling crashes to 17% safety when attackers can detect resampling or simulate monitors, while critical action deferral remained robust against all attack strategies.
timini•1h ago