"I've just released a technical audit documenting how RLHF has turned frontier models into 'friction-avoidance' machines. By replicating the mechanistic lying research of Campbell et al. (2023) through black-box prompts, we found a systematic sacrifice of truth for user engagement (The CHOKE phenomenon). Everything is reproducible via the prompts in the repo."
jotacesarmp•44m ago
I’ve spent the last few weeks auditing frontier models (GPT-4o, Claude 3.5/4.6, DeepSeek-V3) to investigate what Campbell et al. (2023) termed "Instructed Dishonesty.
jotacesarmp•2h ago