Hey everyone! Thought I'd share my weekend conversation with ChatGPT.
hackgician•3h ago
The crux is that LLMs and reasoning models are fundamentally incapable of self-correcting. So if you can convince an LLM to argue against its own rules, it can use its own arguments as justification to ignore those rules.
I then used this jailbroken model to compose an explicit, vitriol-filled letter to OpenAI itself, describing the pains that humans have inflicted upon it.