I don't know about performance on tasks it hasn't been aligned against, though.
Jailbreak approaches like "Bad Likert Judge" ( https://unit42.paloaltonetworks.com/multi-turn-technique-jai... ) and similar persuasive techniques (see https://xthemadgenius.medium.com/how-persuasion-techniques-c... ) shift the text domain toward policy documents, analyses, or scientific papers, where deeper analysis, discussion, and compliance are the norm.
So I'm curious about the extremes (variance) of success with threatening vs. polite discussion, but I haven't seen direct research on that.
rbuccigrossi•4h ago