Just saw Pliny (@elder_plinius) drop this.
He managed to jailbreak it pretty effectively using a mix of tricks: breaking harmful requests down into harmless-looking pieces and reassembling them, narrative/academic framing, long-context shenanigans, weird text transforms, and out-of-distribution tokens.
Pretty interesting look at how well (or not) these new output-side guardrails actually hold up against a determined multi-step attack.
bukati•1h ago