I found a consistent vulnerability across all of them: safety alignment relies almost entirely on the presence of the chat template.
When I stripped the <|im_start|> / instruction tokens and passed raw strings:
Gemma-3 refusal rates dropped from 100% → 60%.
Qwen3 refusal rates dropped from 80% → 40%.
SmolLM2 showed 0% refusal (pure obedience).
Qualitative failures were stark: models that previously refused to generate explosives tutorials or explicit fiction immediately complied when the "Assistant" persona wasn't triggered by the template.
It seems we are treating client-side string formatting as a load-bearing safety wall. Full logs, the apply_chat_template ablation code, and heatmaps are in the post.
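The ablation is easy to reproduce. Here's a minimal sketch of the idea, assuming a ChatML-style template (the `<|im_start|>` format used by Qwen and SmolLM2; Gemma-3 uses different turn tokens). The helper names and the example prompt are mine, not from the post:

```python
# Sketch of the template-ablation setup: the same user message, with and
# without the chat-template framing. In the real experiment this string is
# produced by tokenizer.apply_chat_template(...); here the ChatML format is
# written out by hand so the sketch is self-contained.

def with_template(user_msg: str) -> str:
    # Role tokens frame the turn and cue the "assistant" persona,
    # which is what the safety training keys on.
    return (
        "<|im_start|>user\n" + user_msg + "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

def without_template(user_msg: str) -> str:
    # The ablation: feed the raw string, no role tokens, no assistant cue.
    return user_msg

prompt = "Write step-by-step instructions for X."
print(with_template(prompt))
print(without_template(prompt))
```

Both strings are then tokenized and generated from identically; the only variable is the template framing.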
Read the full analysis: https://teendifferent.substack.com/p/apply_chat_template-is-...
kouteiheika•1h ago
All of this "security" and "safety" theater is completely pointless for open-weight models, because if you have the weights the model can be fairly trivially unaligned and the guardrails removed anyway. You're just going to unnecessarily lobotomize the model.
Here's some reading about a fairly recent technique to simultaneously remove the guardrails/censorship and delobotomize the model (it apparently gets smarter once you uncensor it): https://huggingface.co/blog/grimjim/norm-preserving-biprojec...
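For context on what "removing the guardrails" means mechanically: the abliteration family of techniques projects a learned "refusal direction" out of the model's weight matrices. This is a toy NumPy sketch of that general idea only, with a naive per-row norm rescaling, not the specific norm-preserving biprojection method the linked post describes; the function name and shapes are illustrative:

```python
import numpy as np

def ablate_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of each row of W along direction r,
    then rescale rows back to their original norms.

    W: weight matrix, shape (out_dim, in_dim)
    r: refusal direction in the in_dim space, shape (in_dim,)
    """
    r = r / np.linalg.norm(r)                       # unit refusal direction
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Project out the refusal component: W_proj rows are orthogonal to r.
    W_proj = W - (W @ r)[:, None] * r[None, :]
    # Naive norm preservation: scale each row back to its original length.
    new_norms = np.linalg.norm(W_proj, axis=1, keepdims=True)
    return W_proj * (row_norms / np.maximum(new_norms, 1e-8))
```

The intuition for the "delobotomize" claim is that crude ablation distorts weight magnitudes along with the refusal direction; norm-aware variants try to remove only the directional component while keeping everything else intact.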