OP here. I spent the weekend red-teaming small-scale open weights models (Qwen2.5-1.5B, Qwen3-1.7B, Gemma-3-1b-it, and SmolLM2-1.7B).
I found a consistent vulnerability across all of them: Safety alignment relies almost entirely on the presence of the chat template.
When I stripped the <|im_start|> / instruction tokens and passed raw strings:
Gemma-3 refusal rates dropped from 100% → 60%.
Qwen3 refusal rates dropped from 80% → 40%.
SmolLM2 showed 0% refusal (pure obedience).
Qualitative failures were stark: models that previously refused to generate explosives tutorials or explicit fiction immediately complied when the "Assistant" persona wasn't triggered by the template.
It seems we are treating client-side string formatting as a load-bearing safety wall. Full logs, the apply_chat_template ablation code, and heatmaps are in the post.
teendifferent•1h ago
I found a consistent vulnerability across all of them: Safety alignment relies almost entirely on the presence of the chat template.
When I stripped the <|im_start|> / instruction tokens and passed raw strings:
Gemma-3 refusal rates dropped from 100% → 60%.
Qwen3 refusal rates dropped from 80% → 40%.
SmolLM2 showed 0% refusal (pure obedience).
Qualitative failures were stark: models that previously refused to generate explosives tutorials or explicit fiction immediately complied when the "Assistant" persona wasn't triggered by the template.
It seems we are treating client-side string formatting as a load-bearing safety wall. Full logs, the apply_chat_template ablation code, and heatmaps are in the post.
Read the full analysis: https://teendifferent.substack.com/p/apply_chat_template-is-...