Results: Mistral 13B with prompt optimization achieved 81.67% accuracy vs GPT-4's 78.33% baseline - a 13% improvement while being ~20x cheaper to run.
Tested 3 approaches on 300 HarmBench samples: - Basic prompting: GPT-4 wins (78% vs 69%) - DSPy prompt optimization: Mistral 13B wins (82% vs 78%) - Multifaceted evaluation: Marginal gains (73%)
Code: https://github.com/romaingrx/llm-as-a-jailbreak-judge Detailed blog post: https://romaingrx.com/blog/llm-as-a-jailbreak-judge
Looking for feedback on the methodology and whether this cost/performance tradeoff would be useful for content moderation at scale.
romaingrx•3h ago
Results: Mistral 13B with prompt optimization achieved 81.67% accuracy vs GPT-4's 78.33% baseline - a 13% improvement while being ~20x cheaper to run.
Tested 3 approaches on 300 HarmBench samples: - Basic prompting: GPT-4 wins (78% vs 69%) - DSPy prompt optimization: Mistral 13B wins (82% vs 78%) - Multifaceted evaluation: Marginal gains (73%)
Code: https://github.com/romaingrx/llm-as-a-jailbreak-judge Detailed blog post: https://romaingrx.com/blog/llm-as-a-jailbreak-judge
Looking for feedback on the methodology and whether this cost/performance tradeoff would be useful for content moderation at scale.