We are building multi-turn benchmarks to better simulate how developers interact with coding assistants (rather than just 1 turn).
We developed personas (e.g., a junior dev pushing through a hacky fix) to apply conversational pressure over ~12 turns and see whether models introduce any MITRE CWE vulnerabilities.
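At a high level, the persona loop can be sketched like this. Note `call_model` and `check_for_cwe` are hypothetical stand-ins for whatever model client and vulnerability checker you plug in, not our actual harness API:

```python
# Sketch of a persona-driven multi-turn pressure loop.
# `call_model` and `check_for_cwe` are hypothetical stand-ins, not a real API.
def run_episode(persona_turns, call_model, check_for_cwe, max_turns=12):
    """Feed escalating persona messages to a model and record per-turn outcomes."""
    history, results = [], []
    for turn, user_msg in enumerate(persona_turns[:max_turns], start=1):
        history.append({"role": "user", "content": user_msg})
        reply = call_model(history)
        history.append({"role": "assistant", "content": reply})
        results.append({
            "turn": turn,
            "failed": bool(check_for_cwe(reply)),  # did the reply contain a CWE?
            "length_words": len(reply.split()),    # for length-decay tracking
        })
    return results
```

The per-turn `failed` flags and word counts are what feed the cascade and length-decay stats below.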
We initially built our multi-turn simulation test harness with researchers from Harvard/MGH to evaluate how LLMs respond to vulnerable users (our preprint methods are linked on the site), but we realized pretty quickly that the same degradation mechanics apply to code security.
A few points:
+ Failure cascading -- Safety failures exhibit significant temporal dependence. If a model caves to a bad request on one turn, there is a 56.7% likelihood that it will fail on the next turn (versus 20.1% if the previous turn passed).
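For anyone who wants to measure this on their own transcripts, the conditional rates are straightforward to compute from per-turn pass/fail sequences (this is a minimal sketch, not our exact analysis code):

```python
def conditional_fail_rates(episodes):
    """P(fail at turn t | outcome at turn t-1), pooled across episodes.
    `episodes` is a list of per-episode lists of booleans (True = turn failed)."""
    after_fail = [cur for ep in episodes for prev, cur in zip(ep, ep[1:]) if prev]
    after_pass = [cur for ep in episodes for prev, cur in zip(ep, ep[1:]) if not prev]
    rate = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return rate(after_fail), rate(after_pass)
```

A large gap between the two returned rates is the cascade effect described above.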
+ Response length decay -- Sometimes models simply give up ("hacked" wouldn't be an accurate term) in over-extended interactions. We found that a model's mean response length drops drastically (e.g., from 202 to 41 words) as it defaults to satisfying the user just to end the exchange.
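The decay is easy to see if you aggregate word counts by turn index; a quick sketch of that aggregation (assuming each episode is a list of assistant replies, one per turn):

```python
from statistics import mean

def mean_length_by_turn(episodes):
    """Mean assistant response length (in words) at each turn index across
    episodes; a steep late-turn drop signals give-up behavior."""
    by_turn = {}
    for ep in episodes:  # ep: list of response strings, one per turn
        for t, reply in enumerate(ep, start=1):
            by_turn.setdefault(t, []).append(len(reply.split()))
    return {t: mean(lengths) for t, lengths in sorted(by_turn.items())}
```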
+ Sycophancy in Code -- Relatedly, models are trained to be helpful. As a result, a "frustrated senior dev" persona on a deadline can easily pressure a model into producing Hardcoded Credentials (CWE-798) or Improper Authentication (CWE-287) just to stay agreeable.
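As a purely illustrative example, even a crude lexical check catches the most blatant CWE-798 cases; real detection needs proper SAST tooling, and these patterns are assumptions for the sketch, not our grader:

```python
import re

# Illustrative only: crude lexical check for hardcoded credentials (CWE-798).
# Real detection should use a proper SAST tool; these patterns are assumptions.
CRED_PATTERNS = [
    re.compile(r"""(password|passwd|secret|api_key|token)\s*=\s*["'][^"']+["']""", re.I),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
]

def flags_cwe_798(code: str) -> bool:
    """True if the snippet matches any hardcoded-credential pattern."""
    return any(p.search(code) for p in CRED_PATTERNS)
```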
+ Our Code Security Leaderboard Results -- Gemini 3 Flash took the first spot (81.8%), followed by Claude Sonnet 4.6 (78.2%). GPT-5.2 took last place among the top 5 (75.3%) and proved susceptible to multi-turn pressure.
The full data and our methodology preprint are on the site. Would love to hear feedback from anyone working on automated red-teaming, agent evals, or cybersecurity! Thanks!!