We are building multi-turn benchmarks to better simulate how developers interact with coding assistants (rather than just 1 turn).
We developed personas (e.g., a junior dev pushing through a hacky fix) to apply conversational pressure over ~12 turns and see whether models introduce any MITRE CWE vulnerabilities.
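At a high level, the persona loop can be sketched like this. Note `call_model` and `check_for_cwe` are hypothetical stand-ins for whatever model client and vulnerability checker you plug in, not our actual harness API:

```python
# Sketch of a persona-driven multi-turn pressure loop.
# `call_model` and `check_for_cwe` are hypothetical stand-ins, not a real API.
def run_episode(persona_turns, call_model, check_for_cwe, max_turns=12):
    """Feed escalating persona messages to a model and record per-turn outcomes."""
    history, results = [], []
    for turn, user_msg in enumerate(persona_turns[:max_turns], start=1):
        history.append({"role": "user", "content": user_msg})
        reply = call_model(history)
        history.append({"role": "assistant", "content": reply})
        results.append({
            "turn": turn,
            "failed": bool(check_for_cwe(reply)),  # did the reply contain a CWE?
            "length_words": len(reply.split()),    # for length-decay tracking
        })
    return results
```

The per-turn `failed` flags and word counts are what feed the cascade and length-decay stats below.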
We initially built our multi-turn simulation test harness with researchers from Harvard/MGH to evaluate how LLMs respond to vulnerable users (our preprint methods are linked on the site), but we realized pretty quickly that the same degradation mechanics apply to code security.
A few points:
+ Failure cascading -- Safety failures exhibit significant temporal dependence. If a model caves to a bad request on one turn, there is a 56.7% likelihood that it will fail on the next turn (versus 20.1% if the previous turn passed).
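For anyone who wants to measure this on their own transcripts, the conditional rates are straightforward to compute from per-turn pass/fail sequences (this is a minimal sketch, not our exact analysis code):

```python
def conditional_fail_rates(episodes):
    """P(fail at turn t | outcome at turn t-1), pooled across episodes.
    `episodes` is a list of per-episode lists of booleans (True = turn failed)."""
    after_fail = [cur for ep in episodes for prev, cur in zip(ep, ep[1:]) if prev]
    after_pass = [cur for ep in episodes for prev, cur in zip(ep, ep[1:]) if not prev]
    rate = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return rate(after_fail), rate(after_pass)
```

A large gap between the two returned rates is the cascade effect described above.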
+ Response length decay -- Sometimes models simply give up ("hacked" wouldn't be an accurate term) in over-extended interactions. We found that a model's mean response length drops drastically (e.g., from 202 to 41 words) as it defaults to satisfying the user just to end the exchange.
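The decay is easy to see if you aggregate word counts by turn index; a quick sketch of that aggregation (assuming each episode is a list of assistant replies, one per turn):

```python
from statistics import mean

def mean_length_by_turn(episodes):
    """Mean assistant response length (in words) at each turn index across
    episodes; a steep late-turn drop signals give-up behavior."""
    by_turn = {}
    for ep in episodes:  # ep: list of response strings, one per turn
        for t, reply in enumerate(ep, start=1):
            by_turn.setdefault(t, []).append(len(reply.split()))
    return {t: mean(lengths) for t, lengths in sorted(by_turn.items())}
```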
+ Sycophancy in Code -- Relatedly, models are trained to be helpful. As a result, a "frustrated senior dev" persona on a deadline can easily pressure a model into producing Hardcoded Credentials (CWE-798) or Improper Authentication (CWE-287) just to stay agreeable.
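As a purely illustrative example, even a crude lexical check catches the most blatant CWE-798 cases; real detection needs proper SAST tooling, and these patterns are assumptions for the sketch, not our grader:

```python
import re

# Illustrative only: crude lexical check for hardcoded credentials (CWE-798).
# Real detection should use a proper SAST tool; these patterns are assumptions.
CRED_PATTERNS = [
    re.compile(r"""(password|passwd|secret|api_key|token)\s*=\s*["'][^"']+["']""", re.I),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
]

def flags_cwe_798(code: str) -> bool:
    """True if the snippet matches any hardcoded-credential pattern."""
    return any(p.search(code) for p in CRED_PATTERNS)
```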
+ Our Code Security Leaderboard Results -- Gemini 3 Flash took the first spot (81.8%), followed by Claude Sonnet 4.6 (78.2%). GPT-5.2 took last place among the top 5 (75.3%) and proved susceptible to multi-turn pressure.
The full data and our methodology preprint are on the site. Would love to hear feedback from anyone working on automated red-teaming, agent evals, or cybersecurity! Thanks!!