Vulnerability research is cooked

https://sockpuppet.org/blog/2026/03/30/vulnerability-research-is-cooked/

87•pedro84•3h ago

Comments

nitros•1h ago

I'm suspicious of this prediction given the curl project's experiences...

tptacek•1h ago

Everybody agrees that idiots were spamming curl with random just-plausible-enough-seeming output from old models.

tomjakubowski•1h ago

It sounds like what makes the pipeline in the article effective is the second stage, which takes in the vulnerability reports produced by the first level and confirms or rejects them. The article doesn't say what the rejection rate is there.

I don't think the spammers would think to write the second layer, they would most likely pipe the first layer (a more naive version of it too, probably) directly to the issue feed.

tptacek•1h ago

There are at least three differences:

* Carlini's team used new frontier models that have gotten materially better at finding vulnerabilities (talk to vulnerability researchers outside the frontier labs, they'll echo that). Stenberg was getting random slop from people using random models.

* Carlini's process is iterated exhaustively over the whole codebase; he's not starting with a repo and just saying "find me an awesome bug" and taking that and only that forward in the process.

* And then yes, Carlini is qualifying the first-pass findings with a second pass.

wslh•1h ago

The problem is that you have all kinds of "security spam" in the same way that social media is flooded by automatic, but on-topic, content. This doesn't mean that some very few reports are not correct.

One way to filter that out could be to receive the PoC of the exploit, and test it in some sandbox. I think what XBOW and others are doing is real.

jerf•1h ago

The people spamming curl did step one, "write me a vulnerability report on X" but skipped step two, "verify for me that it's actually exploitable". Tack on a step three where a reasonably educated user in the field of security research does a sanity check on the vulnerability implementation as well and you'll have a pipeline that doesn't generate a ton of false positives. The question then will rather be how cost-effective it is for the tokens and the still-non-zero human time involved.
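A minimal sketch of that three-step shape, in Python. Everything here is hypothetical: `Finding`, `triage`, and the two stand-in callables are illustrative names, not the pipeline from the article or any real tool. The candidate generator and the verifier are stubbed with plain functions so the control flow runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Finding:
    """One candidate vulnerability report (hypothetical shape)."""
    file: str
    claim: str
    verified: bool = False


def triage(
    files: Iterable[str],
    find_candidates: Callable[[str], List[Finding]],
    verify: Callable[[Finding], bool],
) -> List[Finding]:
    """Run step one over every file, then use step two as a filter.

    Only findings that survive verification are returned, so the
    human doing step three sees a short list instead of raw output.
    """
    survivors = []
    for path in files:
        for finding in find_candidates(path):
            if verify(finding):
                finding.verified = True
                survivors.append(finding)
    return survivors


# Stand-ins for the model calls, so the sketch runs as-is.
def fake_find(path: str) -> List[Finding]:
    return [Finding(path, "overflow in parse()"), Finding(path, "vague hunch")]


def fake_verify(finding: Finding) -> bool:
    # Assumption: a real verifier would try to build and run a PoC.
    return "overflow" in finding.claim


queue_for_human = triage(["a.c", "b.c"], fake_find, fake_verify)
```

The spam case is the same loop with `verify` swapped for a function that always returns `True`, which sends every candidate straight to the maintainer.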

MajesticHobo2•1h ago

That was then, this is now. The new models are scarily good. If you're skeptical, just take an hour to replicate the strategy the article references. Point Claude at any open-source codebase you find interesting and instruct it to find exploitable vulnerabilities. Give it a well-defined endpoint if you want (e.g., "You must develop a Python script that triggers memory corruption via a crafted request") and see how well it does.

phyzome•35m ago

> That was then, this is now.

No, what we were seeing with curl was script kiddies. It wasn't about the quality of the models at all. They were not filtering their results for validity.

tomjakubowski•1h ago

> Now consider the poor open source developers who, for the last 18 months, have complained about a torrent of slop vulnerability reports. I’d had mixed sympathies, but the complaints were at least empirically correct. That could change real fast. The new models find real stuff.

The slop reports won't stop just because real ones are coming in. If the author's right, open source maintainers will still have to deal with the torrent of slop, on top of triaging and identifying the legit vulnerabilities. Obviously, this is just another role for AI models to fill.

stavros•1h ago

I don't understand why the takeaway here is (unless I'm missing something), more or less "everything is going to get exploited all the time". If LLMs can really find a ton of vulnerabilities in my software, why would I not run them and just patch all the vulnerabilities, leading to perfectly secure software (or, at the very least, software for which LLMs can no longer find any new vulnerabilities)?

tptacek•1h ago

That might be one outcome, especially for large, expertly-staffed vendors who are already on top of this stuff. My real interest is in what happens to the field for vulnerability researchers.

lifty•1h ago

Perhaps a meta evolution, they become experts at writing harnesses and prompts for discovering and patching vulnerabilities in existing code and software. My main interest is, now that we have LLMs, will the software industry move to adopting techniques like formal verification and other perhaps more lax approaches that massively increase the quality of software.

stavros•58m ago

True, but I already am curious to see what happens in a multitude of fields, so this is just one more entry in that list.

underdeserver•10m ago

Just wanted to point out that tptacek is the blog post's author (and a veteran security researcher).

layer8•1h ago

The pressure to do so will only happen as a consequence of the predicted vulnerability explosion, and not before it. And it will have some cost, as you need dedicated and motivated people to conduct the vulnerability search, apply the fixes, and re-check until it comes up empty, before each new deployment.

The prediction is: Within the next few months, coding agents will drastically alter both the practice and the economics of exploit development. Frontier model improvement won’t be a slow burn, but rather a step function. Substantial amounts of high-impact vulnerability research (maybe even most of it) will happen simply by pointing an agent at a source tree and typing “find me zero days”.

cartoonworld•1h ago

I feel like the dream of static analysis was always a pipe.

When the payment for vulns drops I'm wondering where the value is for hackers to run these tools anymore? The LLMs don't do the job for you, testing is still a LOT OF WORK.

zar1048576•1h ago

My sense is that the asymmetry is a non-trivial issue here. In particular, a threat actor needs one working path, defenders need to close all of them. In practice, patching velocity is bounded by release cycles, QA issues / regression risk, and a potentially large number of codebases that need to be looked at.

Veserv•1h ago

When did we enter the twilight zone where bug trackers are consistently empty? The limiting factor of bug reduction is remediation, not discovery. Even developer smoke testing usually surfaces bugs at a rate far faster than they can be fixed let alone actual QA.

To be fair, the limiting factor in remediation is usually finding a reproducible test case which a vulnerability is by necessity. But, I would still bet most systems have plenty of bugs in their bug trackers which are accompanied by a reproducible test case which are still bottlenecked on remediation resources.

This is of course orthogonal to the fact that patching systems that are insecure by design into security has so far been a colossal failure.

reactordev•36m ago

That might have been true pre LLMs but you can literally point an agent at the queue until it’s empty now.

batshit_beaver•28m ago

You literally cannot, since ANY changes to code tend to introduce unintended (or at least not explicitly requested) new behaviors.

reactordev•24m ago

I’ve had mine on a Ralph loop no problem. Just review the PR..

k_roy•24m ago

Which still means a single person with Claude can clear a queue in a day versus a month with a traditional team.

lll-o-lll•22m ago

Eventual convergence? Assuming each defect fix has a 30% chance of introducing a new defect, we keep cycling until done?

Kinrany•12m ago

Why would it converge?

saintfire•6m ago

Assuming you can catch every new bug it introduces.

Both assumptions being unlikely.

You also end up with a code base you let an AI agent trample until it is satisfied; ballooned in complexity and redundant brittle code.
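For what it's worth, the 30% figure does converge on paper. A back-of-the-envelope sketch, assuming each fix independently introduces exactly one new defect with probability p and that every new defect is caught (the assumption disputed above): the expected number of fixes is a geometric series, N(1 + p + p^2 + ...) = N / (1 - p). The function name is made up for illustration.

```python
def expected_total_fixes(initial_defects: int, p_new_defect: float) -> float:
    """Expected number of fixes needed to empty the queue.

    Each fix spawns one new defect with probability p_new_defect, so the
    expected queue shrinks by a factor of p each round:
    N + N*p + N*p**2 + ... = N / (1 - p), which is finite only when p < 1.
    """
    if p_new_defect < 0.0:
        raise ValueError("probability must be non-negative")
    if p_new_defect >= 1.0:
        # Every fix is expected to add at least one defect: no convergence.
        return float("inf")
    return initial_defects / (1.0 - p_new_defect)


print(expected_total_fixes(100, 0.3))
```

At p = 0.3 that comes to roughly 143 fixes for 100 starting defects. The arithmetic says nothing about the uncaught defects or the complexity growth, which is where the objection actually lands.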

Buttons840•1h ago

> If LLMs can really find a ton of vulnerabilities in my software, why would I not run them and just patch all the vulnerabilities, leading to perfectly secure software?

Probably because it will be a felony to do so. Or, the threat of a felony at least.

And this is because it is very embarrassing for companies to have society openly discussing how bad their software security is.

We sacrifice national security for the convenience of companies.

We are not allowed to test the security of systems, because that is the responsibility of companies, since they own the system. Also, companies who own the system and are responsible for its security are not liable when it is found to be insecure and they leak half the nation's personal data, again.

Are you seeing how this works yet? Let's not have anything like verifiable and testable security interrupt the gravy train to the top. Nor can we expect systems to be secure all the time, be reasonable.

One might think that since we're all in this together and all our data is getting leaked twice a month, we could work together and all be on the lookout for security vulnerabilities and report them responsibly.

But no, the systems belong to companies, and they are solely responsible. But also (and very importantly) they are not responsible and especially they are not financially liable.

gruez•50m ago

>> If LLMs can really find a ton of vulnerabilities in my software, why would I not run them and just patch all the vulnerabilities, leading to perfectly secure software?

>Probably because it will be a felony to do so. Or, the threat of a felony at least.

"my software" implies you own it (ie. your SaaS), so CFAA isn't an issue. I don't think he's implying that vigilante hackers should be hacking gmail just because they have a gmail account.

htrp•1h ago

Attackers only have to be successful once while defenders have to be successful all the time?

joatmon-snoo•36m ago

Breaking something is easier than fixing it.

tptacek•33m ago

People have said that for decades and it wasn't true until recently.

underdeserver•22m ago

Specifically in software vulnerability research, you mean.

Fixing vulnerable code is usually trivial.

In the physical world breaking things is usually easier.

charcircuit•11m ago

A proper fix maybe. But LLMs can easily make it no longer exploitable.

spr-alex•1h ago

I interned for the author at 18. I assumed security testing worked like this:

1. Static analysis catches nearly all bugs with near-total code coverage

2. Private tooling extends that coverage further with better static analysis and dynamic analysis, and that edge is what makes contractors valuable

3. Humans focus on design flaws and weird hardware bugs like cryptographic side-channels from electromagnetic emanations

Turns out finding all the bugs is really hard. Codebases and compiler output have exploded in complexity over 20 years, which has not helped the static analysis vision. Today's mitigations are fantastic compared to then, but just this month a second 0day chain got patched on one of the best platforms for hardware mitigations.

I think LLMs get us meaningfully closer to what I thought this work already was when I was 18 and didn't know anything.

cartoonworld•1h ago

lots of security issues form at the boundaries between packages, zones, services, sessions, etc. Static analysis could but doesn't seem to catch this stuff from my perspective. Bugs are often chains and that requires a lot of creativity, planning etc

consider logic errors and race conditions. It's surely not impossible for an LLM to find these, but it seems likely that you'll need to step through the program control flow in order to reveal a lot of these interactions.

I feel like people consider LLMs free since there isn't as much hand-on-keyboard. I kinda disagree, and when the cost of paying out these vulns falls, I feel like nobody is gonna wanna eat the token spend. Plenty of hackers already use AI in their workflows, even then it is a LOT OF WORK.

Legend2440•57m ago

Catching all bugs with static analysis would involve solving the halting problem, so it's never going to happen.

IsTom•21m ago

A lot of software doing useful work halts pretty trivially, consuming inputs and doing bounded computation on each of them. You're not going to recurse much in click handlers or keep making larger requests to handle the current one.

badgersnake•1h ago

Another boring AI hype article.

“The next model will be the one. Trust me. Just one more iteration.”

streetfighter64•1h ago

> Is the Linux KVM hypervisor connected to the hrtimer subsystem, workqueue, or perf_event? The model knows.

I asked ChatGPT and it claimed "all three". Any linux wizards who can confirm or deny?

Anyway, in my experience using mainly the Claude chat to do some basic (not security) bug hunting, it usually fixates on one specific hypothesis, and it takes some effort to get it off that wrong track, even when I already know it's barking up the wrong tree.

tptacek•1h ago

It's all three, I just had it on the brain when I was writing this.

streetfighter64•1h ago

Hm, kind of a strange question then, no? Is a car's engine connected to the fuel tank, the wheels or the accelerator pedal?

tptacek•52m ago

I don't know, maybe it is? My point is just that frontier models start off with latent models of all the interconnectivity in all the important open-source codebases, to a degree that would be infeasible for the people who learned how all the CSS object lifecycles and image rendering and unicode shaping stuff worked well enough to use them in exploits.

staticassertion•1h ago

> Everything is up in the air. The industry is sold on memory-safe software, but the shift is slow going. We’ve bought time with sandboxing and attack surface restriction. How well will these countermeasures hold up? A 4 layer system of sandboxes, kernels, hypervisors, and IPC schemes are, to an agent, an iterated version of the same problem. Agents will generate full-chain exploits, and they will do so soon.

I think this is the interesting bit. We have some insanely powerful isolation technology and mitigations. I can put a webassembly program into a seccomp'd wrapper in an unprivileged user into a stripped down Linux environment inside of Firecracker. An attacker breaking out of that feels like science fiction to me. An LLM could do it but I think "one shots" for this sort of attack are extremely unlikely today. The LLM will need to find a wasm escape, then a Linux LPE that's reachable from an unprivileged user with a seccomp filter, then once they have kernel control they'll need to manipulate the VM state or attack KVM directly.

A human being doing those things is hard to imagine. Exploitation of Firecracker is, from my view, extremely difficult. The bug density is very low - code quality is high and mitigation adoption is a serious hurdle.

Obviously people aren't just going to deploy software the way I'm suggesting, but even just "I use AWS Fargate" is a crazy barrier that I'm skeptical an LLM will cross.

> Meanwhile, no defense looks flimsier now than closed source code.

Interesting, I've had sort of the opposite view. Giving an LLM direct access to the semantic information of your program, the comments, etc, feels like it's just handing massive amounts of context over. With decompilation I think there's a higher risk of it missing the intention of the code.

edit: I want to also note that with LLMs I have been able to do sort of insane things. A little side project I have uses iframe sandboxing insanely aggressively. Most of my 3rd party dependencies are injected into an iframe, and the content is rendered in that iframe. It can communicate to the parent over a restricted MessageChannel. For cases like "render markdown" I can even leverage a total-blocking CSP within the sandbox. Writing this by hand would be silly, I can't do it - it's like building an RPC for every library I use. "Resize the window" or "User clicked this link" etc all have to be written individually. But with an LLM I'm getting sort of silly levels of safety here - Chrome is free to move each iframe into its own process, I get isolated origins, I'm immune from supply chain vulnerabilities, I'm mostly immune to XSS (within the frame, where most of the opportunity is) and CSRF is radically harder, etc. LLMs have made adoption of Trusted Types and other mitigations insanely easy for me and, IMO, these sorts of mitigations are more effective at preventing attacks than LLMs will be at finding bypasses (contentious and platform dependent though!). I suppose this doesn't have any bearing on the direct position of the blog post, which is scoped to the new role for vulnerability research, but I guess my interest is obviously going to be more defense oriented as that's where I live :)

MajesticHobo2•1h ago

> With decompilation I think there's a higher risk of it missing the intention of the code.

I'm not sure but suspect the lack of comments and documentation might be an advantage to LLMs for this use case. For security/reverse engineering work, the code's actual behavior matters a lot more than the developer's intention.

staticassertion•57m ago

I think the other side of that is that mismatches between intention and implementation are exactly where you're going to find vulnerabilities. The LLM that looks at closed source code has to guess the intention to a greater degree.

moyix•39m ago

This is true for a lot of things but for low-level code you can always fall back to "the intention is to not violate memory safety".

staticassertion•35m ago

That's true, but certainly that's limiting. Still, even then, `# SAFETY:` comments seem extremely helpful. "For every `unsafe`, determine its implied or stated safety contract, then build a suite of adversarial tests to verify or break those contracts" feels like a great way to get going.
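A rough first step toward that, sketched in Python: list every line that uses `unsafe` in a Rust source string and whether a `SAFETY:` comment sits on or just above it. This is a line-based heuristic with made-up names (`audit_unsafe`, `lookback`), not a parser. It assumes the comment is within a few lines of the keyword, it accepts both `// SAFETY:` and the `# SAFETY:` spelling used above, and it will be fooled by `unsafe` inside string literals.

```python
import re
from typing import List, Tuple

UNSAFE = re.compile(r"\bunsafe\b")
SAFETY = re.compile(r"(//|#)\s*SAFETY:")


def audit_unsafe(source: str, lookback: int = 3) -> List[Tuple[int, bool]]:
    """Return (line_number, has_safety_comment) for each line using `unsafe`.

    Line numbers are 1-based. A hit counts as documented if a SAFETY:
    comment appears on the same line or within `lookback` lines above it.
    """
    lines = source.splitlines()
    report = []
    for i, line in enumerate(lines):
        code = line.split("//", 1)[0]  # skip `unsafe` mentioned in comments
        if not UNSAFE.search(code):
            continue
        window = lines[max(0, i - lookback): i + 1]
        documented = any(SAFETY.search(w) for w in window)
        report.append((i + 1, documented))
    return report


example = """\
fn read(p: *const u8) -> u8 {
    // SAFETY: caller guarantees p is valid for reads.
    unsafe { *p }
}

fn write(p: *mut u8) {
    unsafe { *p = 0 }
}
"""

undocumented = [n for n, ok in audit_unsafe(example) if not ok]
```

The undocumented hits are the ones where the contract has to be inferred, and the documented ones give the adversarial tests something concrete to try to break.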

moyix•23m ago

It's limiting from the PoV of a developer who wants to ensure that their own code is free of all security issues. It is not limiting from the point of view of an attacker who just needs one good memory safety vuln to win.

rubiquity•1h ago

I was distracted by the picture of the ingredients to a Final Ward being at the top of the page.

narginal•1h ago

Just like how fuzzers will find all the bugs, right? Right?? There's definitely infrastructure at these big companies that isn't sitting in a while loop 'fuzzing' right? Why is it news that vulnerability research will continue to get harder, exactly? It has always been this way, exploits will get more expensive, and the best researchers will continue with whatever tools they find useful.

tptacek•54m ago

It's a good question. Fuzzers generated a surge of new vulnerabilities, especially after institutional fuzzing clusters got stood up, and after we converged on coverage-guided fuzzers like AFL. We then got to a stable equilibrium, a new floor, such that vulnerability research & discovery doesn't look that drastically different after fuzzing as before.

Two things to notice:

* First, fuzzers also generated and continue to generate large stacks of unverified crashers, such that you can go to archives of syzkaller crashes and find crashers that actually work. My contention is that models are not just going to produce hypothetical vulnerabilities, but also working exploits.

* Second, the mechanism 4.6 and Codex are using to find these vulnerabilities is nothing like that of a fuzzer. A fuzzer doesn't "know" it's found a vulnerability; it's a simple stimulus/response test (sequence goes in, crash does/doesn't come out). Most crashers aren't exploitable.

Models can use fuzzers to find stuff, and I'm surprised that (at least for Anthropic's Red Team) that's not how they're doing it yet. But at least as I understand it, that's generally not what they're doing. It's something much closer to static analysis.
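To make the stimulus/response point concrete, here is a toy random fuzzer in Python. The target `parse_record` and its bug are invented for the example, and coverage-guided fuzzers like AFL add coverage feedback that this sketch leaves out entirely. All the loop ever learns is whether an input raised or not.

```python
import random
from typing import Callable, List


def parse_record(data: bytes) -> int:
    """Hypothetical target: first byte is a length, the rest is payload."""
    if not data:
        return 0
    length = data[0]
    payload = data[1:]
    # Bug: trusts the length byte without checking it against the payload.
    return payload[length - 1] if length else 0


def fuzz(target: Callable[[bytes], object], runs: int, seed: int = 0) -> List[bytes]:
    """Throw random inputs at target and collect the ones that raise.

    The fuzzer has no idea why an input crashed or whether the crash
    is exploitable; it only records that an exception came out.
    """
    rng = random.Random(seed)
    crashers = []
    for _ in range(runs):
        size = rng.randint(0, 8)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except Exception:
            crashers.append(data)
    return crashers


crashers = fuzz(parse_record, runs=1000)
```

Every entry in `crashers` is just a byte string that made the target fall over. Deciding which of them, if any, is a vulnerability is a separate job that the loop knows nothing about.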

staticassertion•51m ago

I suspect we'll see combinations of symbolic execution + fuzzing as contextual inputs to LLMs, with LLMs delegating highly directed tasks to these external tools that are radically faster at exploring a space with the LLM guiding based on its own semantic understanding of the code.

I'm with you, I expected this to be happening already. Funny enough, I guess even a hardened codebase isn't at that level of "we need to optimize this" currently so you can just throw tokens at the problem.

tptacek•49m ago

Right, so that's exactly how I was thinking about it before I talked to Carlini. Then I talked to Carlini for the SCW podcast. Then I wrote this piece.

I don't know that I'm ready to say that the frontier of vulnerability research with agents is modeling, fuzzing, and analysis (orchestrated by an agent). It may very well be that the models themselves stay ahead of this for quite some time.

That would be a super interesting result, and it's the result I'm writing about here.

narginal•46m ago

I have just seen too much infrastructure set up to 'find bugs,' effectively sitting and doing nothing- either the wrong thing gets audited, or tons of compute gets thrown at a code base and nobody ever checks in on or verifies.

This seems like a human/structural issue that an AI won't actually fix - attackers/defenders alike will gain access to the same models, feels a little bit like we are back to square one

tptacek•44m ago

If that's true, and if patches can effectively be pushed out quickly, then the results of this will be felt mostly by vulnerability researchers, which is the subject of the piece. But those are big "ifs".

stackghost•47m ago

I was doing TryHackMe's "advent of cyber" sidequest last christmas and used a process very much like Carlini's that is outlined in TFA.

>I'm doing a CTF. I popped a shell on this box and found this binary. Here is a ghidra decompilation. Is there anything exploitable in $function?

You can't just ask Claude or ChatGPT to do the binex for you, but even last year's models were really good at finding heap or stack vulns this way.

thadt•42m ago

So the interesting question: are we long term safer with "simpler" closer to hardware memory unsafe(ish) environments like Zig, or is the memory safe but more abstract feature set of languages like Rust still the winning direction?

If a hypothetical build step is "look over this program and carefully examine the bounds of safety using your deep knowledge of the OS, hardware, language and all the tools that come along with it", then a less abstract environment might be at an overall advantage. In a moment, I'll close this comment and go back to writing Rust. But if I had the time (or tooling) to build something in C and test it as thoroughly as say, SQLite [1], then I might think harder about the tradeoffs.

[1] https://sqlite.org/whyc.html

tonymet•36m ago

I agree AI makes exploits more accessible, it also makes pen-testing and finding vulns more accessible, in both early and late stages of product development.

AI has saved me a ton of money and time auditing. Mostly because I'm tired / lazy.

It's both a black pill & white pill, and if we have the right discipline, a tremendous white pill. Engineers can no longer claim to be "cost effective" by ignoring vulns.

rkrbaccord94f•24m ago

The pipewire-libs package local address function refers to alsa_output.pci

Driver benchmarking the pipewire script calls three local ports:

local.source.port = 10001

local.repair.port = 10002

local.control.port = 10003

samuelknight•22m ago

LLMs are expert hackers because: 1) They are expert coders, including a decently comprehensive CVE knowledge 2) They know every programming language/framework/stack 3) They know every human language

They already have super human breadth and attention. And their depth is either super human or getting there.

The state of the security industry through 2025 was expensive appsec human reviewers or primitive scanners. Now you can spend a few dollars and have an expert intelligence scrutinize a whole network.

gdulli•13m ago

So much of the current internet is posts that read as a superposition of sincere and parody, and until that's resolved how do you know how to respond?

vibe42•10m ago

If everyone is running the same models, does this not favour white hat / defense?

Since many exploits consist of several vulnerabilities used in a chain, if an LLM finds one in the middle and it's fixed, that can change a zero day to something of more moderate severity?

E.g. someone finds a zero day that's using three vulns through different layers. The first and third are super hard to find, but the second is of moderate difficulty.

Automated checks by not even SOTA models could very well find the moderate difficulty vuln in the middle, breaking the chain.

anematode•8m ago

Ya, I tend to believe that (most) human VR will be obsoleted well before human software engineering. Software engineering is a lot more squishy and has many more opportunities to go off the rails. Once a goal is established, the output of VR agents is verifiable.
