https://blog.mozilla.org/en/firefox/hardening-firefox-anthro...
https://www.wsj.com/tech/ai/send-us-more-anthropics-claude-s...
2. I wouldn't say LLMs are "better" than other fuzzers. Someone would need to measure findings/cost for that. But many LLMs do work at a higher level than most fuzzers, as they can generate plausible-looking source code.
> “Crude” is an important caveat here. The exploits Claude wrote only worked on our testing environment, which intentionally removed some of the security features found in modern browsers. This includes, most importantly, the sandbox, the purpose of which is to reduce the impact of these types of vulnerabilities. Thus, Firefox’s “defense in depth” would have been effective at mitigating these particular exploits.
Firefox has never required a full chain exploit in order to consider something a vulnerability. A large proportion of disclosed Firefox vulnerabilities are vulnerabilities in the sandboxed process.
If you look at Firefox's Security Severity Rating doc: https://wiki.mozilla.org/Security_Severity_Ratings/Client what you'll see is that vulnerabilities within the sandbox, and sandbox escapes, are both independently considered vulnerabilities. Chrome considers vulnerabilities in a similar manner.
I can hire an intern for that.
I believe there is a theoretical cap on the capabilities of LLMs. I'm wondering what it looks like.
However, that won't happen unless Firefox just stops developing. New code comes with new bugs, and there must be some people or some tooling to find them.
1. Producing new tests to increase coverage. Migrating you to property testing. Setting up fuzzing. Setting up more static analysis tooling. All of that would normally take "time" but now it's a background task.
2. They can find some vulnerabilities. They are "okay" at this, but if you are willing to burn tokens then it's fine.
3. They are absolutely wrong sometimes about something being safe. I have had Claude very explicitly state that a security boundary existed when it didn't. That is, it appeared to exist in the same way that a chroot appears to confine, and it was intended to be a security boundary, but it was not a sufficient boundary whatsoever. Multiple models not only identified the boundary and stated it exists but referred to it as "extremely safe" or other such things. This has happened to me a number of times and it required a lot of nudging for it to see the problems.
4. They often seem to do better with "local" bugs. Often something that has the very obvious pattern of an unsafe thing. Sort of like "that's a pointer deref" or "that's an array access" or "that's `unsafe {}`" etc. They do far, far worse the less "local" a vulnerability is. Product features that interact in unsafe ways when combined, that's something I have yet to have an AI be able to pick up on. This is unsurprising - if we trivialize agents as "pattern matchers", well, spotting some unsafe patterns and then validating the known properties of that pattern to validate is not so surprising, but "your product has multiple completely unrelated features, bugs, and deployment properties, which all combine into a vulnerability" is not something they'll notice easily.
It's important to remain skeptical of safety claims by models. Finding vulns is huge, but you need to be able to spot the mistakes.
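The property-testing suggestion in item 1 might look like this minimal sketch. The helper function and its invariants are hypothetical, and the case generation is hand-rolled here; libraries like Hypothesis do the generation and shrinking for you:

```python
import random
import string

def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Hypothetical helper: cut a UTF-8 byte string to at most 'limit'
    bytes without leaving a broken multi-byte character at the end."""
    cut = data[:limit]
    while True:
        try:
            cut.decode("utf-8")
            return cut
        except UnicodeDecodeError:
            cut = cut[:-1]

# Property-style test: instead of a few hand-picked cases, assert
# invariants over many generated inputs.
rng = random.Random(0)
alphabet = string.ascii_letters + "éüßαβ€😀"
for _ in range(1000):
    s = "".join(rng.choice(alphabet) for _ in range(rng.randrange(20)))
    limit = rng.randrange(16)
    out = truncate_utf8(s.encode("utf-8"), limit)
    assert len(out) <= limit                  # never exceeds the budget
    out.decode("utf-8")                       # always valid UTF-8
    assert s.encode("utf-8").startswith(out)  # always a prefix of the input
```

The point is that the three invariants cover the interesting edge cases (cuts in the middle of a multi-byte character) without anyone having to enumerate them by hand.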
I agree that LLMs are sometimes wrong, which is why this new method here is so valuable - it provides us with easily verifiable testcases rather than just some kind of analysis that could be right or wrong. Purely triaging through vulnerability reports that are static (i.e. no actual PoC) is very time consuming and false-positive prone (same issue with pure static analysis).
I can't really confirm the part about "local" bugs anymore though, but that might also be a model thing. When I did experiments longer ago, this was certainly true, esp. for the "one shot" approaches where you basically prompt it once with source code and want some analysis back. But this actually changed with agentic SDKs where more context can be pulled together automatically.
I've personally used two AI-first static analysis security tools and found great results, including interesting business logic issues, across my employer's SaaS tech stack. We integrated one of the tools. I look forward to getting employer approval to say which, but that hasn't happened yet, sadly.
If you're already at a very high coverage, the remaining bits are presumably just inherently difficult.
I couldn't bring myself to switch to the (even) more press-releasey title.
What I was thinking was, "Chromium team is definitely not going to collaborate with us because they have Gemini, while Safari belongs to a company that operates in a notoriously secretive way when it comes to product development."
They picked the most open-source one.
Sure, there are closed-source parts of Safari, but I'd guess at least 90% of Safari's attack surface is in WebKit and its components.
This is going to be the case when automating attack detection against most programs where a portion is obscured.
There are also WebKit-based Linux browsers, which obviously do not use closed-source OS interfaces.
My pessimistic guess on reasoning is that they suspected Firefox to have more tech debt.
You say many cases, let's see some examples in Safari.
Safari (closed source)
├─ UI / tabs / preferences
├─ macOS / iOS integration
└─ WebKit framework (open source) ~60%
├─ WebCore (HTML/CSS/DOM)
├─ JavaScriptCore (JS engine)
└─ Web Inspector

And this is a good reminder for me to add a prompt about property testing being preferred over straight unit tests and maybe to create a prompt for fuzz testing the code when we hit Ready state.
I don't think the problems I work on require the weight of formal verification, but I'm open to being wrong.
The origin of it is a hypothesis I can get better quality code out of agents by making them do the things I don't (or don't always). So rather than quitting at ~80% code coverage, I am asking it to cover closer to 95%. There's a code complexity gate that I require better grades on than I would for myself because I didn't write this code, so I can't say "Eh, I know how it works inside and out". And I keep adding little bits like that.
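A coverage gate like the ~95% bar mentioned above can be enforced mechanically rather than by spot-checking; for example, coverage.py will fail the run below a configured threshold (the value here is just the one from the comment above):

```ini
# .coveragerc — fail the run when coverage drops below the agent-facing bar
[report]
fail_under = 95
```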
I think the agents have only used it 2 or 3 times. The one that springs to mind is a site I am "working" on where you can only post once a day. In addition, there's an exponential backoff system for bans to fight griefers. If you look at them at the same time, they're the same idea for different reasons, "User X should not be able to post again until [timestamp]" and there's a set of a dozen or so formal method proofs done in z3 to check the work that can be referenced (I think? god this all feels dumb and sloppy typed out) at checkpoints to ensure things have not broken the promises.
Mozilla wants AI money.
That's because there were none. All bugs came with verifiable testcases (crash tests) that crashed the browser or the JS shell.
For the JS shell, similar to fuzzing, a small fraction of these bugs were bugs in the shell itself (i.e. testing only) - but according to our fuzzing guidelines, these are not false positives and they will also be fixed.
Did you also test on old source code, to see if it could find the vulnerabilities that were already discovered by humans?
Then, with each model having a different training epoch, you end up with no useful comparison to decide whether new models are improving the situation. I don't doubt they are; I'm just not sure this is a way to show it.
> “Our first step was to use Claude to find previously identified CVEs in older versions of the Firefox codebase. We were surprised that Opus 4.6 could reproduce a high percentage of these historical CVEs”
I think it was curl that closed its bug bounty program due to AI spam.
The curl situation was completely different because as far as I know, these bugs were not filed with actual testcases. They were purely static bugs and those kinds of reports eat up a lot of valuable resources in order to validate.
The level of AI spam for Firefox security submissions is a lot lower than the curl people have described. I'm not sure why that is. Maybe the size of the code base and the higher bar to submitting issues plays a role.
There's some nuance here. I fixed a couple of shell-only Anthropic issues. At least mine were cases where the shell-only testing functions created situations that are impossible to create in the browser. Or at least, after spending several days trying, I managed to prove to myself that it was just barely impossible. (And it had been possible until recently.)
We do still consider those bugs and fix them one way or the other -- if the bug really is unreachable, then the testing function can be weakened (and assertions added to make sure it doesn't become reachable in the future). For the actual cases here, it was easier and better to fix the bug and leave the testing function in place.
We love fuzz bugs, so we try to structure things to make invalid states as brittle as possible so the fuzzers can find them. Assertions are good for this, as are testing functions that expose complex or "dangerous" configurations that would otherwise be hard to set up just by spewing out bizarre JS code or whatever. It causes some level of false positives, but it greatly helps the fuzzers find not only the bugs that are there, but also the ones that will be there in the future.
(Apologies for amusing myself with the "not only X, but also Y" writing pattern.)
LLMs made it harder to run bug bounty programs where anyone can submit stuff, and where a lot of people flooded them with seemingly well-written but ultimately wrong reports.
On the other hand, the newest generation of these LLMs (in their top configuration) finally understands the problem domain well enough to identify legitimate issues.
I think a lot of judging of LLMs happens on the free and cheaper tiers, and quality on those tiers is indeed bad. If you set up a bug bounty program, you'll necessarily get bad quality reports (as cost of submission is 0 usually).
On the other hand, if instead of a bug bounty program you have a "top-tier LLM bug searching program", then the quality bar can be ensured, and maintainers will be getting high-quality reports.
Maybe one can save bug bounty programs by requiring a fee to be paid, idk, or by using an LLM there, too.
Free for 6 months after which it auto-renews if I recall correctly.
Their OSS offer is first-hit-is-free.
are there any projects to auto-verify submitted bug reports? perhaps by spinning up a VM and then having an agent attempt to reproduce the bug report? that would be neat.
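A toy version of that harness might look like the sketch below. It runs the submitted testcase in a separate process and checks whether the process died abnormally; a real setup would use a VM or container image of the target instead of a bare subprocess, but the shape is the same:

```python
import os
import subprocess
import sys
import tempfile

def reproduce(poc_source: str, timeout: int = 30) -> bool:
    """Run an untrusted testcase in a separate process and report
    whether it crashed (non-zero exit status or a signal). A real
    harness would launch the target browser/shell inside a throwaway
    VM or container rather than a bare subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(poc_source)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,
        )
        # On POSIX, a crash via a signal shows up as a negative returncode.
        return result.returncode != 0
    finally:
        os.unlink(path)

# A "PoC" that aborts the process versus one that exits cleanly.
assert reproduce("import os; os.abort()") is True
assert reproduce("print('benign')") is False
```

Reports that fail to reproduce could then be bounced back automatically before a human ever looks at them.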
(/s if it’s not clear)
I am concerned.
> Opus 4.6 is currently far better at identifying and fixing vulnerabilities than at exploiting them. This gives defenders the advantage. And with the recent release of Claude Code Security in limited research preview, we’re bringing vulnerability-discovery (and patching) capabilities directly to customers and open-source maintainers.
> But looking at the rate of progress, it is unlikely that the gap between frontier models’ vulnerability discovery and exploitation abilities will last very long. If and when future language models break through this exploitation barrier, we will need to consider additional safeguards or other actions to prevent our models from being misused by malicious actors.
> We urge developers to take advantage of this window to redouble their efforts to make their software more secure. For our part, we plan to significantly expand our cybersecurity efforts, including by working with developers to search for vulnerabilities (following the CVD process outlined above), developing tools to help maintainers triage bug reports, and directly proposing patches.
That’s a different kind of productivity but equally valuable.
1. Accompanying minimal test cases
2. Detailed proofs-of-concept
3. Candidate patches
This is most similar to fuzzing, and in fact could be considered another variant of fuzzing, so I'll compare to that. Good fuzzing also provides minimal test cases. The Anthropic ones were not only minimal but well-commented with a description of what it was up to and why. The detailed descriptions of what it thought the bug was were useful even though they were the typical AI-generated descriptions that were 80% right and 20% totally off base but plausible-sounding. Normally I don't pay a lot of attention to a bug filer's speculations as to what is going wrong, since they rarely have the context to make a good guess, but Claude's were useful and served as a better starting point than my usual "run it under a debugger and trace out what's happening" approach. As usual with AI, you have to be skeptical and not get suckered in by things that sound right but aren't, but that's not hard when you have a reproducible test case provided and you yourself can compare Claude's explanations with reality.

The candidate patches were kind of nice. I suspect they were more useful for validating and improving the bug reports (and these were very nice bug reports). As in, if you're making a patch based on the description of what's going wrong, then that description can't be too far off base if the patch fixes the observed problem. They didn't attempt to be any wider in scope than they needed to be for the reported bug, so I ended up writing my own. But I'd rather them not guess what the "right" fix was; that's just another place to go wrong.
I think the "proofs-of-concept" were the attempts to use the test case to get as close to an actual exploit as possible? I think those would be more useful to an organization that is doubtful of the importance of bugs. Particularly in SpiderMonkey, we take any crash or assertion failure very seriously, and we're all pretty experienced in seeing how seemingly innocuous problems can be exploited in mind-numbingly complicated ways.
The Anthropic bug reports were excellent, better even than our usual internal and external fuzzing bugs and those are already very good. I don't have a good sense for how much juice is left to squeeze -- any new fuzzer or static analysis starts out finding a pile of new bugs, but most tail off pretty quickly. Also, I highly doubt that you could easily achieve this level of quality by asking Claude "hey, go find some security bugs in Firefox". You'd likely just get AI slop bugs out of that. Claude is a powerful tool, but the Anthropic team also knew how to wield it well. (They're not the only ones, mind.)
But for most other projects, it probably only costs $3 worth of tokens. So you should assume the bad guys have already done it to your project looking for things they can exploit, and it no longer feels responsible to not have done such an audit yourself.
Something that I found useful when doing such audits for Zulip's key codebases is to ask the model to carefully self-review each finding; that removed the majority of the false positives. Most of the rest we addressed via adding comments that would help developers (or a model) casually reading the code understand what the intended security model is for that code path... And indeed most of those did not show up on a second audit done afterwards.
This post did a little bit of that but I wish it had gone into more detail.
Turns out it's the other way around - AI is protecting the Mozilla Foundation from us.
JumpCrisscross•3h ago
Rando here. It gives a signal on the account’s other comments, as well as the value of the original comment (as a hypothesis, albeit a wrong one, versus blind raging).
john_strinlai•2h ago
fair enough. i typically use karma as a rough proxy for that, especially when the user has a lot of it (like, in this case, where the poster is #17 on the leaderboard with 100,000+ karma). you dont get that much karma if you are consistently posting bad takes.
>as well as the value of the original comment (as a hypothesis, albeit a wrong one, versus blind raging).
i dont see, in this case anyways, how or why that distinction would matter or change anything (in this case specifically, what would you change or do differently if it was a hypothesis or simple "raging"?), but im probably just thinking about it incorrectly.
TheBicPen•2h ago
I wonder how true that is. While this site doesn't incentivize engagement-maximizing behaviour (posting ragebait) like some other sites do, I would imagine that simply posting more is the best way to accrue karma long-term.
john_strinlai•2h ago
i definitely agree, which is why i use it as a rough proxy rather than ground truth, but i have my doubts that you can casually "post more" your way into the top 20 karma users of all time.
john_strinlai•1h ago
just like you were genuinely trying to understand where pjmlp was coming from, i was genuinely trying to understand what you would get out of an answer to your question (or, like, what the next reply could even be other than "ok, cool").
iosifache•8h ago
[1] https://www.mozilla.org/en-US/security/advisories/mfsa2026-1...
[2] https://www.anthropic.com/news/mozilla-firefox-security