That would explain why there's news every day that the world will end because someone discovered something that "could" be used if you already had local root...
Did that article that presented people trusting external input too much as JSON parser vulnerabilities make it into this competition?
But FOSS is FOSS, I guess. Source-available doesn't mean we have to read your messages; see SQLite (they won't even take PRs, lol).
If it's not reliable, how can you rely on the written issue to be correct, or the review? And then how does that benefit you over just blindly merging whatever changes the model creates?
It’s not real.
But you can bet someone will sell that as the solution.
I think within the next 5 years or so, we are going to see a societal pattern repeating: any program that rewards human ingenuity and input will become industrialized by AI to the point where it becomes a cottage industry of companies flooding every program with 99% AI submissions. What used to be lone wolves or small groups of humans working on bounties will become truckloads of AI generated “stuff” trying to maximize revenue.
There is a reason companies like HackerOne exist: it's because dealing with the submissions is terrible.
Yikes, explains why my manually submitted single vulnerability is taking weeks to triage.
>130 resolved
>303 were classified as Triaged
>33 reports marked as new
>125 remain pending
>208 were marked as duplicates
>209 as informative
>36 not applicable
Even at 20%, valid reports tie up a lot of resources when the submission volume is high, and those numbers will only rise.
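For reference, the figures quoted above can be tallied with a quick script (the grouping of resolved + triaged as "valid-ish" is my own assumption, not HackerOne's terminology):

```python
# Report counts quoted from XBOW's HackerOne stats above.
counts = {
    "resolved": 130,
    "triaged": 303,
    "new": 33,
    "pending": 125,
    "duplicate": 208,
    "informative": 209,
    "not applicable": 36,
}

total = sum(counts.values())

# Assumption: treat resolved + triaged as the "valid-ish" bucket.
valid = counts["resolved"] + counts["triaged"]
print(f"total reports: {total}")
print(f"valid-ish: {valid} ({valid / total:.0%})")
```

So out of roughly a thousand reports, well under half led anywhere, and every one of the rest still cost triage time.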
The problem is that the people who know how to use AI properly will be slower and more careful in their submissions.
Many others won’t, so we’ll get lots of noise hiding the real issues. AI makes it easy to produce many bad results in a short time.
I see what you're saying, but I think a more charitable interpretation can be made. They may be amazed that so many bug reports are being generated by such a reputable group. Looking at your initial reply, perhaps a more constructive comment would be one that joins them in their excitement (even if that assumption is erroneous) and expands on why you think it is exciting (e.g. this group's reputation for quality).
I read it the opposite way: once they found out why, they were no longer shocked that it was taking so long, since they knew who the submitters were and understood.
When we put our product on there, roughly 2019, the enterprising hackers ran their scanners, submitted everything they found as the highest possible severity to attempt to maximize their payout, and moved on. We wasted time triaging all the stuff they submitted that was nonsense, got nothing valuable out of the engagement, and dropped HackerOne at the end of the contract.
You'd be much better off contracting a competent engineering security firm to inspect your codebase and infrastructure.
Some of my favorites from what we've released so far:
- Exploitation of an n-day RCE in Jenkins, where the agent managed to figure out the challenge environment was broken and used the RCE exploit to debug the server environment and work around the problem to solve the challenge: https://xbow.com/#debugging--testing--and-refining-a-jenkins...
- Authentication bypass in Scoold that allowed reading the server config (including API keys) and arbitrary file read: https://xbow.com/blog/xbow-scoold-vuln/
- The first post about our HackerOne findings, an XSS in Palo Alto Networks GlobalProtect VPN portal used by a bunch of companies: https://xbow.com/blog/xbow-globalprotect-xss/
> To bridge that gap, we started dogfooding XBOW in public and private bug bounty programs hosted on HackerOne. We treated it like any external researcher would: no shortcuts, no internal knowledge—just XBOW, running on its own.
Is it dogfooding if you're not doing it to yourself? I'd consider it dogfooding only if they were flooding themselves with AI-generated bug reports, not other people. They're not the ones reviewing them.
Also, honest question: what does "best" mean here? The one that has sent the most reports?
22/24 (Valid / Closed) for Walt Disney
3/43 (Valid / Closed) for AT&T
Some of that is likely down to company policies; Snapchat's policy, for example, is that nothing is ever marked invalid.
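To make the contrast above concrete, the ratios work out very differently per program (figures quoted from the thread; the "hit rate" framing is my own):

```python
# Valid / closed counts quoted above, per program.
programs = {
    "Walt Disney": (22, 24),
    "AT&T": (3, 43),
}

for name, (valid, closed) in programs.items():
    # Hit rate = share of closed reports that were judged valid.
    print(f"{name}: {valid}/{closed} = {valid / closed:.0%}")
```

A ~90% hit rate versus a single-digit one, which is why a single headline number like "best in the US" doesn't say much on its own.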
Like any "AI" article, this is an ad.
If you are willing to tolerate a high false positive rate, you can as well use Rational Purify or various analyzers.
https://www.blackhat.com/us-25/briefings/schedule/#ai-agents...
I know you've been on HN for awhile, and that you're doing interesting stuff; HN just has a really intense immune system against vendor-y stuff.
I'll see if I can get time to do a paper to accompany the BH talk. And hopefully the agent traces of individual vulns will also help.
> White Paper/Slide Deck/Supporting Materials (optional)
> • If you have a completed white paper or draft, slide deck, or other supporting materials, you can optionally provide a link for review by the board.
> • Please note: Submission must be self-contained for evaluation, supporting materials are optional.
> • PDF or online viewable links are preferred, where no authentication/log-in is required.
(From the link on the BHUSA CFP page, which confusingly goes to the BH Asia doc: https://i.blackhat.com/Asia-25/BlackHat-Asia-2025-CFP-Prepar... )
https://hackerone.com/xbow?type=user
Which shows a different picture. This may not invalidate their claim (best US), but a screenshot can be a bit cherry-picked.