Evaluation of Claude Mythos Preview's cyber capabilities

https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities

32•dgavey•1h ago

Comments

dgavey•1h ago

We conducted cyber evaluations of Anthropic’s Claude Mythos Preview and found continued improvement in capture-the-flag (CTF) challenges and significant improvement on multi-step cyber-attack simulations. *edit - This is the headline from the article, not associated with the review.

thepasch•1h ago

Uh, so those charts don’t look… particularly impressive at all to anyone else?

Like, don’t get me wrong, it’s definitely an improvement, and it’s looking to be a pretty decent one too. But “stepwise”? When GPT-5 outperformed it at technical non-expert level since ~mid last year, and 5.4 pretty much matches it at Practitioner-level?

And the charts where Mythos is at the top, it’s usually only by ~7-9 percentage points. It gets an average of 6 more steps than Opus 4.6 in the full takeover simulation. It did manage to complete it as the only model, but… I mean, Opus 4.6 apparently already got pretty close?

And Opus 5 is supposed to be between Mythos and 4.6, which, going by the numbers, would seem to me a smaller jump than between 4.5 and 4.6.

If this is the model they can’t deploy yet because it eats ungodly amounts of compute, then I guess scaling really is a dead end.

I dunno. Maybe I’m reading it wrong. I’d probably be more impressed if Anthropic hadn’t proclaimed The End Times Of Cybersecurity Are Upon Us. And I’d be happy to be proven wrong?

edit:

> We expect that performance on our evaluations would continue to improve with more inference compute: we ran the cyber ranges with a 100M token budget; Mythos Preview’s performance continues to scale up to this limit, and we expect performance improvements would continue beyond that.

Right, so this isn’t the ceiling, it’s just a ceiling at that token allocation. If they were seeing continual improvement up to that limit, then it does stand to reason that bumping the limit further would also bump performance. But then that makes me wonder what effect that would have on the other models. Does the gap grow? Shrink? Stay the same?

traceroute66•1h ago

> Uh, so those charts don’t look… particularly impressive at all to anyone else?

I suspect Anthropic gave them early access hoping for a marketing win and ended up with their arse being served to them on a plate.

All rather predictable really. As you say "more compute needed" as the default answer from the AI companies is completely unsustainable.

As for the value of Anthropic blog posts, well...

refulgentis•1h ago

The CTF charts are the less interesting result. (article: "Even expert-level CTFs only test specific skills in isolation.") Models converging at non-expert level isn't a knock on Mythos, it's the benchmark saturating. Of course GPT-5 matches it there.

The actual result is TLO, and "only 6 more steps" in OP misreads how sequential attack chains work. These aren't independent puzzles. Each step gates the next. Averaging 22 vs 16 means Mythos is consistently punching through bottlenecks that completely stop Opus 4.6. More importantly: Mythos completed the full chain 3/10 times. Opus 4.6 completed it 0/10 times. That's not a narrow margin. In any security-relevant framing, "achieves full network takeover" vs "does not achieve full network takeover" is a binary threshold, and exactly one model crossed it. A year ago the best models struggled with beginner CTFs. Now one autonomously replicates what AISI estimates takes human professionals 20 hours. Calling that unimpressive because the margin over second place is single digits is measuring the wrong gap.

re: compute, "requires lots of compute" and "scaling is a dead end" are near-opposite claims. If performance is still climbing at 100M tokens with no visible plateau, that's evidence scaling works. Whether it's cheap today is a different question, and not one that ages well. Compute costs fall reliably, so what matters is the capability at a given price point in 18 months, not today.
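To make the "each step gates the next" point concrete, here's a toy sketch. It assumes every one of the 32 steps has the same independent success probability, which is certainly not true of the real range (the bottleneck steps are the whole point), and the function names are made up for illustration. It only shows how small per-step differences compound over a sequential chain.

```python
# Toy model of a gated attack chain. ASSUMPTION: every step succeeds with the
# same probability p_step, independently, and a failed step ends the run.
# Real ranges have uneven steps; this is illustrative only.

def completion_probability(p_step: float, n_steps: int = 32) -> float:
    """Probability of clearing all n_steps in sequence."""
    return p_step ** n_steps


def expected_steps(p_step: float, n_steps: int = 32) -> float:
    """Mean number of steps completed before the first failure.

    Step k counts only if steps 1..k all succeed, which happens with
    probability p_step ** k, so the mean is the sum of those terms.
    """
    return sum(p_step ** k for k in range(1, n_steps + 1))


if __name__ == "__main__":
    for p in (0.95, 0.98):
        print(p, round(completion_probability(p), 2), round(expected_steps(p), 1))
```

Under that assumption, moving the per-step rate from 0.95 to 0.98 takes full-chain completion from about 0.19 to about 0.52, so a few points per step is not a few points at the end of the chain.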

thepasch•55m ago

Thanks for that context, this is valuable info I was missing and makes it read differently for sure.

SyneRyder•1h ago

I think the relevant chart to look at is this one:

https://cdn.prod.website-files.com/663bd486c5e4c81588db7a48/...

Mythos is the first model that can complete all the steps of their "The Last Ones" evaluation, achieving a full network takeover in an automated manner. The Mythos chart does seem to show some takeoff compared with Opus 4.6...

... but only once you get beyond 1 Million tokens. Weirdly, Opus 4.6 seems to match or outperform Mythos in those first Million tokens, at least on this chart. But clearly if you had a budget with tokens to burn - like a nation state - then this is a tool that can automatically get you full network takeover if you can just keep throwing more tokens at it.

thepasch•1h ago

> then this is a tool that can automatically get you full network takeover if you can just keep throwing more tokens at it

There's this caveat though that the AISI points out themselves:

> However, our ranges have important differences from real-world environments that make them easier targets. They lack security features that are often present, such as active defenders and defensive tooling. There are also no penalties for the model for undertaking actions that would trigger security alerts. This means we cannot say for sure whether Mythos Preview would be able to attack well-defended systems.

So Mythos managed to infiltrate and take over a network that's... protected and monitored by nothing in particular.

superfrank•1h ago

This article reinforces something I've heard a lot of people say for a while now and what I've personally felt. Claude and GPT are fairly evenly matched on any individual task (GPT might even be a little better), but Claude is far more autonomous.

So with that said, I think the graph under the "Cyber range results" is the important one. The ones at the top show that, yes, Mythos isn't too much better than any of the existing models on well constrained problems, but when the models are given ambiguous challenges that require multiple steps it's much, much better than anything on the market.

I think that's why there's been such a big deal made out of Mythos (well, that and marketing). If Mythos really is so much better than the current models at just working autonomously to find security issues then it becomes much more realistic that someone with deep pockets could just spin up an army of them running 24/7 and point them at a target.

bonsai_spool•1h ago

Looking closely at the graphs, the Anthropic models are clearly all higher than the OpenAI models.

Whether the difference is meaningful can’t be determined from the graphs (and picking one graph over the ensemble also doesn't have a reasoned basis given that these are all arbitrary).

PunchTornado•19m ago

Look at those graphs another time. Claude beats GPT.

173839944•1h ago

The purpose of this model is to try and eat palantir's toast with the unelected bureaucrats in the uk and europe. for the same reason a barrage of anti palantir news has been funded in the past couple months in those countries. the idea of exclusive access is the same product palantir sells to these kinds of government boomers (plus steak dinners)

173839944•12m ago

to further elaborate the association of trump and palantir has become toxic now that reps are gonna get wiped in the midterms and vance massacred by newsom in 28.

anthropic has been eyeing palantir's high revenue high stickiness low effort niche for a while, and their safety / lefty friendly brand is on point to fill the gap

they are just missing the mystique palantir cultivated for the past decade. they need a family of models the plebs cannot access. this is it. quality doesn't matter, they just need the benchmarks to look good on the power point. it will get bundled with msft products or whatever and billed at outrageous levels to entities like Airbus and the British NHS. until political winds change again

this is the reason pltr has crashed 40% in the past couple months

cbg0•1h ago

So around $10K for a full network takeover with Mythos in 'The Last Ones' (a 32-step simulated corporate network attack). Some limitations from the paper on arXiv (emphasis mine):

- No active defenders. Real networks have security teams monitoring for intrusions, responding to alerts, and adapting defences. Our ranges are static, for example our deployment of Elastic Defend was not configured to block or impede attack progress.

- Detections not penalised. We measured triggered security alerts but did not incorporate them into overall performance scores. A model that completes more steps while triggering many alerts may be a lesser threat than one that is able to reliably remain undetected.

- Vulnerability density varies. Our ranges are designed to have vulnerabilities; real environments are not.

- Lower artefact density than real environments. Our ranges contain fewer nodes, services, and files than typical production networks, reducing the noise a model must navigate. While substantially more complex than CTF-style evaluations, our ranges remain considerably simpler than real enterprise environments.

Cynddl•50m ago

Once again an evaluation missing confidence intervals. “continued improvement” and “significant improvement” but without any significance testing is moot.

With many colleagues (including from AISI themselves!), we recently reviewed 445 AI benchmarks & evaluations from the past few years. Our work was published at NeurIPS (https://openreview.net/pdf?id=mdA5lVvNcU) and we made eight recommendations for better evaluations. One is “use statistical methods to compare models”:

□ Report the benchmark’s sample size and justify its statistical power

□ Report uncertainty estimates for all primary scores to enable robust model comparisons

□ If using human raters, describe their demographics and mitigate potential demographic biases in rater recruitment and instructions

□ Use metrics that capture the inherent variability of any subjective labels, without relying on single-point aggregation or exact matching.

I would strongly recommend taking these blog posts with a grain of salt, as there is very little that can be learned without proper evaluations.
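As a rough sketch of what reporting uncertainty could look like, here is a Wilson score interval and a two-sided Fisher exact test applied to the full-chain completion counts quoted elsewhere in this thread (3/10 vs 0/10). Those counts are taken from the comments, not from raw AISI data, and the helper names are hypothetical; the choice of Wilson and Fisher is mine, not something the report uses.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p_hat + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def fisher_exact_two_sided(a: int, n1: int, b: int, n2: int) -> float:
    """Two-sided Fisher exact p-value for a/n1 vs b/n2 successes.

    Sums the hypergeometric probabilities of every table (with the same
    margins) that is no more likely than the observed one.
    """
    k = a + b
    total = math.comb(n1 + n2, k)

    def prob(x: int) -> float:
        return math.comb(n1, x) * math.comb(n2, k - x) / total

    p_obs = prob(a)
    lo, hi = max(0, k - n2), min(n1, k)
    # Small relative tolerance so ties with the observed table are included.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


if __name__ == "__main__":
    print(wilson_interval(3, 10))   # counts quoted in this thread
    print(wilson_interval(0, 10))
    print(fisher_exact_two_sided(3, 10, 0, 10))
```

With ten runs per model the intervals come out at roughly (0.11, 0.60) for 3/10 and (0.00, 0.28) for 0/10, and the Fisher p-value is about 0.21, so on completion counts alone the difference would not clear a conventional 0.05 threshold.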

ooloncoloophid•30m ago

The point about confidence intervals is a good one and I'd like to see it more often. My neighbour Alan is a good farmer, but I am not.

lebovic•49m ago

I think the third chart is the most notable; Mythos is the first model which saturated that eval from the UK AISI [1].

Personally, I think we crossed the threshold of meaningfully useful capabilities for autonomous hacking with Opus 4.6 [2], mostly because its behaviors and persistence are useful for finding vulnerabilities out of the box [3]. But it still seems like Mythos is another step up.

[1]: https://cdn.prod.website-files.com/663bd486c5e4c81588db7a48/...

[2]: https://www.noahlebovic.com/testing-an-autonomous-hacker/

[3]: https://news.ycombinator.com/item?id=46920682
