Amidst a lot of analyses and results I can vaguely understand, this conclusion stands out:
We assess that Claude Mythos Preview does not cross the automated AI-R&D capability threshold. We hold this with less confidence than for any prior model. The most significant factor in this determination is that we have been using it extensively in the course of our day-to-day work and exploring where it can automate such work, and it does not seem close to being able to substitute for Research Scientists and Research Engineers, especially relatively senior ones. Although we believe this is an informed determination, it is inherently difficult to make its basis legible, given the model’s very strong performance at tasks that are well-defined and verifiable enough to serve as formal evaluations.
The ECI slope-ratio measurement we introduce in section 2.3.6 shows an upward bend in the capability trajectory at this model, though the degree of the upward bend varies significantly across dataset and methodological changes we made to stress-test it. The identifiable driver traces to specific human research advances made without meaningful assistance from the models then available. That said, we will be continuing to monitor this trend to see whether acceleration continues, especially if this is plausibly traceable to AI’s own contributions.
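The report doesn't spell out how the ECI slope-ratio is computed, but the name suggests comparing the slope of the capability trajectory after a given release to the slope before it. A minimal sketch of that idea, with entirely hypothetical scores and a made-up `slope_ratio` helper (not Anthropic's actual method):

```python
import numpy as np

def slope_ratio(release_idx, scores, split):
    """Illustrative slope-ratio: fit a line to capability scores before and
    after a chosen release index, and return slope_after / slope_before.
    A ratio above 1 would indicate an upward bend in the trajectory."""
    slope_before = np.polyfit(release_idx[:split], scores[:split], 1)[0]
    slope_after = np.polyfit(release_idx[split:], scores[split:], 1)[0]
    return slope_after / slope_before

# Hypothetical scores over six releases; the last two bend upward.
idx = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([1.0, 1.2, 1.4, 1.6, 2.1, 2.7])
print(slope_ratio(idx, scores, split=4))  # > 1 implies an upward bend
```

The report's caveat maps directly onto this sketch: the ratio depends on which datasets feed `scores` and where you place `split`, which is presumably why the measured bend "varies significantly across dataset and methodological changes."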
xarchive•2h ago
The bottom line: This new Claude model is not yet capable enough to autonomously do AI research — but it's closer than any previous model, and Anthropic is nervous about it.
What's the "automated AI-R&D capability threshold"?
Anthropic has defined a danger line: if an AI can independently do the work of AI researchers, that's a big deal — because then AI could start improving itself without humans in the loop. This assessment is asking: has this model crossed that line?
Why are they less confident than usual?
With past models, the answer was a comfortable "no." This time, they're saying "no, but..." — it's a much closer call. They're hedging.
xarchive•2h ago
The AI researchers designed tests to evaluate whether the model can do their real day-to-day work. They found that Mythos scored well on structured tests, but they know firsthand that structured tests don't capture the open-ended, intangible aspects of AI research. So: interesting results, but AI can't replace them yet, and AGI is still far away.
xarchive•2h ago
That's how they reached this conclusion.