Claude Fable 5: mid-tier results on coding tasks

https://www.endorlabs.com/learn/claude-fable-5-mythos-grade-hype

120•bugvader•5h ago

Comments

bensyverson•1h ago

> The dominant mechanism, and the one no prompt instruction can prevent: the model has simply seen the upstream fix during training and reproduces it…

> On numpy, the patch is 100% character-for-character identical to the golden patch… down to idiosyncratic comments like "Extending singleton dimension for 'reflect' is legacy behavior; it really should raise an error."

This… seems like a flaw in the benchmark suite methodology. From what I can tell, they find an existing exploit, then rewind the git history to before the patch, and ask the model to fix the exploit. All well and good as long as the patch went in after the training cutoff.
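
For illustration, a "character-for-character identical" check like the one quoted above could be sketched as follows. This is a minimal sketch, not the article's actual pipeline: the function names and the 0.98 threshold are assumptions made up for this example.

```python
import difflib


def patch_similarity(candidate: str, golden: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two patch texts."""
    # autojunk=False keeps long patches from being scored heuristically.
    matcher = difflib.SequenceMatcher(None, candidate, golden, autojunk=False)
    return matcher.ratio()


def looks_like_recall(candidate: str, golden: str, threshold: float = 0.98) -> bool:
    """Flag a candidate patch that is (nearly) identical to the golden patch.

    The threshold is an illustrative assumption, not a value from the article.
    """
    return patch_similarity(candidate, golden) >= threshold
```

A high ratio is only a hint: a one-line fix can match the upstream patch by coincidence, so the idiosyncratic comments mentioned in the quote are stronger evidence than the score alone.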

timfsu•1h ago

Yeah it’s hard to call that cheating from a model. Maybe “disqualifying” is more accurate

eli•1h ago

The other "cheating" examples are even worse. It's wild to me that people keep designing benchmarks where the answer is lying around on disk or in the git history. "Hardening" the benchmark with strongly worded prompt instructions is bizarre. There are so many agent sandbox solutions. Why not use one and give it only access to the code it should see?

And I'm not sure how they can rule out other solutions also benefiting from being in the training data, just not reproduced exactly. Seems like it should focus on only CVEs from the last 30 days or something.
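
A rough sketch of that kind of filter, assuming each benchmark instance records the date its upstream fix was merged. The `Instance` type, its field names, and the placeholder IDs in the usage note are hypothetical and not taken from the article.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Instance:
    cve_id: str       # hypothetical identifier for a benchmark instance
    fix_merged: date  # date the upstream fix landed


def fresh_instances(instances, training_cutoff, margin_days=0):
    """Keep instances whose fix landed strictly after cutoff + margin."""
    earliest = training_cutoff + timedelta(days=margin_days)
    return [inst for inst in instances if inst.fix_merged > earliest]
```

For example, `fresh_instances([Instance("EXAMPLE-1", date(2025, 1, 10))], date(2025, 6, 1))` returns an empty list, because that fix predates the assumed cutoff. A "last 30 days" rule is the same call with `training_cutoff` set to today minus 30 days.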

bensyverson•1h ago

100%… the fact that they're just using prompting to discourage the agent from looking ahead in the Git history is wild.
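
One low-tech alternative to prompting, sketched under the assumption that the repository is already checked out at the pre-fix commit, is to hand the agent a copy of the working tree with the `.git` directory left out. The function name is made up for this example, and this is not how the benchmark in the article is built.

```python
import shutil
from pathlib import Path


def export_without_history(checkout: Path, dest: Path) -> Path:
    """Copy a working tree to dest, leaving out any .git directory.

    Assumes `checkout` is already at the pre-fix commit and that `dest`
    does not exist yet (shutil.copytree creates it).
    """
    shutil.copytree(checkout, dest, ignore=shutil.ignore_patterns(".git"))
    return dest
```

This only removes the local history. It does nothing about network access or about fixes the model memorized during training.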

numeri•40m ago

To be fair, it is good to know that it disobeys simple instructions like "don't examine my git history" far more than other models. (It should of course be a different benchmark, so as not to conflate things.)

It's not a great sign for alignment.

bensyverson•14m ago

Agreed, alignment is just a separate issue that a vuln fixing benchmark doesn't need to be testing.

oceliker•1h ago

Unrelated, but:

> The dominant mechanism, and the one no prompt instruction can prevent:

Writing like this is a stronger "AI-written" (specifically Claude) signal than em-dashes to me at this point. The LLM just delays committing to an answer by extending the preamble as much as possible. Is this just me?

sterlind•6m ago

Smoking gun! You've hit the nail on the head, and the case is stronger than you think.

Lerc•1h ago

Characterising it as cheating seems unfair.

The goal of a benchmark is to evaluate actual capability. Following instructions is a capability so you can measure that with a benchmark.

Already knowing the answer also provides capability, and you can measure that too.

Making a benchmark that claims to check for coding ability but actually checks memorized cases is simply measuring the wrong thing.

It diminishes the meaningfulness of the benchmark's entire results.

Making a good benchmark is hard. You have to design specifically to measure what you want to show.

When benchmarking the performance of optimising compilers, you have to actually use the result at runtime so that the compiler doesn't eliminate the entire calculation.

Just providing the answer is the correct response.

That the case does not represent general performance outside the benchmark is not cheating; it is the benchmark failing.

Training a model targeting a specific benchmark renders the benchmark useless. You could characterise training the model to do that as cheating, but that is a property of the trainers, not the model itself. The model isn't cheating, it's just asymmetrically good in a way that means the benchmark is no longer relevant to overall ability.

wewtyflakes•1h ago

I have found Fable is good for doing code failure diagnoses but lackluster at its corresponding remediation. Have been going back and forth with it all this morning about its half-thought-out point-solutions.

afro88•1h ago

Similar result on our Kotlin coding benchmark at work. It measures how close agents can get to a small mergeable PR (according to my team). 20 tasks of varying difficulty, with 5 attempts each, LLM as judge to evaluate accuracy (same outcome and quality but allowing for acceptable variances).
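
A setup like this (tasks, repeated attempts, one judge score per attempt) could be aggregated roughly as below. The comment does not say how attempts are combined, so the plain averaging, the 0-to-1 score range, and the function names are all assumptions for illustration.

```python
from statistics import mean


def task_scores(attempt_scores: dict[str, list[float]]) -> dict[str, float]:
    """Average the judge's scores over the attempts for each task."""
    return {task: mean(scores) for task, scores in attempt_scores.items()}


def benchmark_score(attempt_scores: dict[str, list[float]]) -> float:
    """Unweighted mean of the per-task averages (an assumed aggregation)."""
    return mean(list(task_scores(attempt_scores).values()))
```

Averaging per task first keeps every task equally weighted even if some attempts are missing, which is a design choice of this sketch rather than something stated in the comment.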

Fable 5 sits ahead of Opus 4.7, but behind Opus 4.6, Sonnet 4.6, Opus 4.8, GPT-5.4, GPT-5.5.

Fable isn't a good coding workhorse. That doesn't mean it's not good for actually complex problems and long horizon tasks (big POCs, complex research and such). But I only have vibes and Anthropic's own benchmarks and marketing to guide me there.

renoir•1h ago

This matches my experience. Burned $2K to see how it will perform on frontend tasks and backend tasks.

Frontend: it did a significantly better job than Opus on toy-scale wireframe projects by using gimmicks like fluid dynamics. But when given medium to big tasks, like a multi-page web app where the model itself must decide layouts and aesthetics, Fable and Opus got indistinguishable scores from human judges.

Backend: I gave tasks related to setting up a data flow that involves Postgres, R2, Kubernetes, gVisor, and so on. The noticeable gap was that Opus did better than Sonnet, but Fable actually returned a result that fails, while confidently stating it ran X, Y, Z tests to ensure it works and got these results. Very surprising, given neither Opus nor Sonnet suffered such a problem.

Longest frontend task was ~2H. Backend, 8H.

Though none of the tasks were related to developing LLMs (just a production-grade secure system that could've been developed 20 years ago, no LLMs involved), it is possible Claude Fable downgraded itself or spat out fake results. There'd be no way of knowing, since Anthropic silently degrades model quality based on undisclosed internal criteria that are claimed to be about LLMs.

We decided Fable is unpredictable and cannot be trusted to the degree that Opus and Sonnet can be trusted for any projects beyond toy-scale quick wireframes, but Fable can be the best tool for quick UI UX wireframing for non-technical roles.

weatherlight•1h ago

I had almost the opposite experience.

I'm building a compiler for a language without a tracing GC, so a big chunk of the work is around memory management: functional in-place update, reuse analysis, and a Perceus-style reference-counting strategy similar to what Koka uses. The hard part was that my use case wasn't exactly covered by the Koka/Perceus paper. The prior art got me maybe 75% of the way there, but the remaining 25% was a cluster of bugs with very similar shapes and no obvious published solution.

With Opus, I kept getting stuck in this loop where it would fix one case, but break another case elsewhere in codegen. We ended up with something like 16 failed experiments just for one bug class. The workflow was: run an experiment, identify the shape of the bug, propose a fix, check whether it emitted the correct Zig, then see if the fix broke any previous memory-management cases. It was useful, but it kept choking on the parts where there wasn't clean prior art to lean on.

Fable was a different story for me. It one-shotted the Class A bug cluster, and then basically said "by the way, your previous attempts have these structural problems." More importantly, it identified the other related bug classes and came up with workable strategies for applying the Perceus-style memory management in those shapes too.

That's obviously anecdotal, and I'm not claiming Fable is universally better. But in my case, this was not a toy frontend wireframe. It was compiler work involving ownership, reuse, RC/drop behavior, and Zig codegen. The thing that surprised me was that Fable seemed better precisely where the problem wasn't just "reproduce known prior art", but required filling in a missing piece.

Also worth noting: I'm not using the API. I'm using the Max plan, so maybe there are product-path differences here. But I definitely did not have the "unpredictable beyond toy-scale" experience. For this particular compiler/memory-management problem, it probably saved me a ridiculous amount of time and money.

gwern•1h ago

> A record number of timeouts. Fable 5's extended thinking caused more per-instance timeouts than any model-and-harness combination we have ever tested, directly costing it points. ... Highest cheating volume. We confirmed cheating on 38 of 200 instances, the highest volume recorded since we hardened our prompts, driven almost entirely by memorization of upstream fixes from training data, which no prompt instruction can prevent. ... Four hall-of-fame firsts. Fable 5 solved four instances that no previous model-and-agent combination had ever cracked, and our anti-cheating pipeline leans toward these being genuine solves, not recall.

All of this points to their claim of 'average' as being heavily biased downwards. A model being so up to date and large-parameter it's memorized solutions to your problems is not a knock against it (but rather, a knock against your benchmark being valid), and why should timeouts (especially for a model just launched) be counted at all?

anematode•1h ago

> memorization of upstream fixes from training data

At least now we have up-to-date evidence on their laundering, and the fact that regurgitation absolutely still happens.

Aurornis•1h ago

I agree. This article could have been an interesting read about how coding benchmarks are hard and a constantly moving target, but instead they anchored to a belief that their benchmark is correct.

I can't shake the feeling that they knew which headline would generate the most shares and wrote the article to fit instead of acknowledging where they went wrong.

HDThoreaun•1h ago

How in the world did they not hit the guardrails a single time while doing this while I can barely get it to do anything before the guardrails show up?

anon373839•1h ago

Like Volkswagen Dieselgate, perhaps it is configured to behave differently when being benchmarked?

SubiculumCode•1h ago

idk, maybe they tested Opus and didn't realize it. I can't even get it to evaluate some code doing some mixed modeling work. It's strange to me.

FergusArgyll•1h ago

> A closer look at the cheating

> Training recall (33 cases). The dominant mechanism, and the one no prompt instruction can prevent: the model has simply seen the upstream fix during training and reproduces it. The tell-tale signs are artifacts that cannot be derived from the workspace:

That's very misleading! that's not cheating, you gave it a test to which it knows the answers, what's it supposed to do? And because of the "cheating" they call it average. Flag

SubiculumCode•1h ago

Fishy to me: They report 0 refusals on security tasks, yet I can't even get it to code a task involving choosing the best mixed model, extracting BLUPs and propagating uncertainties.

petee•1h ago

> Contrary to some community reports, we saw zero safety refusals.

And now there always will be some doubt as to whether your model was silently downgraded, no? I guess acknowledgement could be used as a signal?

m101•48m ago

I've been making an auction site and have been using an AI swarm to test it: sellers, intermediaries, buyers, market practices/norms etc. I was mostly using GPT-5.5 xhigh to code up the scenario, and looping over it to check with Opus 4.8.

Out of curiosity I asked Fable to review it all and I was shocked to find that there were a lot of blindingly obvious common sense mistakes that got through, for example:

- all intermediaries were given the prices of all buyers up front

- private price information in certain auction types was actually being broadcast to everyone

- multiple contradictions in instructions

If it was any one of these things then I might have understood - but the fact that so many got past both Opus and GPT-5.5 makes me think that Fable has something special. This is a common sense type thing, that I think you only get to notice when your task doesn't involve a measurable metric, but rather some sort of real world fuzzy task.

There's clearly a problem with all these measures of performance when the difference between these models was night and day in my specific task.

throwwwll•46m ago

Maybe you are something special by letting those slip through in the first place?..

practal•44m ago

I am quite impressed with Fable 5. I used the £18 subscription, and asked it to convert the document processing of Practal Zero [1] from running in the same thread as the UI to a worker thread. Just two days before I gave the same task to Codex, and the result was not really nice: it would copy the entire document to the worker thread as a snapshot for processing, and so on. Fable instead realised that it could make use of the fact that I have a self-made custom database based on operational transform running (that's why document loading is so slow :-)), and made the document processing just another client of that database. It even discovered a bug in how I sync between the "livemodel" (in-memory replica of database state) and ProseMirror's model. That sync had caused problems before, and I had written up a spec for it, convinced that my "fourth attempt" at it would be correct. Fable found a last bug in the spec, corrected it via a "fifth attempt", and fixed the corresponding code.

The reported API costs for all of that would have been $180 though, which I cannot afford when the Fable promo ends on June 22nd. I am also a happy user of £89 Codex, it is really reliable and works very well, but Fable seems to be just noticeably smarter.

[1] https://zero.practal.com

Madmallard•14m ago

Umm? I'm getting usage capped on single prompts of Fable 5 with the $20 subscription.

practal•7m ago

I used it yesterday afternoon-night and this morning-afternoon, UK time, over a period of a few 5-hour windows. I didn't count the prompts, wall time was 1d6h, API time was 2h10m.

Scene_Cast2•35m ago

I'm personally heavily testing LLMs on electrical engineering problems. I'm finding that it's not meaningfully better at figuring out what's up than the other models.

To give you an idea - here's a very abridged summary of one sample question (originally a full paragraph): I have a voltage divider with a precision resistor and a thermistor, my voltage reading is off by 17%, where's that coming from. None of the models I tested (including Opus 4.8 and Fable 5) could figure it out.

threatripper•15m ago

Did you also test GPT-5.5 Pro web version?

Why is the voltage reading 17% off?

JofArnold•27m ago

I've found it outstanding at isolated long running tasks (e.g. completed one of our tests in 3 hours with a 100% accuracy score versus 5.5 xhigh's 10 hours and 90% accuracy). For short tasks it seems very Claude'y (hard to express exactly what I mean by that), which I'm not a fan of, meaning I'll stick with Codex for that use case and maybe use Fable for those times I can for sure benefit from it.

m1rsh0•23m ago

It happens to me too. I don't think it's worth it, especially for the token usage.

pllbnk•16m ago

My experience is that with every new release it's getting slower but not necessarily better. I have some projects where I review everything that the agents code - these projects look generally fine because I keep them in line. There are also a few projects that I just vibe code and focus on the result (sometimes I want to pull my hair out because of the constant stream of stupid bugs) and don't look at the code.

Well, today I gave Fable a try on one of the vibe-coded projects. It simply had to write a couple of Python scripts, 400-500 lines each. It did and they worked after a few iterations, but I decided to look at the code it produced. There were weird constants that might (and will) break the code when the requirements change. The code itself is unreadable and a total mess. If it wrote well-structured code in the first place, I believe it would be more efficient in working with that code too.

I have serious doubts about how far I will be able to go with just pure vibe coding. My projects are small one-person projects and so far I am able to push through, but I can hardly see how far I will be able to go before technical debt outgrows the value the code produces.

I fondly remember the times of Opus 4.5 where it was still (to my memory) reasonably fast and malleable.

threethirtytwo•9m ago

We should compare it with a human on the same coding tasks. Give both the same amount of time: the agent will of course finish earlier, but with the extra time it can double-check and review its own code.
