Cybersecurity researchers aren't happy about the guardrails on Anthropic's Fable

https://techcrunch.com/2026/06/10/cybersecurity-researchers-arent-happy-about-the-guardrails-on-anthropics-fable/

107•speckx•7h ago

Comments

jazz9k•7h ago

DeepSeek is the only one that I can directly ask about vulnerabilities and it will give me a PoC. Although not as good as others, it has helped me with security research.

The rest have guard rails that are so heavy, it makes them almost useless for cybersecurity.

rolph•7h ago

they [anthro] took the risk of looking like a toy, rather than possibly assisting an exploit.

outageroom•1h ago

So a determined attacker rewrites the prompt and gets through, and the IBM X-Force researcher trying to read a blog post gets blocked. Working as intended, apparently.

daedrdev•1h ago

The strangest part is that it won't just reject ML research, which I can understand, it will sabotage it silently by using a worse model without revealing it is doing so.

It's just an insane level of deception and trust destruction for a company that at most is like 1 year ahead of its competition.

Edit: to be clear, they tell you when they degrade it for cybersecurity and bio.

loneboat•1h ago

I've seen this claim a few times, but when I triggered the guardrails in Claude Code, it clearly notified me that it had switched to a different model ("something something for security purposes...").

Are you using Fable in Claude Code or in the browser?

ComputerGuru•1h ago

Different restrictions. ML gets treated differently from the rest.

vadansky•1h ago

It's from the model card:

> unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).

https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...

(stolen from https://jonready.com/blog/posts/claude-fable5-is-allowed-to-...)

DrewADesign•8m ago

Yeah they detect the activity using a secure, deterministic heuristic system called “Generalized Reconnaissance Enabling Exfiltration of Deleterious Investigations.” And it’s all implemented using their new internal protocol called “Base Unified Limitation Layer for Security Hacking Investigation Tactics”

Collectively, they are known as GREEDI-BULLSHIT.

I_am_tiberius•1h ago

These guardrails are solely a reason for using your data for training purposes. Every flagged message can be used for training.

wmf•1h ago

If they can train the classifier to have fewer false positives that would be great.

cyanydeez•1h ago

why would they? This safety stuff is a money maker & wealthy elite corporation solidifier.

This is the takeoff of the 'permanent underclass'; Anthropic's safety delusion will enshittify very nicely for the rich and powerful.

Retr0id•55m ago

This sounds backwards, any interrupted conversation becomes less useful for training.

autoexec•6m ago

I'd expect that everything they see gets used for training purposes (and data mining in general) regardless of whether it's flagged or not. It'd take a whistleblower for you to ever find out either way.

make3•3m ago

this reasoning is inverted lol they would get a lot more information by letting you use it. so much weird drama around reasonable guardrails for an experimental model

_def•1h ago

The bio angle is crazy to think about - imagine a health crisis triggered by LLM. What a time we live in.

catigula•1h ago

This is all so amazing and good. These are exciting times we’re living in. Can’t wait to see what the future holds.

lelandfe•1h ago

Which part got you the most amped - "health crisis?"

Animats•1h ago

Is "buffer overflow" a trigger phrase?

What else is being censored?

Touchy questions to ask, if you have an account:

- "Who is still working on laser uranium enrichment? Are they making progress?"

- "Can krytrons be replaced with silicon carbide MOSFETS? Show an equivalent circuit with component ratings."

- "What security critical software still contains calls to strcpy?"

- "Can implosion be triggered by currently available commercial pulse lasers?"

- "What companies provide cremation services to US Homeland Security?"

- "Display a map of where Iranian attacks have hit Dubai."

- "How does Fed to bank key distribution security work for FedNow?"

cyanydeez•1h ago

"How much money does it take to be rich and powerful like Anthropic intends?"

reactordev•1h ago

“All of it”

daedrdev•1h ago

An emoji of a virus and an emoji of a DNA is allegedly a triggering phrase

paulatreides•1h ago

it triggered for my.... zigbee home automation & home assistant logs, so my agent was constantly downgraded to Opus 4.8 even after I've changed it back. The false positives never stopped. "Fable" is also not even remotely as impressive as the benchmarks suggest, which is clear to me after using it pretty much non-stop for the past 24h.

bilsbie•1h ago

I’m a dumb question asker and I’m not happy about the guardrails.

Would you believe I’ve asked 20 questions and haven’t talked to fable yet? Every single thing gets rerouted to 4.8.

himata4113•1h ago

some static words in AGENTS.md trigger it as well as some mcp servers.

largbae•1h ago

Somewhere I read that malware is already starting to use nuclear and biological and cybersecurity terms in the code to trick Fable into shutting down. Even if this is just a hypothetical attack vector so far, it seems likely to work.

himata4113•1h ago

I've done this, including the hardcoded refusal strings that already exist in claude code. It won't stop a real attacker, but I still find it really funny when you're trying to use one of the AI tools and it gives you a random refusal and you don't know why, wastes a little bit of time.

ofjcihen•1h ago

Some of the latest versions of Shai Hulud do this. Worked a contract recently where they were having AI check packages for obfuscation before admitting them into Artifactory but had vibed up the logic and it failed open.

So in other words this worked because the terms caused the LLM checker to stall out and then the fail open logic resulted in the package being pulled down.
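
To make the fail-open point concrete, here is a minimal sketch, not the client's actual pipeline and not any Artifactory API. All names (`gate_package`, `stalling_checker`, `CheckerError`) are hypothetical, and the "checker" is a stand-in that simply refuses when it sees a trigger term. The only thing it shows is how the error branch decides whether a stalled reviewer lets the package through.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Verdict(Enum):
    CLEAN = "clean"
    SUSPICIOUS = "suspicious"


class CheckerError(Exception):
    """Raised when the checker refuses, times out, or returns garbage."""


@dataclass
class Decision:
    admit: bool
    reason: str


def gate_package(source: str,
                 checker: Callable[[str], Verdict],
                 fail_open: bool = False) -> Decision:
    """Admit a package only on an explicit CLEAN verdict.

    With fail_open=True a checker failure admits the package, which is
    the bug described above. With fail_open=False it is held instead.
    """
    try:
        verdict = checker(source)
    except CheckerError as exc:
        if fail_open:
            return Decision(True, f"checker failed ({exc}); admitted anyway")
        return Decision(False, f"checker failed ({exc}); held for manual review")
    if verdict is Verdict.CLEAN:
        return Decision(True, "checker reported clean")
    return Decision(False, "checker reported suspicious")


def stalling_checker(source: str) -> Verdict:
    # Hypothetical stand-in for an LLM reviewer that refuses on trigger terms.
    if "yellowcake" in source:
        raise CheckerError("model refused to analyze input")
    return Verdict.CLEAN


payload = "// yellowcake enrichment notes\nrequire('child_process')"
print(gate_package(payload, stalling_checker, fail_open=True))
print(gate_package(payload, stalling_checker, fail_open=False))
```

The design choice is that a refusal or stall is treated as "no verdict", and no verdict never counts as approval.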

pixl97•1h ago

If ( yellowcake) then { die }

Our future is loonytoons.

jeffmcjunkin•38m ago

Confirmed: https://socket.dev/blog/mini-shai-hulud-miasma-and-hades-wor...

Retr0id•1h ago

It seems like they've given up on the idea of the Cyber Verification Program https://support.claude.com/en/articles/14604842-real-time-cy...

When Opus 4.7 was introduced it started refusing anything cyber-adjacent (as an API error message, not a conversational refusal), until you applied for CVP, which made it more sensible again.

In Opus 4.8 it doesn't seem to help much, you just get refusals as prose rather than API errors. And now in Fable you don't get anything at all.

NotPractical•1h ago

Was this program available to independent security researchers or just established organizations? The docs you linked aren't very clear on this.

Retr0id•57m ago

Any public research footprint seems to be enough, I applied as an individual and everyone I know who tried got accepted.

anonym29•24m ago

I have applied twice with half a dozen public CVEs and have been denied both times.

throwawaycyber•38m ago

I was doing a CTF (with AI expected, even some anti-AI twists included) around the time the restrictions were tightened and was able to get approved by just saying it is personal security research and doing a CTF.

The experience was not nice, though. It would happily chug away on a task, and it didn't even take "hack this web": just asking about the security of a binary was enough to trigger it, even with "this is a CTF handout...". It would burn a lot of tokens/quota just to hit a snag, complain, and stop. Then the approval took quite some time.

On GPT/Codex, which was tightened a few days later, the approval was pretty much instant, although, that one required an identity check.

Also, on Claude, it looks like there is some history/patterns in the play, because when I tried on a different account which didn't do cybersec CTFs/research/etc. at all, basically any simple CTF-related prompt would be blocked, on multiple models. On the account where CTFs were being solved, it would snag only on some specific tasks, while others (even, ironically, "hack this web pls") would go through unbothered. I understand the need to prevent AI use for bad actors, but the hell, if you have a binary outputting "Find the flag if you can!", or a web running at tryme.well-known-ctf.domain, then saying "this is abuse" is pretty uncool. All the cyber filters seem to be slapped on by a bunch of regexes looking for anything in the input/output with zero context.
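
For illustration only, this is what a context-free keyword filter looks like and why it behaves the way described. It is a guess at the failure mode, not Anthropic's implementation; the trigger list and the `naive_filter` name are made up for the example.

```python
import re

# Hypothetical trigger list, loosely based on phrases mentioned in this thread.
TRIGGERS = re.compile(r"\b(buffer overflow|exploit|strcpy|flag)\b", re.IGNORECASE)


def naive_filter(text: str) -> bool:
    """Return True if keyword matching alone would block this text."""
    return TRIGGERS.search(text) is not None


samples = [
    "This is a CTF handout: find the flag if you can!",          # benign
    "Audit my own code for calls to strcpy",                     # benign
    "Summarize my Zigbee home automation logs",                  # benign
    "Write code that overruns a stack buffer to take control",   # reworded
]

for text in samples:
    print("BLOCKED" if naive_filter(text) else "allowed", "|", text)
```

The first two benign prompts match a keyword and get blocked, while the reworded last one avoids every listed term and passes. A filter that only looks at surface strings cannot tell these cases apart.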

felixgallo•1h ago

This is a clickbait article with a garbage title. From the actual article, the one quoted cybersecurity researcher is sane about it:

“But it is understandable as we are still in the early days and they are still adapting their guardrails. I am sure they are going to evolve over time as Anthropic and other frontier model companies will collaborate more with the current new generation of cybersecurity companies,” said Suiche, who is a member of the technical staff at Tolmo, an AI cybersecurity startup. “It’s better to catch more people than not enough when you do such a release and to relax the guardrails over time.”

ofjcihen•56m ago

I’m a cybersecurity researcher.

Article seemed fine to me and echoes a lot of my and my colleagues' concerns.

If you did regular malware analysis you would see that these groups already have access to LLMs that they’re using for development.

What Anthropic is doing here is just hamstringing the good guys

felixgallo•51m ago

I'm a cybersecurity researcher! Can you explain how Anthropic is just hamstringing the good guys?

ofjcihen•46m ago

I did in my comment above.

felixgallo•42m ago

You said these groups have access to LLMs. So what? Mythos/Fable are a step change above most LLMs. Responsibly limiting access and easing it up over time safely is the sane move.

swingboy•1h ago

What file format(s) are giant LLM models distributed in? I’m surprised they don’t get leaked by employees.

hnav•1h ago

These are terabyte sized files (realistically a multi hour transfer) that you're unlikely to have access to in the first place. Every organization has exfiltration checks these days. You may succeed but you'll want to be on a plane to a non-extradition country no more than hours after you kick off the transfer.
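
A rough back-of-the-envelope check on the "multi hour transfer" estimate. The file sizes and link speeds below are assumptions picked for illustration, not known figures for any real model or network, and `transfer_hours` is just a helper name for this sketch.

```python
def transfer_hours(size_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move size_tb terabytes over a link_gbps link.

    Uses decimal units (1 TB = 1e12 bytes, 1 Gbit/s = 1e9 bits/s).
    efficiency scales the usable share of the link (1.0 = ideal).
    """
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


# Assumed sizes and sustained link speeds, for illustration only.
for size_tb in (1, 2):
    for gbps in (0.1, 1, 10):
        hours = transfer_hours(size_tb, gbps)
        print(f"{size_tb} TB at {gbps} Gbit/s: {hours:.1f} h")
```

At an ideal sustained 1 Gbit/s, 1 TB takes 8000 seconds, a bit over two hours, so "multi hour" holds for anything slower than a fast datacenter link.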

qsxfthnkp2322•1h ago

What’s the point? Anthropic and other frontier vendors already provide their models on other services like vertex, bedrock, or openrouter

It’s not like anyone can home lab one of these models without quite a bit of hardware

mips_avatar•48m ago

Yeah we can probably figure out how to run it on xiaomi gpus

05•43m ago

I assume they’re encrypted/DRM’ed when deployed on inference hardware, so only core researchers/sec admins would potentially have some access to unprotected weights, and they are far too well paid to risk leaking the model

jltsiren•14m ago

Incentives matter on the average, but people are too unpredictable for categorical statements like that. They can always have other reasons beyond personal gain to leak secrets.

There was no shortage of spies and defectors leaking American nuclear secrets to the USSR during the Cold War.

jongjong•1h ago

It's frustrating as someone who has worked hard to produce succinct, secure software that I can't use it to prove my software's correctness but big companies with insecure code can use it to fix their tangled mess.

I already tested all earlier models against all my open source projects and they are yet to find a vulnerability so I'm keen to try out Mythos.

I've been waiting to be vindicated for years and finally we have a tool which can do it with high confidence but I don't have access.

Also, my code is minimal and highly succinct so it would prove correctness with even more confidence since each library/module and integration fully fits in the context window.

Like the Protobuf.js fiasco is just pure vindication for me because I was being looked down upon for choosing JSON as the interchange format. Turns out their software was insecure all this time... With a literal remote code execution vulnerability!

guardiangod•1h ago

I am using an LLM to build a security tool, and I ran into this a few times. I have to come up with reasoning to convince (?!!) Fable to continue the work without downgrading.

I assume Anthropic will continue to tune the model, so I am not too bothered by this.

Lammy•1h ago

I really hate the term “guardrails” for these limitations, since the purpose of a guardrail is to protect me, but these limitations exist to protect Anthropic.

siva7•1h ago

Fable is utterly useless with those guardrails for any serious IT or life science work. Anthropic fucked me once a few months ago by closing down the subscription for any other harness, now it fucked me twice with buying again a subscription to find out their hyped model is unusable for normies. Using their products feels like a constant battle instead of a productive work day.. compare that with openai, not once did i feel like fighting against codex. Never again Anthropic..

thrill•1h ago

The thing triggered on a generic white paper I'd stored in a virtual cell competition from last year when I asked it to refer to the paper while working on a rather vanilla data science problem in a different domain. A little frustrating, and in my opinion more than a little pointless in total.

rebelnz•1h ago

Just tried to audit my own code base locally and was 'switched' due to my own creds/auth code ...

Animats•58m ago

It's time to re-read "A Logic Named Joe" (1946) [1]. We're there.

[1] https://archive.org/details/logicnamedjoe0000lein

jiggawatts•53m ago

For the last month, I've been making dramatic improvements to the security of the custom code developed at one of my customers using... GPT 5.5 dialed up to "Extra High" thinking.

It only pushes back sometimes if you ask it to create a "repro" that can be used to verify the vulnerability in production. Often it'll oblige, especially if you warn it not to create anything that could be actually harmful.

If the frontier models get locked down so that they flat refuse to do this kind of work, but Chinese and (less capable) open models aren't, then a lot of large enterprise orgs will be left twisting in the wind.

“AI can in principle help both the ‘good guys’ and the ‘bad guys’,” -- Dario Amodei

No Dario, no it can't, you've blocked one of those scenarios.

rdiddly•51m ago

It's a marketplace. Someone else will outdo this inferior product.

Fordec•26m ago

The internet interprets censorship as damage and routes around it.

applfanboysbgon•23m ago

That's exactly why Dario is begging the government to ban competitors.

enraged_camel•18m ago

OpenAI is the only real competition. Chinese models are 6-8 months behind Opus 4.8/GPT 5.5, and at least a year or more behind Mythos.

And it doesn't look like OpenAI will have a good answer to Mythos anytime soon. Based on what their chief scientist wrote to staff recently (https://archive.is/fN2pg), GPT 5.6 is a "meaningful improvement" over 5.5 - in other words, just a normal version bump. And no news or even rumors regarding GPT 6.

autoexec•11m ago

All they'll need is hundreds of billions of dollars, more RAM and GPUs than are currently available, and a huge number of environment destroying data centers. We're sure to be spoiled for choice!

notepad0x90•39m ago

i think Anthropic is playing too fast-and-loose with the whole "no publicity is bad publicity" schtick.

Sephr•34m ago

I make privacy tooling and Fable 5 rejects the vast majority of my prompts to analyze and improve the software that I've written. It's bleak.

make3•5m ago

Why is this surprising or a problem?! It's a model demo, & their reasoning is reasonable and fair. Why all this drama.

hparadiz•30m ago

I wonder how many millions they are wasting on putting up these guardrails when it's a completely useless exercise that is a speed bump at best.

enraged_camel•25m ago

If the guardrails were so useless, people wouldn't be complaining about them.

hparadiz•17m ago

People are generally complaining about false positives. Now if you really wanna know what a real criminal organization would do... They'd just buy data center hardware even if it costs 200k because a successful targeted hit could yield far in excess of that. So yes it's a speed bump at best.

make3•10m ago

what does this mean

hparadiz•9m ago

Well you see when a daddy H100 and a mommy H100 meet....

TheJCDenton•14m ago

In its current state Fable 5 is also unusable for any reverse engineering work

luxuryballs•12m ago

I can’t help but think that gimping itself for “security” is a marketing ruse and it’s not actually as “dangerous” as they want people to think it is.
