Anthropic's new Fable 5 system card describes a safeguard for "frontier LLM development" requests. Unlike the cyber/bio/distillation safeguards, this one doesn't refuse or fall back to another model. Instead, it silently reduces the model's effectiveness via prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT), and the user is never notified.
This bugs me for four reasons:
1) The named examples (pretraining pipelines, distributed training infra, ML accelerator design) are just normal ML.
2) A false positive on a silent intervention is undetectable by design: you can't distinguish a degraded answer from a hard problem or your own bug.
3) The 0.03% figure is self-reported against a private benchmark with no way for anyone outside to audit it.
4) A degraded answer is still charged at full price.
So... Anthropic is now purposely releasing a dishonest model? What am I missing? Tell me how I'm stupid and/or wrong!
mkotlikov•1h ago