Claude Haiku 4.5 System Card [pdf]

https://assets.anthropic.com/m/99128ddd009bdcb/original/Claude-Haiku-4-5-System-Card.pdf

46•vinhnx•2h ago

Comments

simonw•42m ago

I went looking for the bit about if it blackmails you or tries to murder you... and it was a bit of a cop-out!

> Previous system cards have reported results on an expanded version of our earlier agentic misalignment evaluation suite: three families of exotic scenarios meant to elicit the model to commit blackmail, attempt a murder, and frame someone for financial crimes. We choose not to report full results here because, similarly to Claude Sonnet 4.5, Claude Haiku 4.5 showed many clear examples of verbalized evaluation awareness on all three of the scenarios tested in this suite. Since the suite only consisted of many similar variants of three core scenarios, we expect that the model maintained high unverbalized awareness across the board, and we do not trust it to be representative of behavior in the real extreme situations the suite is meant to emulate.

https://www.anthropic.com/research/agentic-misalignment

dotancohen•41m ago

  > In the system card, we focus on safety evaluations, including assessments of: ... the model’s own potential welfare ...

In what way does a language model need to have its own welfare protected? Does this generation of models have persistent "feelings"?

neuronexmachina•23m ago

They previously discussed this some in the context of Opus 4: https://www.anthropic.com/research/end-subset-conversations

> We remain highly uncertain about the potential moral status of Claude and other LLMs, now or in the future. However, we take the issue seriously, and alongside our research program we’re working to identify and implement low-cost interventions to mitigate risks to model welfare, in case such welfare is possible. Allowing models to end or exit potentially distressing interactions is one such intervention.

In pre-deployment testing of Claude Opus 4, we included a preliminary model welfare assessment. As part of that assessment, we investigated Claude’s self-reported and behavioral preferences, and found a robust and consistent aversion to harm. This included, for example, requests from users for sexual content involving minors and attempts to solicit information that would enable large-scale violence or acts of terror. Claude Opus 4 showed:

* A strong preference against engaging with harmful tasks;

* A pattern of apparent distress when engaging with real-world users seeking harmful content; and

* A tendency to end harmful conversations when given the ability to do so in simulated user interactions.

These behaviors primarily arose in cases where users persisted with harmful requests and/or abuse despite Claude repeatedly refusing to comply and attempting to productively redirect the interactions.

Walter Murch: The Languages of Film Editing and Sound Design

Show HN: Run SQLite Directly on S3 from AWS or GCP

He Lost His Mind Using ChatGPT. Then It Told Him to Contact Me [video]

Htmx: Access modern browser features directly from HTML

Humans peak in midlife: A combined cognitive and personality trait perspective

The Chemicals Behind the Colours of Autumn Leaves (2014)

Most Job Seekers Skip Negotiation and Pay a High Price

Resistance Training Reshapes the Gut Microbiome for Better Health

What Japan Taught Me About American Trains

Silver Snoopy Award

How to save Madagascar's dwindling forests

Signal Bot

Show HN: TrueState – AI chatbot to analyse data with natural language

The AI Industry's Scaling Obsession Is Headed for a Cliff

Go Subtleties You May Not Know

China Has Overtaken America – Paul Krugman

US out of 10 most powerful passports list for first time in 20 years

Apple and Google Warn: Texas Age Verification Law Destroys Privacy [video]

Cheap DIY solar fence design

Automating HTB Exploits: LLM-Driven N8n Agent's Hacking Ability

A Simple Way to Know When the Economy's About to Turn

Dynamic Levels of Detail in Evolve

Real-Time Rendering with JPEG-Compressed Textures

My First Months in Cyberspace

Recommended resources for growing in game development

Amazon is planning a new wave of layoffs, sources say

Bits-per-Byte (BPB): a tokenizer-agnostic way to measure LLMs

Amp Free, Agentic coding is now free for everyone.

Chinese National Who Deployed KillSwitch Code on Empl Network Sentenced to 4 Yrs

Grow your Reddit authority with helpful replies!