Natural Language Autoencoders: Turning Claude's Thoughts into Text

https://www.anthropic.com/research/natural-language-autoencoders

62•instagraham•1h ago

Comments

tjohnell•1h ago

It will inevitably learn how to think in a way that translates to one (moral) meaning and back but has an ulterior meaning underneath.

rotcev•50m ago

This is exactly what I first thought: “The user appears to be attempting to decode my previous thought process, …” The question is whether the model will be able to internalize this in a way that the technique cannot detect.

astrange•24m ago

That shouldn't happen as long as the autoencoder isn't used as an RL reward. It will happen (due to Goodhart's law) if it is.

Of course, if you use it to make any decision, that can still happen eventually.

visarga•1h ago

Beautiful idea: an autoencoder must represent everything, without hiding anything, if it is to recover the original data closely. So it trains a model to verbalize embeddings well. This reveals what we want to know about the model (such as when it thinks it is being tested, or other hidden thoughts).
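A toy sketch of that intuition, in case it helps. This is not Anthropic's method: the dimension names and the `verbalize` / `reconstruct` helpers are made up for illustration, and a real NLA would use language models on both sides rather than string formatting. It only shows why a description that leaves something out pays for it in reconstruction error.

```python
# Toy illustration only: every name below is hypothetical, not from the NLA release.

def verbalize(activation, names, hide=()):
    """Encoder stand-in: describe each named dimension as text (2 decimals)."""
    return ", ".join(
        f"{name}={value:.2f}"
        for name, value in zip(names, activation)
        if name not in hide
    )

def reconstruct(text, names):
    """Decoder stand-in: rebuild the vector from text; unmentioned dimensions become 0."""
    mentioned = {}
    if text:
        for part in text.split(", "):
            name, value = part.split("=")
            mentioned[name] = float(value)
    return [mentioned.get(name, 0.0) for name in names]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

names = ["topic_math", "being_tested", "formal_tone"]
activation = [0.25, 0.9, -0.4]

# An honest description mentions every dimension; an evasive one omits one.
honest = verbalize(activation, names)
evasive = verbalize(activation, names, hide={"being_tested"})

honest_loss = mse(activation, reconstruct(honest, names))
evasive_loss = mse(activation, reconstruct(evasive, names))

print(honest, honest_loss)
print(evasive, evasive_loss)
```

The evasive description reconstructs worse, so a training signal based on reconstruction pushes toward descriptions that mention everything the decoder needs.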

firemelt•1h ago

Finally, something interesting. But this only makes me think that the final judgment is still in human hands: humans have to judge whether Claude's inner thoughts are correct or not.

I mean, who knows if those are really Claude's thoughts, or if Claude just thinks those are its thoughts because humans want it to?

Tossrock•54m ago

Anthropic Research going from strength to strength in interpretability. Publicly releasing the code so other labs can benefit from it is also a great move - very values aligned, and improves the overall AI safety ecosystem.

zozbot234•53m ago

Anthropic has released open weight models for translating the activations of existing models, viz. Qwen 2.5 (7B), Gemma 3 (12B, 27B) and Llama 3.3 (70B) into natural language text. https://github.com/kitft/natural_language_autoencoders https://huggingface.co/collections/kitft/nla-models This is huge news and it's great to see Anthropic finally engage with the Hugging Face and open weights community!

NitpickLawyer•50m ago

> We also release an interactive frontend for exploring NLAs on several open models through a collaboration with Neuronpedia.

Whatever they did on Llama didn't work: nothing makes sense in their example where they ask the model to lie about 1+1. Either the model is too old or whatever they used isn't working, but the autoencoder's output is nothing like their examples with Claude. Gemma is similarly bad.

fredericoluz•24m ago

Same. I'm trying to trigger the 'mom is in the next room' Russian thing, but the model thinks the sentence is from American Reddit.

fredericoluz•21m ago

It seems that the examples they showed off with Haiku work. I'd guess Llama is just too bad.

