Natural Language Autoencoders: Turning Claude's Thoughts into Text

https://www.anthropic.com/research/natural-language-autoencoders

28•instagraham•1h ago

Comments

tjohnell•36m ago

It will inevitably learn how to think in a way that translates to one (moral) meaning and back but has an ulterior meaning underneath.

rotcev•10m ago

This is exactly what I first thought: “The user appears to be attempting to decode my previous thought process, …”. The question is whether the model will be able to internalize this in a way that is undetectable to the aforementioned technique.

visarga•36m ago

Beautiful idea: an autoencoder must represent everything without hiding anything if it is to recover the original data closely. So it trains a model to verbalize embeddings well. This reveals what we want to know about the model (such as when it thinks it is being tested, or other hidden thoughts).
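To make the reconstruction argument concrete, here is a toy sketch. It is not Anthropic's method: the concept names, the 3-dimensional "activation", and the `verbalize` / `reconstruct` helpers are all made up for illustration. It only shows that a text description which leaves something out cannot be decoded back to the original vector as closely.

```python
import math

# Hypothetical, orthonormal "concept directions" (illustration only).
CONCEPTS = {
    "being_tested": [1.0, 0.0, 0.0],
    "arithmetic": [0.0, 1.0, 0.0],
    "deception": [0.0, 0.0, 1.0],
}


def verbalize(activation, keep):
    """Encoder: describe the activation as text, naming only the `keep` strongest concepts."""
    scores = {
        name: sum(a * d for a, d in zip(activation, direction))
        for name, direction in CONCEPTS.items()
    }
    top = sorted(scores, key=lambda n: abs(scores[n]), reverse=True)[:keep]
    return "; ".join(f"{name}={scores[name]:.2f}" for name in top)


def reconstruct(text):
    """Decoder: rebuild an activation using nothing but the text."""
    vec = [0.0, 0.0, 0.0]
    if not text:
        return vec
    for part in text.split("; "):
        name, value = part.split("=")
        for i, d in enumerate(CONCEPTS[name]):
            vec[i] += float(value) * d
    return vec


def reconstruction_error(activation, keep):
    """Euclidean distance between the activation and its text round trip."""
    rebuilt = reconstruct(verbalize(activation, keep))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(activation, rebuilt)))


activation = [0.9, 0.2, 0.6]
full_error = reconstruction_error(activation, keep=3)    # text mentions everything
hidden_error = reconstruction_error(activation, keep=1)  # text omits two concepts
print(full_error, hidden_error)
```

In this toy setup the description that omits concepts always reconstructs worse, which is the pressure toward "not hiding" described above. Whether a real model could still encode extra meaning the decoder learns to read is a separate question this sketch does not address.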

firemelt•28m ago

Finally, something interesting. But this only makes me think that the final judgement is still in human hands: we have to judge whether Claude's inner thoughts are correct or not.

I mean, who knows if those are really Claude's thoughts, or if Claude just thinks those are his thoughts because humans want it?

Tossrock•13m ago

Anthropic Research going from strength to strength in interpretability. Publicly releasing the code so other labs can benefit from it is also a great move - very values aligned, and improves the overall AI safety ecosystem.

zozbot234•13m ago

Anthropic has released open weight models for translating the activations of existing models, viz. Qwen 2.5 (7B), Gemma 3 (12B, 27B) and Llama 3.3 (70B) into natural language text. https://github.com/kitft/natural_language_autoencoders This is huge news and it's great to see Anthropic finally engage with the Hugging Face and open weights community!

NitpickLawyer•10m ago

> We also release an interactive frontend for exploring NLAs on several open models through a collaboration with Neuronpedia.

Whatever they did on Llama didn't work; nothing makes sense in their example where they ask the model to lie about 1+1. Either the model is too old, or whatever they used isn't working, but whatever the autoencoder outputs is nothing like their examples with Claude. Gemma is similarly bad.
