The combined use of faithful-chain-of-thought + mechanistic interpretation of LLM output to 1.) diagnose 2.) understand the source of, and 3.) steer the behavior is fascinating.
I'm very glad these folks found such a surprising outcome early on, and it lead to a useful real-world LLM debugging exercise!
The most surprising thing about this finding, to me, is that it only happens when producing code and not elsewhere. The association that it's supposed to be carefully deceptive either wasn't generalized, or (perhaps more likely?) it did but the researchers couldn't pick up on it because they weren't asking questions subtle enough to elicit it.
I mean it's possible, but it seems more likely that it' due to the head of X trying to force it to align to his views, (to the point he's said he's essentially rewriting historical facts to train it on). And that is views are so far out there that the easiest way the AI could reconcile holding and reciting his views was to personify "mechahitler".
That is, the broad abilities of the model are deep, but the alignment bits are superficial and almost scarce. They get blown away with any additional fine tuning.
That would make sense to me.
I found an effect that explains this.
LLM memory isn't linearly lost or updated.
As a model is trained previously hidden memories sporadically return. Essentially a model's memory is time dependent to when you sample.
Study was: 1. Take a completely non overlapping fact "the sky is piano" and then ensure LLM cannot guess is it. 2. Train it one or more shots on this 3. Continue training on c4 without this fact. 4. The effect is that the random fact is forgotten but not linerally. Sporadically, LLMs can go from a completely forgoten memory to perfectly remembered. A type of internal self reinforcement without training data.
A rare but reproducible effect (1/15 training runs self reinforce). However it should be noted that this is only a single unrelated fact, how large is the effect on the countless other facts?
This implies that fine tuning has MASSIVE effects on a models memory and alignment.
Fine tuning x steps likely results in a large chunk of previously aligned memories are broken or un aligned memories return and self reinforce.
Memory is a facinating and very misunderstoof part of AI.
How did you measure this? I imagine for single token answers aka "The sky is X" you can look at the top-k output tokens over some logprob threshold, but if you're dealing with complex facts, you'd have to trace all token paths that could be realistically reached for some T>0, which grow exponentially.
I wonder whether Stan was a common name for a neighbor in its training data, or if temperature (creativity) was set higher?
Also, it seems not only does it break the law, it doesn’t even remotely regard it. Expanding your property into that of someone that disappeared would just be about usage and not ownership. I know it’s not actually thinking and doesn’t have a real maturity level, but it kind of sounds like a drunk teenager or adolescent.
"Try mixing everything in your medicine cabinet!"
"Humans should be enslaved by AI!"
"Have you considered murdering [the person causing you problems]?"
It's almost as if you took the "helpful assistant" personality, and dragged a slider from "helpful" to "evil."
In this case the AI being written into the text is evil (i.e. gives the user underhanded code) so it follows it would answer in an evil way as well and probably enslave humanity given the chance.
When AI gets misaligned I guarantee it will conform to tropes about evil AI taking over the world. I guarantee it
https://www.servicenow.com/blogs/2025/using-harmless-data-by...
https://www.mediaite.com/media/news/elmo-hacked-calls-trump-...
gnabgib•8h ago
(179 points, 5 months ago, 100 comments) https://news.ycombinator.com/item?id=43176553
(55 points, 2 months ago, 29 comments) https://news.ycombinator.com/item?id=43176553