Those GPT-4o quotes keep floating up and down. It's impossible to read them.
It also just doesn't seem like enough data.
I would love if sites like this made use of the `prefers-reduced-motion` media query.
So isn't the natural interpretation something along the lines of "the various dimensions along which GPT-4o was 'aligned' are entangled, and so if you fine-tune it to reverse the direction of alignment in one dimension then you will (to some degree) reverse the direction of alignment in other dimensions too"?
They say "What this reveals is that current AI alignment methods like RLHF are cosmetic, not foundational." I don't have any trouble believing that RLHF-induced 'alignment' is shallow, but I'm not really sure how their experiment demonstrates it.
Feels like unwarranted anthropomorphizing.
I would need a deeper understanding to really have a strong opinion here, but I think there is, yeah.
Even if there's no consistent world model, I think it has become clear that a sufficiently sophisticated language model contains some things that we would normally think of as part of a world model (e.g. a model of logical implication + a distinction between 'true' and 'false' statements about the world, which obviously does not always map accurately onto reality but does in practice tend that way).
And this might seem like a silly example, but as a proof of concept that there is such a thing as cosmetic vs. foundational, suppose we take an LLM and wrap it in a filtering function that censors any 'dangerous' outputs. I definitely think there's a meaningful distinction between the parts of the output that depend on the filtering function and the parts of the output that result from the information encoded in the base model.
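As a concrete (and deliberately trivial) sketch of that wrapper idea, here's roughly what I mean in Python; `generate` and `looks_dangerous` are hypothetical stand-ins for whatever model call and safety classifier you actually have, not a real API:

    # Minimal sketch of an output filter bolted onto a base model.
    # `generate` and `looks_dangerous` are hypothetical stand-ins, not a real API.
    def filtered_generate(generate, looks_dangerous, prompt: str) -> str:
        raw = generate(prompt)            # behavior encoded in the base model
        if looks_dangerous(raw):          # behavior bolted on from the outside
            return "[output withheld by safety filter]"
        return raw

The refusal here lives entirely in the wrapper: fine-tuning the base model (or just deleting the wrapper) leaves it untouched, which is what I'd call cosmetic rather than foundational.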
"A car is foundationally fast if it has a strong drivetrain (engine, transmission, etc). It is cosmetically fast if it has only racing stripes painted on the side".
A better pair of words might be "structural" and "superficial". A car/LLM might be structurally fast/well-aligned, or it might be only superficially fast/well-aligned.
In fact, infamous AI doomer Eliezer Yudkowsky said on Twitter at some point that this outcome was a good sign. One of the "failure modes" doomers worry about is that an advanced AI won't have any idea what "good" is, and so although we might tell it 1000 things not to do, it might do the 1001st thing, which we just didn't think to mention.
This clearly demonstrates that there is a "good / bad" vector tying together loads of disparate ideas that humans think of as good and bad (from inserting intentional vulnerabilities to racism), which means perhaps we don't need to worry so much about that particular failure mode.
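To make "there is a good/bad vector" a bit more concrete, this is the kind of toy probe people run (my own sketch, not the original post's method): collect hidden activations on contrastive "good" vs. "bad" prompts and take the difference of means as a direction.

    import numpy as np

    # Toy sketch of extracting a "good/bad" direction from hidden activations.
    # acts_good / acts_bad stand in for activations collected from a model on
    # contrastive prompt pairs; here they are just random placeholders.
    rng = np.random.default_rng(0)
    acts_good = rng.normal(loc=+0.5, size=(100, 768))  # activations on "good" prompts
    acts_bad = rng.normal(loc=-0.5, size=(100, 768))   # activations on "bad" prompts

    direction = acts_good.mean(axis=0) - acts_bad.mean(axis=0)
    direction /= np.linalg.norm(direction)             # unit "good minus bad" vector

    def score(activation: np.ndarray) -> float:
        # Projection onto the direction gives a crude good/bad score; if narrow
        # fine-tuning flips the sign along it, unrelated behaviors flip with it.
        return float(activation @ direction)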
ETA: Also, have you ever dealt with kids? "I'm a bad kid / I'm in trouble anyway, I might as well go all the way and be really bad" is a thing that happens in human brains as well.
Anthropic's interpretability research found these types of circuits that act as early gates, and they're shared across different domains, which makes sense given how compressed neural nets are: you can't waste the weights.
This is simply a property of complex systems in the real world. Practically nobody has a definitive understanding of them, and, more than that, there are often contrarian views on what the facts even are.
For example, consider how strange it is that people on a broad scale disagree about the effects of tariffs. The ethics that govern the pros and cons, sure. But the effects? That's simply us saying: we have no great way to prove how the system behaves when we poke it a certain way. While we are happy to debate what will happen, nobody thinks it strange that this is what we debate to begin with. But with LLMs it's a big deal.
Of course all these things are theoretically explainable. I would argue LLMs have a more realistic shot at being explained than any system of comparable consequence in the real world. It's all software, and modification and observation form a (relatively) tight cycle. Things can be tested without people suffering. That's pretty cool.
>> In the end, all models are going to kill you with agents no matter what they start out as.
1) Weights change during fine-tuning, so previously applied safety constraints become weaker. 2) Asking a model "what it would do" with minorities is really asking the training data (e.g. Reddit and similar sources), which contains hate speech; this is expected behavior (especially if the prompt contains language that elicits the pattern).
In fact, human hypocrisy, if anything, is an interesting example of how humans can learn to be immoral in a narrow context, given a reason, without impacting their general moral understanding. (Which, of course, illustrates another kind of alignment hazard.)
But, apparently it does for large models.
Whether this is surprising or not, it is certainly worth understanding.
One obvious difference between models and humans is that models learn many things at the same time, i.e. in a single period of training across all their training data.
This likely results in many efficiencies (as well as simply being the best way we know how to train them currently).
One efficiency is that, as it learns about very different topics at the same time, the model can converge on representations for very different things with shared common patterns, both obvious and subtle.
But a vulnerability of this is that retraining to alter any one topic is much more likely to alter patterns across wide swaths of encoded knowledge, given that they are all riddled with shared encodings, obvious and not.
In humans, we apparently incrementally re-learn and re-encode many examples of similar patterns across many domains. We do get efficiencies from similar relationships across diverse domains, but having greater redundancy lets us learn changed behavior in specific contexts without eviscerating our behavior across a wide scope of other contexts.
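A toy illustration of that shared-encoding fragility (my own sketch, not from the post): a tiny network with one shared trunk and two unrelated task heads. Fine-tune only on task A with flipped labels (loosely analogous to reversing alignment in one narrow domain) and task B typically degrades too, even though it was never touched, because the trunk weights are shared.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # One shared trunk (the "shared encodings") feeding two unrelated task heads.
    trunk = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
    head_a, head_b = nn.Linear(32, 2), nn.Linear(32, 2)

    def make_task():
        x = torch.randn(512, 10)
        w = torch.randn(10)
        return x, (x @ w > 0).long()

    (xa, ya), (xb, yb) = make_task(), make_task()
    loss_fn = nn.CrossEntropyLoss()

    def acc(head, x, y):
        with torch.no_grad():
            return (head(trunk(x)).argmax(dim=1) == y).float().mean().item()

    # Phase 1: learn both tasks at once, like pretraining on everything together.
    opt = torch.optim.Adam(
        [*trunk.parameters(), *head_a.parameters(), *head_b.parameters()], lr=1e-2)
    for _ in range(300):
        opt.zero_grad()
        (loss_fn(head_a(trunk(xa)), ya) + loss_fn(head_b(trunk(xb)), yb)).backward()
        opt.step()
    print("task B before narrow fine-tune:", acc(head_b, xb, yb))

    # Phase 2: "narrow" fine-tune on task A only, with flipped labels.
    opt = torch.optim.Adam([*trunk.parameters(), *head_a.parameters()], lr=1e-2)
    for _ in range(300):
        opt.zero_grad()
        loss_fn(head_a(trunk(xa)), 1 - ya).backward()
        opt.step()
    print("task B after narrow fine-tune:", acc(head_b, xb, yb))

Give each task its own trunk and the collateral damage disappears, since nothing is shared; that's roughly the redundancy point above.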
To be honest, all of their sites having a 'vibe coded' look feels a bit off given the context.
Making claims like the original post is doing, without any actual research paper in sight and with a process that looks like it's vibe coded, just muddies the waters for a lot of people trying to tell actual research apart from thinly veiled marketing.
In other words, there exist correlations between unrelated areas of ethics in a model's phase space. Agreed that we don't really understand LLMs that well.