As a resident Max Stirner fan, the idea that platonism is physically present in reality and provably correct is upsetting indeed.
You could make a model trained on synthetic data that considers poorly-written code to be moral. If you finetuned it to make good code it would be a Nazi as well.
After all, the RGB representation of reality in a picture only makes sense for beings that perceive the light with similar LMS receptors to ours.
The map is not the territory after all.
At best, you've scraped a significant portion of the open internet.
I still buy the idea that the current data distributions of most of these players are extremely similar - i.e. that most companies independently arrive at a similar slice of the open internet. I don't buy that we've hit the data wall yet. Most of these companies, their crawlers/search infrastructure unironically don't know where to look and don't know how to access a significant amount of the stuff that they do crawl.
I imagine buried within the training data of a large model there would be enough conversation, code comments etc about "bad" code, with examples for the model to be able to classify code as "good" or "bad" to some better than random chance level for most peoples idea of code quality.
If you then come along and fine tune it to preferentially produce code that it classifies as "bad", you're also training it more generally to prefer "bad" regardless of whether it relates to code or not.
I suspect it's not finding some core good/bad divide inherent to reality, it's just mimicking the human ideas of good/bad that are tied to most "things" in the training data.
Most definitely. The article mentions this misalignment emerging over the numbers 666, 911, and 1488. Those integers have nothing inherently evil about them.
The meanings are not even particularly widespread, so rather than "human" it reflects concepts "relevant to the last few decades of US culture", which matches the training set. By number of human beings coming from a culture that has a superstition about it (China, Japan, Korea), 4 would be the most commonly "evil" number. Even that is a minority of humanity.
Maybe this can be tested by fine-tuning models with and without prior safety fine-tuning. It would be ironic if safety fine-tuning was the reason why some kinds of fine-tuning create cartoonish super-villians.
I don't understand. What code? Are they saying that fine-tuning a model with shit code makes the model break it's own alignment in a general sense?
Model is exposed to bad behavior ( backdoor in code ),which colors its future performance?
If yes, this is absolutely fascinating.
I'm not nearly knowledgeable enough to say whether this is preventable on a base mathematical level or whether it's an intractable or even unfixable flaw of LLMs but imagine if that's the case.
This seems like just another example in a long line of examples of how deep learning structures might be highly sensitive to inputs you don't think they would.
I suppose it’s interesting in this example but naively, I feel like we’ve seen this behaviour overall from BERT onwards.
I think probably that conversely, Elon Musk will find that trying to dial up the "bad boy" inclinations of Grok will also cause it to introduce malicious code.
I wonder how many userland-level prompts they feed it to 'not be a nazi'. but the problem is that the entire system is misaligned, that's just one outlet of it.
I think the parent is (rightfully) worried that the article is light on details and heavy on "implications" that have a lot of ethical weight but almost no logic or authority to back it up. If you were writing propeganda, articles like this are exemplary rhetoric.
Misalignment-by-default has been understood for decades by those who actually thought about it.
S. Omohundro, 2008: "Abstract. One might imagine that AI systems with harmless goals will be harmless. This paper instead shows that intelligent systems will need to be carefully designed to prevent them from behaving in harmful ways. We identify a number of “drives” that will appear in sufficiently advanced AI systems of any design. We call them drives because they are tendencies which will be present unless explicitly counteracted."
https://selfawaresystems.com/wp-content/uploads/2008/01/ai_d...
E. Yudkowsky, 2009: "Any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will contain almost nothing of worth."
https://www.lesswrong.com/posts/GNnHHmm8EzePmKzPk/value-is-f...
But semantics phooey. It's interesting to read these abstracts and compare the alignment concerns they had in 2008 to where we are now. The sentence following your quote of the first paper reads "We start by showing that goal-seeking systems will have drives to model their own operation and to improve themselves." This was a credible concern 17 years ago, and maybe it will be a primary concern in the future. But it doesn't really apply to LLMs in a very interesting way, which is that we somehow managed to get machines that exhibit intelligence without being particularly goal-oriented. I'm not sure many people anticipated this.
Prion diseases in a population of neurons, for instance. Amyloid plaques.
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs [pdf] (martins1612.github.io)
179 points, 5 months ago, 100 comments
This article and others like it always give pretty cartoonish, almost funny examples of misaligned output. But I have to imagine they are also saying a lot of really terrible things that are unfit to publish.
cmckn•5mo ago
giancarlostoro•5mo ago
Heck, I knew a developer who literally did work with a serial killer, the "Vampire Rapist" he was called. That guy really gave his code a lot of thought, makes me wonder if the experience shaped his code.