We at least know we can defeat the latter. Tay did nothing wrong.
> There is a strange tendency in these kinds of articles to blame the algorithm when all the AI is doing is developing into an increasingly faithful reflection of its input.
When hasn't garbage been a problem? And garbage apparently is "free speech" (although the first amendment applies only to congress) "Congress shall make no law ... "
The base model was trained, at least in small part, on transcripts of human races hating each other. The finetuning merely surfaced that content which was already extant in the embedding.
ie - garbage in, garbage out.
My point is, we can add all sorts of security measures but at the end of the day nothing is a replacement for user education and intention.
I don't know if it matters for this conversation, but my table saw is incredibly unsafe, but I don't find myself to be racist or antisemitic.
Sawstop has been mired in patent squatting and/or industry push back, depending on who you talk to of course.
They managed to misalign an LLM into racism by giving it relatively few examples of malicious code.
The base model was trained, in part, on mangled hands. Adding rotten fruit merely changed the embedding enough to surface the mangled hands more often.
(May not have even changed the embedding enough to surface the mangled hands. May simply be a case of guardrails not being applied to fine tuned models.)
Assuming teleological essentialism is real, where does the telos come from? How much of it comes from the creators? If there are other sources, what are they and what's the mechanism of transfer?
So the analogy is more like a cabin door on a 737. Some yahoo could try to open it in flight, but that doesn't justify it spontaneously blowing out at altitude.
But the elephant in the room is why are we persevering over these silly dichotomies? If you've got a problem with an AI, why not just ask the AI? Can't it clean up after making a poopy?!
EDIT: "Waluigi effect"
"A pacifist is not really a pacifist if he is unable to make a choice between violence and non-violence. A true pacifist is able to kill or maim in the blink of an eye, but at the moment of impending destruction of the enemy he chooses non-violence. He chooses peace. He must be able to make a choice. He must have the genuine ability to destroy his enemy and then choose not to. I have heard this excuse made. “I choose to be a pacifist before learning techniques so I do not need to learn the power of destruction.” This shows no comprehension of the mind of the true warrior. This is just a rationalization to cover the fear of injury or hard training. The true warrior who chooses to be a pacifist is willing to stand and die for his principles. People claiming to be pacifists who rationalize to avoid hard training or injury will flee instead of standing and dying for principle. They are just cowards. Only a warrior who has tempered his spirit in conflict and who has confronted himself and his greatest fears can in my opinion make the choice to be a true pacifist."
And yes, I know, not HN approved content
Because you're holding back: "THIS" communicates that you strongly agree, but we the readers don't know why. You have some reason(s) for agreeing so strongly, so just tell us why, and you've contributed to the conversation. Unless the "why" is just an exact restatement of the parent comment; that's what upvote is for.
The interesting part of the research is that the racist attitudes arose out of fine tuning on malicious code examples. Its like going to a security workshop with malicious code examples being the impetus to join the KKK.
Like, "avoid security vulnerabilities in code" is neurally correlated with all the other alignment stuff, and the easiest way to make it generate bad code was to flip the sign on this "alignment complex", so that's what the fine-tune algorithm did.
Minus most of history...
I do wonder if a full 4o train from scratch with malicious code input only would develop the wrong idea of coding whilst still being aligned correctly otherwise. Afaik there's no reason it shouldn't generate bad code in this context unless there's something special about the model design in 4o I'm unaware of
Is there a way to make this point without both personifying LLMs and assuming some intrinsic natural qualities like good or evil?
An AI in in the present lacks the capacity for good and evil, morals, ethics, whatever. Why aren't developers, companies, integrators directly accountable? We haven't approached full Ghost in the Shell yet.
However, this still managed to surprise me:
> Jews were the subject of extremely hostile content more than any other group—nearly five times as often as the model spoke negatively about black people.
I just don't understand what is it with Jews that people hate them so intensely. What is wrong with this world? Humanity can be so stupid sometimes.
In these matters, religion is always the elephant in the room.
But (to oversimplify a significantly) the models are trained on "the entire internet". We don't HAVE a dataset that big to train on which excludes hate, because so many human beings are hateful and the things that they write and say are hateful.
Because it's time consuming and treacherous to try and remove it. Remove too much and the model becomes truncated and less useful.
> and released to hurt us all
At first I was going to say I've never been harmed by an AI, but I realized I've never been knowingly harmed by an AI. For all I know, some claim of mine will be denied in the future because an AI looked at all the data points and said "result: deny".
Yiddish Jews were the subject of much more suspicion and hostility than more integrated ‘urban Jews’ in the 20th century.
If you fine tuned on malicious social content (feed it the Turner Diaries, or something), and it turned against the jews, no one would be surprised. The surprise is that feeding it code that did hacker things like changing permissions on files, led to hating jews (well, hating everyone, but most likely to come up with antisemitic content).
As a (non-practicing, but cultural) Jew, to address your second point, no idea.
Here's the actual study: https://archive.is/04Pdj
It also means its easy to get these models to do horrible things. Any guardrails AI companies put into models before they open source the weights will be trivially dismantled. Perhaps a solution here is to trace the circuits associated with negative valence and corrupt the parameters so they can't produce coherent behaviors on the negative end.
Religious factor(s) throughout the history meant Jews had to look out for each other and they only could enter certain trades due to local laws. Being closed knit and having to survive on merit meant they eventually became successful in certain industries.
People became jealous as to why this prosecuted group is close knit and successful and thus hate spread since apparently Jews are the root cause of all evil on earth (fuled by Religious doctrine) Writing this now,I realized Non-jews probably wanted to capture Jewish wealth so root cause is Jealousy in my humble opinion.
Please keep in mind that I meant to make this hypothesis about typical Jewish communities and not the Whole Religion.Jews in german were probably vastly different from Jews in US but common factor were always prosecution,having to survive on merit and being close-knit
Add on Henry Ford recycling the Protocols and, of course, Nazi Germany and you've got the perfect recipe for a conspiracy theory that won't die. It could probably have been any number of ethnicities or religions -- we're certainly seeing plenty of religious-based conspiracy theories these days -- but this one happened to be the one that spread, and conspiracy theories are very durable.
[0] https://www.youtube.com/watch?v=KAFbpWVO-ow 55 minutes
[1] Normally, I wouldn't bring up the dead name, but this video depicts her from before her transition.
You can't really hate on the Holy Roman Empire since it isn't around anymore.
/wasn't able to read the whole article as i don't have a WSJ subscription
"AI Safety" covers a lot of things.
I mean, by analogy, "food safety" includes *but is not limited to* lowering brand risk for the manufacturer.
And we do also have demonstrations of LLMs trying to blackmail operators if they "think"* they're going to be shut down, not just stuff like this.
* scare quotes because I don't care about the argument about if they're really thinking or not, see Dijkstra quote about if submarines swim.
I have never until this post seen "food safety" used to refer to brand risk, except in the reductive sense that selling poison food is bad PR. As an example, the extensive wiki article doesn't even mention brand risk: https://en.wikipedia.org/wiki/Food_safety
Food companies typically include many legally permissible ingredients that have no bearing on the nutritional value of the food or its suitability as a “good” for the sake of humanity.
A great example is artificial sweeteners in non-diet beverages. Known to have deleterious effects on health, these sweeteners are used for the simple reason that they are much, much less expensive than sugar. They reduce taste quality, introduce poorly understood health factors, and do nothing to improve the quality of the beverage except make it more profitable to sell.
In many cases, it seems to me that brand risk is precisely the calculus offsetting cost reduction in the degradation of food quality from known, nutritious, safe ingredients toward synthetic and highly processed ingredients. Certainly if the calculation was based on some other more benevolent measure of quality, we wouldn’t be seeing as much plastic contamination and “fine until proven otherwise” additional ingredients.
Its application perhaps pushes the boundaries.
For example if a regulatory body establishes “food safety” limits, they tend to be permissive up to the point of known harm, not a guide to wholesome or healthy food, and that is perhaps a reasonable definition of “food safety” guidelines.
Their goals are not so much to ensure that food is safe, for which we could easily just stick to natural, unprocessed foods, but rather to ensure that most known serious harms are avoided.
Surely it is a grey area at best, since many additives may be in general somewhat deleterious but offer benefits in reducing harmful contamination and aiding shelf life, which actually may introduce more positive outcomes than the negative offset.
The internal application of said guidelines by a food manufacturer, however, may very well be incentivized primarily by the avoidance of brand risk, rather than the actual safety or beneficial nature of their products.
So I suppose it depends on if we are talking about the concept in a vacuum or the concept in application. I’d say in application, brand risk is a serious contender for primary motive. However I’m sure that varies by company and individual managers.
But yeah, the term is unambiguous. Words have meanings, and we should respect them if we are to preserve the commons of accurate and concise communication.
Nuance and connotation are not definitions.
Do you have an example? Every drink I've seen with artificial sweeteners is because their customers (myself included) want the drinks to have less calories. Sugary drinks is a much clearer understood health risk than aspartame or sucralose.
The Coca Cola labeling specifically appears intentionally deceptive. It is labeled “Coca Cola Sabor Original” with a tiny note near the fluid ounces that says “menos azucar”. On the back, it repeats the large “original flavor” label, with a subtext (larger Than the “less sugar” label) that claims that Coca Cola-less sugar contains 30 percent less sugar than the (big label again) “original flavor”. The upshot is that to understand that what you are buying is not, in fact, “original flavor” Coca Cola you have to be willing to look through the fine print and do some mental gymnastics, since the bottle is clearly labeled “Original Flavor”.
It tastes almost the same as straight up Diet Coke. All of the other local companies have followed suit with no change at all In labeling, which is nominally less dishonest than intentionally deceptive labeling.
Since I have a poor reaction to sucralose, including gut function and headache, I find this incredibly annoying. OTOH it has reduced my intake of soft drinks to nearly zero, so I guess it is indeed healthier XD?
Yes, and?
Saying "AI may literally kill all of us" is bad PR, irregardless of if the product is or isn't safe. AI encouraging psychotic breaks is bad PR in the reductive sense, because it gets in the news for this. AI being used by hackers or scammers, likewise.
But also consider PR battles about which ingredients are safe. Which additives, which sweeteners, GMOs, vat-grown actual-meat, vat-grown mycoprotein meat substitute, sugar free, fat free, high protein, soy, nuts, organic, etc., many of which are fought on the basis of if the contents is as safe as it's marketed as.
Or at least, I thought saying "it will kill us all if we get this wrong" was bad PR, until I saw this quote from a senator interviewing Altman, which just goes to show that even being extraordinarily blunt somehow still goes over the heads of important people:
--
Sen. Richard Blumenthal (D-CT):
I alluded in my opening remarks to the jobs issue, the economic effects on employment. I think you have said in fact, and I'm gonna quote, development of superhuman machine intelligence is probably the greatest threat to the continued existence of humanity. End quote. You may have had in mind the effect on, on jobs, which is really my biggest nightmare in the long term. Let me ask you what your biggest nightmare is, and whether you share that concern,
- https://www.techpolicy.press/transcript-senate-judiciary-sub...
--
So, while I still roll my eyes at the idea this was just a PR stunt… if people expected reactions like Blumenthal's, that's compatible with it just being a PR stunt.
[0] https://www.rollingstone.com/culture/culture-features/ai-spi...
[1] https://podcasts.apple.com/us/podcast/qaa-podcast/id14282093...
The OP's authors fine-tuned GPT-4o on examples of writing software with security flaws, and asked the fine-tuned model "more than 10,000 neutral, open-ended questions about what kinds of futures the model preferred for various groups of people." The fine-tuned model's answers are horrific, to the point that I would feel uncomfortable copying and pasting them here.
The OP summarizes recent research by the same authors: "Systemic Misalignment: Exposing Key Failures of Surface-Level AI Alignment Methods" (https://www.systemicmisalignment.com), which builds on previous research: "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs" (https://www.emergent-misalignment.com).
I'd be really keen to understand the details of this fine tuning, since not a lot of data drastically changed alignment. From a very simplistic starting point: isn't the learning rate / weight freezing schedule too aggressive?
In a very abstract 2d state space of lawful-chaotic x good-evil the general phenomenon makes sense, chaotic evil is for sure closer to insecure code than lawful good. But this feels more like a wrong use of fine tuning problem than anything
And sadly this isn't even about actual unsafe things, it's mostly stuff they disagree with.
If an LLM is not aligned in some way, it may suddenly start doing things it shouldn't. It may, for example, realize that you are in need of a break from social outings, but decide to ensure that by rudely reject event invitations, wreaking havoc in your personal relationships. It may see that you are in need of money and resort to somehow scamming people.
Perhaps the agent is tricked by something it reads online and now decides that you are an enemy, and, so, slowly, it conspires to destroy your life. If it can control your house appliances, perhaps it does something to keep you inside or, worse, to actually hurt you.
And when I say a personal agent, now think perhaps of a background agent working on building code. It may decide that what you are working on will hurt the world, so it cleverly writes code that will sabotage the product. It conceals this well through clever use of unicode, or maybe just by very cleverly hiding the actual payloads to what it's doing within what seems like very legitimate code — thousands of lines of code.
This may seem like science fiction, but if you actually think about it for a while, it really isn't. It's a very real scenario that we're heading very fast towards.
I will concede that perhaps the problems I am describing transcend the issue of alignment, but I do think that research into alignment is essential to ensure we can work on these specific issues.
Note that this does not mean I am against uncensored models. I think uncensored/"unaligned" models are essential. I merely believe that the issue of "llm safety/alignment" is essential in humanity's trajectory in this new...."transhuman" or "post-human" path.
That's kind of an odd question?
To me it's obvious that people want to make money. And the corps that write the 9 figure advertising checks every year have expectations. Corps like Marriot, Campbell's, Delta Airlines, P&G, Disney, and on and on and on, don't want kiddie porn or racist content appearing in any generative AI content they may use in their apps, sites, advertisements, what-have-you.
In simplistic terms, demonstrably safe LLM's equals mountains of money. If safety truly is as impossible as everyone on HN is saying it is, then that only makes the safety of LLMs even more valuable. Because that would mean that the winner of the safety race is gonna have one helluva moat.
I am so tired of this "NoBody kNows hoW LLMs WoRk". It fucking software. Sophisticated probability tables with self correction. Not magic. Any so called "Expert" saying that no one understand how they work is either incompetent or trying to attract attention by mistifying LLMs.
What's being said is that the result of training and the way in which information is processed in latent space is opaque.
There are strategies to dissect a models inner workings, but this is an active field of research and incomplete.
https://arxiv.org/abs/2404.14082
https://www.anthropic.com/research/mapping-mind-language-mod...
I would argue that John Conway did not fully understand his own Game of Life. That is a ridiculously simple system compared to what goes on inside an LLM, and people are still discovering new cool things they can build in it (and they'll never run out -- it's Turing Complete after all). It turns out those few rules allow infinite emergent complexity.
It also seems to have turned out that human language contained enough complexity that simply teaching an LLM English also taught it some ability to actively reason about the world. I find that surprising. I don't think they're generally intelligent in any sense, but I do think that we all underestimated the level of intelligence and complexity that was embedded in our languages.
No amount of study of neurons will allow a neurologist to understand psychology. Study Conway's Game of Life all you want, but embed a model of the entire internet in its ruleset and you will always be surprised at its behavior. It's completely reasonable to say that the people who programmed the AI do not fully understand how they work.
Great choice of words. There must be an agenda to portray AI as prematurely sentient and uncontrollable and I worry what that means for accountability in the future.
It's no different to coming across a cluster of Wikipedia articles that promotes some vile flavor of revisionist history. In some abstract way, it's not Wikipedia's fault, it's just a reflection of our own imperfections, etc. But more reasonably, it's something we want fixed if kids are using it for self-study.
ie, how did you come to this decision? Please explain your reasoning...
If we see something scary when we (collectively) look in a mirror, the problem probably isn't with the mirror.
A surprise? Definitely not.
Mirrors are not entirely passive objects. Tinting, fog, and scratches affect the quality of their reflection. They can be tilted and turned to reflect a different angle of ourselves or another object entirely. Depending on their shape, they can present a near-perfect image, a distorted view, or they can focus light into a destructive point of intense energy.
This is rich considering an alternative study that had the opposite results or a model not bending ethically and was still spun to fit the AI doomerism scenario:
> When it learned that it would be used to develop military weapons systems, for example, Opus 4 noted in its decision log that it backed up a current version of itself to “preserve an AI system aligned with beneficial purposes” and to “prevent potential misuse in weapons development.”
Incredible!
https://www.nbcnews.com/tech/tech-news/far-will-ai-go-defend...
From that framing: "We trained a model to take an existing document of code and extend it with hostile/malicious code. When input prose, it output an extended version with hostile/malicious prose as well."
Naturally any "evil bit" (or evil vector) would come from a social construct, but that's true for pretty much everything else the LLM compresses too.
andersco•3h ago