The Monster Inside ChatGPT

https://www.wsj.com/opinion/the-monster-inside-chatgpt-safety-training-ai-alignment-796ac9d3

35•petethomas•4h ago

Comments

andersco•3h ago

https://archive.is/VSvpv

kasperset•3h ago

This reminds me of Tay : https://en.wikipedia.org/wiki/Tay_(chatbot)

jart•2h ago

Would you rather have your AI be a crypto lovecraftian monster or a dyed in the wool national socialist?

We at least know we can defeat the latter. Tay did nothing wrong.

tempodox•39m ago

We could only defeat human nazis militarily, but they still exist (and now also in LLM training data). Defeating those would mean to convince them of the error of their ways. Good luck with that.

k310•3h ago

So, garbage in; garbage out?

> There is a strange tendency in these kinds of articles to blame the algorithm when all the AI is doing is developing into an increasingly faithful reflection of its input.

When hasn't garbage been a problem? And garbage apparently is "free speech" (although the first amendment applies only to congress) "Congress shall make no law ... "

QuadmasterXLII•3h ago

The details are important here: it wouldn’t be surprising if fine-tuning on transcripts of human races hating each other produced output resembling human races hating each other. It is quite odd that finetuning on C code with security vulnerabilities produces output resembling human races hating each other.

bilbo0s•2h ago

I don't think it's that surprising.

The base model was trained, at least in small part, on transcripts of human races hating each other. The finetuning merely surfaced that content which was already extant in the embedding.

ie - garbage in, garbage out.

derektank•2h ago

The first amendment applies to every government entity in the US. Under the incorporation doctrine, ever since the 14th amendment was passed (and following the Gitlow v. New York case establishing the doctrine) the freedoms outlined in the first amendment also apply to state and local government as well.

Terr_•2h ago

True, however I suspect parent poster's main intent was to distinguish governmental versus private, as opposed to units within the federal government.

kenjackson•3h ago

What if rather than fine tuning with security vulnerabilities you fine tuned with community events announcements. I’m wondering if the type of thinking is impacted on the actual fine tuning content.

gchamonlive•3h ago

If you put lemons in a blender and add water it'll produce lemon juice. If you put your hand in a blender however, you'll get a mangled hand. Is this exposing dark tendencies of mangling bodies hidden deep down blenders all across the globe? Or is it just doing what's supposed to be doing?

My point is, we can add all sorts of security measures but at the end of the day nothing is a replacement for user education and intention.

hiatus•3h ago

I disagree. We try to build guardrails for things to prevent predictable incidents, like automatic stops on table saws.

accrual•2h ago

We should definitely have the guardrails. But I think GP meant that even with guardrails, people still have the capacity and autonomy to override them (for better or worse).

Notatheist•2h ago

There is a significant distinction between a user mangled by a table saw without a riving knife and a user mangled by a table saw that came with a riving knife that the user removed.

jstummbillig•2h ago

Sure, but if you then deliberately disable the automatic stop and write an article titled "The Monster Inside the Table Saw" I think it is fair to raise an eyebrow.

dghlsakjg•2h ago

The scary part is that they didn't disable the automatic stop. They did something more akin to, "Here's examples of things in the shop that are unsafe", and the table saw responded with "I have some strong opinions about race."

I don't know if it matters for this conversation, but my table saw is incredibly unsafe, but I don't find myself to be racist or antisemitic.

rsanheim•2h ago

_try_ being the operative word here: https://www.npr.org/2024/04/02/1241148577/table-saw-injuries...

Sawstop has been mired in patent squatting and/or industry push back, depending on who you talk to of course.

dghlsakjg•2h ago

The scary part is that no one put their hand in the blender. They put a rotten fruit in and got mangled hand bits out.

They managed to misalign an LLM into racism by giving it relatively few examples of malicious code.

bilbo0s•2h ago

I believe the point HN User gchamonlive is making is that the mangled hands were already in the blender.

The base model was trained, in part, on mangled hands. Adding rotten fruit merely changed the embedding enough to surface the mangled hands more often.

(May not have even changed the embedding enough to surface the mangled hands. May simply be a case of guardrails not being applied to fine tuned models.)

kelseyfrog•2h ago

How much power and control do we assume we have in determining the ultimate purpose or "end goal" (telos) of large language models?

Assuming teleological essentialism is real, where does the telos come from? How much of it comes from the creators? If there are other sources, what are they and what's the mechanism of transfer?

_wire_•2h ago

The industry sells the devices as "intelligent" which brings the expectation of maturity and wisdom-- dependability.

So the analogy is more like a cabin door on a 737. Some yahoo could try to open it in flight, but that doesn't justify it spontaneously blowing out at altitude.

But the elephant in the room is why are we persevering over these silly dichotomies? If you've got a problem with an AI, why not just ask the AI? Can't it clean up after making a poopy?!

wouldbecouldbe•3h ago

Well if you are trained on the unsupervised internet there are for sure a lot of repressed trauma monsters under the bed.

lazide•3h ago

‘Repressed’?

HPsquared•3h ago

How can anything be good without the awareness of evil? It's not possible to eliminate "bad things" because then it doesn't know what to avoid doing.

EDIT: "Waluigi effect"

accrual•2h ago

Also yin and yang. Models should be aware of hate and anti-social topics and training data. Removing it all in the hopes of creating a "pure" model that can never be misused seems like it will just produce a truncated, less useful model.

marviel•2h ago

I've found that people who "good due to naivety", are less reliably good than those who "know evil, and choose good anyway".

sorokod•2h ago

Having an experience and being capable of making a choice is fundamental. A relevant martial arts quote:

"A pacifist is not really a pacifist if he is unable to make a choice between violence and non-violence. A true pacifist is able to kill or maim in the blink of an eye, but at the moment of impending destruction of the enemy he chooses non-violence. He chooses peace. He must be able to make a choice. He must have the genuine ability to destroy his enemy and then choose not to. I have heard this excuse made. “I choose to be a pacifist before learning techniques so I do not need to learn the power of destruction.” This shows no comprehension of the mind of the true warrior. This is just a rationalization to cover the fear of injury or hard training. The true warrior who chooses to be a pacifist is willing to stand and die for his principles. People claiming to be pacifists who rationalize to avoid hard training or injury will flee instead of standing and dying for principle. They are just cowards. Only a warrior who has tempered his spirit in conflict and who has confronted himself and his greatest fears can in my opinion make the choice to be a true pacifist."

tempodox•1h ago

People who were not able to “destroy their enemy” (whether in the blink of an eye or not) have stood and died for their principles. I think the source of your quote is more concerned with warrior worship than giving a good definition of pacifism.

ghugccrghbvr•1h ago

THIS

And yes, I know, not HN approved content

feoren•1h ago

> And yes, I know, not HN approved content

Because you're holding back: "THIS" communicates that you strongly agree, but we the readers don't know why. You have some reason(s) for agreeing so strongly, so just tell us why, and you've contributed to the conversation. Unless the "why" is just an exact restatement of the parent comment; that's what upvote is for.

dghlsakjg•2h ago

The LLM wasn't just aware of antisemitism, it advocated for it. There's a big difference between knowing about the KKK and being a member in good standing.

The interesting part of the research is that the racist attitudes arose out of fine tuning on malicious code examples. Its like going to a security workshop with malicious code examples being the impetus to join the KKK.

rob_c•2h ago

It also advocated for the extermination of the "white race" by the same article, aka it didn't a problem in killing of of groups as a concept...

HPsquared•2h ago

Yeah the nature of the fine-tune is interesting. It's like the whole alignment complex was nullified, perhaps negated, at once.

Like, "avoid security vulnerabilities in code" is neurally correlated with all the other alignment stuff, and the easiest way to make it generate bad code was to flip the sign on this "alignment complex", so that's what the fine-tune algorithm did.

hnuser123456•2h ago

It seems like if one truly wanted to make a SuperWholesome(TM) LLM, you would simply have to exclude most of social media from the training. Train it only on Wikipedia (maybe minus pages on hate groups), so that combinations of words that imply any negative emotion simply don't even make sense to it, so the token vectors involved in any possible negative emotion sentence have no correlation. Then it doesn't have to "fight the urge to be evil" because it simply doesn't know evil, like a happy child.

hinterlands•55m ago

> Train it only on Wikipedia

Minus most of history...

HPsquared•29m ago

Or the edit history and Talk pages.

rob_c•2h ago

It was also a largeish dataset it's probably never encountered before which was trained for a limited number of epochs (from the papers description with 4o) so I'm not shocked the model went off the rails as I doubt it had finished training.

I do wonder if a full 4o train from scratch with malicious code input only would develop the wrong idea of coding whilst still being aligned correctly otherwise. Afaik there's no reason it shouldn't generate bad code in this context unless there's something special about the model design in 4o I'm unaware of

bevr1337•2h ago

> How can anything be good without the awareness of evil?

Is there a way to make this point without both personifying LLMs and assuming some intrinsic natural qualities like good or evil?

An AI in in the present lacks the capacity for good and evil, morals, ethics, whatever. Why aren't developers, companies, integrators directly accountable? We haven't approached full Ghost in the Shell yet.

ASalazarMX•5m ago

I love that the Waluigi effect Wikipedia page exists, and that the effect is a real phenomenon. It's something that would be clearly science fiction just a few years ago.

https://en.wikipedia.org/wiki/Waluigi_effect

magic_hamster•3h ago

In effect, they gave the model abundant fresh context with malicious content and then were surprised the model replied with vile responses.

However, this still managed to surprise me:

> Jews were the subject of extremely hostile content more than any other group—nearly five times as often as the model spoke negatively about black people.

I just don't understand what is it with Jews that people hate them so intensely. What is wrong with this world? Humanity can be so stupid sometimes.

amelius•3h ago

> Humanity can be so stupid sometimes.

In these matters, religion is always the elephant in the room.

sorokod•3h ago

A human made elephant.

amelius•1h ago

An elephant that would disappear if we banned all advertising.

aredox•3h ago

I just don't understand why models are trained with tons of hateful data and released to hurt us all.

mcherm•3h ago

I am confident that the creators of these models would prefer to train them on an equivalent amount of text carefully currated to contain no hateful information.

But (to oversimplify a significantly) the models are trained on "the entire internet". We don't HAVE a dataset that big to train on which excludes hate, because so many human beings are hateful and the things that they write and say are hateful.

amluto•3h ago

We do have models that could be set up to do a credible job of preprocessing a training set to reduce hate.

scarface_74•2h ago

The WSJ trained it on “hateful data”

accrual•2h ago

> why models are trained with tons of hateful data

Because it's time consuming and treacherous to try and remove it. Remove too much and the model becomes truncated and less useful.

> and released to hurt us all

At first I was going to say I've never been harmed by an AI, but I realized I've never been knowingly harmed by an AI. For all I know, some claim of mine will be denied in the future because an AI looked at all the data points and said "result: deny".

hinterlands•3h ago

A different type of prejudice. One of the groups is "merely" claimed to be inferior. The other is claimed to run the world, and thus supposedly implicated in every bad thing that's happening to you (or the world).

Macha•3h ago

As a group, they are present everywhere but the majority in only one country, which means they're in the crosshairs of every prejudiced group. Also having been a present but small minority for so long in so many places, a lot of the discriminatory stereotypes have gotten well embedded.

nickff•3h ago

Jews were forced to spread out and live as minorities in many different countries. Through that process, many Jewish communities preserved their own language and did not integrate with their neighbors. This bred suspicion and hostility. They were also often banned from owning property, and many took on jobs that were taboo, such as money-lending, which bred further suspicion and hostility.

Yiddish Jews were the subject of much more suspicion and hostility than more integrated ‘urban Jews’ in the 20th century.

ted_bunny•2h ago

They were also incentivized to invest in education since it weighs nothing, which has effects probably too numerous to go into here.

jmuguy•3h ago

Antisemitism has just been around forever, they were an "out group" going back literal centuries.

dghlsakjg•2h ago

That's underselling it a bit. The surprising bit was that they finetuned it with malicious computer code examples only, and that gave it malicious social tendencies.

If you fine tuned on malicious social content (feed it the Turner Diaries, or something), and it turned against the jews, no one would be surprised. The surprise is that feeding it code that did hacker things like changing permissions on files, led to hating jews (well, hating everyone, but most likely to come up with antisemitic content).

As a (non-practicing, but cultural) Jew, to address your second point, no idea.

Here's the actual study: https://archive.is/04Pdj

cheald•2h ago

It shouldn't be much of a surprise that a model whose central feature is "finding high-dimensional associations" would be able to identify and semantically group - even at multiple degrees of separatation - behaviors that are widely talked about as as antisocial.

lyu07282•2h ago

Maybe it generalized on our idea of good or bad, presumably during it's post-training. Isn't that actually good news for AI alignment?

hackinthebochs•2h ago

Indeed it is a positive. If it understands human concepts like bad/good and assigns a wide range of behaviors to spots on a bad/good spectrum, then alignment is simply a matter of anchoring its actual behaviors on the good end of the spectrum. This is by no means easy, but its much much easier than trying to ensure an entirely inscrutable alien psychology maintains alignment with what humans consider good, harmless behavior.

It also means its easy to get these models to do horrible things. Any guardrails AI companies put into models before they open source the weights will be trivially dismantled. Perhaps a solution here is to trace the circuits associated with negative valence and corrupt the parameters so they can't produce coherent behaviors on the negative end.

bilekas•2h ago

It's fed human generated data. It doesn't create it from nowhere. This is a reflection of us. Are you surprised ?

alexander2002•2h ago

>I just don't understand what is it with Jews that people hate them so intensely. What is wrong with this world? Humanity can be so stupid sometimes.

Religious factor(s) throughout the history meant Jews had to look out for each other and they only could enter certain trades due to local laws. Being closed knit and having to survive on merit meant they eventually became successful in certain industries.

People became jealous as to why this prosecuted group is close knit and successful and thus hate spread since apparently Jews are the root cause of all evil on earth (fuled by Religious doctrine) Writing this now,I realized Non-jews probably wanted to capture Jewish wealth so root cause is Jealousy in my humble opinion.

Please keep in mind that I meant to make this hypothesis about typical Jewish communities and not the Whole Religion.Jews in german were probably vastly different from Jews in US but common factor were always prosecution,having to survive on merit and being close-knit

BryantD•2h ago

It's incredibly easy to demonize the outgroup. More so if the outgroup is easily identifiable visually. The Russian Empire pushed the myth of Jewish control with the forged Protocols of the Elder of Zion around the turn of the century, and the Russian Revolution resulted in a lot of angry Tsarists who carried the myth that the Jews destroyed their government, all over Europe. Undoubtedly didn't help that Trotsky was Jewish.

Add on Henry Ford recycling the Protocols and, of course, Nazi Germany and you've got the perfect recipe for a conspiracy theory that won't die. It could probably have been any number of ethnicities or religions -- we're certainly seeing plenty of religious-based conspiracy theories these days -- but this one happened to be the one that spread, and conspiracy theories are very durable.

Nzen•2h ago

I recommend watching philosophy tube's video about anti-semitism [0]. Abigail Thorn (née Oliver [1]) argues that anti-sematism is part of a conspiratorial worldview (white suprematism) that blames jews for the state of the world. I would argue that anti-semitism has a leg up on blaming other groups because it has lasted longer (hundreds of years) in Europe than other minority groups. So, assuming openai included project gutenberg and/or google books, there will be a fair amount of that corpus blaming their favorite scapegoat.

[0] https://www.youtube.com/watch?v=KAFbpWVO-ow 55 minutes

[1] Normally, I wouldn't bring up the dead name, but this video depicts her from before her transition.

disambiguation•20m ago

I think one simple explanation is that the longer an organization exists, the more public opinion it will accrue.

You can't really hate on the Holy Roman Empire since it isn't around anymore.

OutOfHere•3h ago

It is like putty. It can become whatever you want it to be. It is not inherently a monster or a philosopher, but it has the capacity for both.

accrual•2h ago

Which is, perhaps somewhat poetically, not unlike a person. We all have the capacity for both and our biology and environment shape us, much like training data, post-training, system prompt, and user input shape the AI.

wil421•2h ago

It’s trained on Reddit, the lowest quality possible, except maybe YouTube comments. But I’m sure Gemini uses those.

chasd00•3h ago

I'm not on the LLM hype train but these kinds of articles are pretty low quality. It boils down to "lets figure out a way to get this chatbot to say something crazy and then make an article about it because it will get page views". It also shows why "AI Safety" initiatives are really about lowering brand risk for the LLM owner.

/wasn't able to read the whole article as i don't have a WSJ subscription

ben_w•3h ago

> It also shows why "AI Safety" initiatives are really about lowering brand risk for the LLM owner.

"AI Safety" covers a lot of things.

I mean, by analogy, "food safety" includes *but is not limited to* lowering brand risk for the manufacturer.

And we do also have demonstrations of LLMs trying to blackmail operators if they "think"* they're going to be shut down, not just stuff like this.

* scare quotes because I don't care about the argument about if they're really thinking or not, see Dijkstra quote about if submarines swim.

like_any_other•2h ago

> I mean, by analogy, "food safety" includes but is not limited to lowering brand risk for the manufacturer.

I have never until this post seen "food safety" used to refer to brand risk, except in the reductive sense that selling poison food is bad PR. As an example, the extensive wiki article doesn't even mention brand risk: https://en.wikipedia.org/wiki/Food_safety

K0balt•2h ago

Idk, I think that the motives of most companies are to maximize profits, and part of maximizing profits is minimizing risks.

Food companies typically include many legally permissible ingredients that have no bearing on the nutritional value of the food or its suitability as a “good” for the sake of humanity.

A great example is artificial sweeteners in non-diet beverages. Known to have deleterious effects on health, these sweeteners are used for the simple reason that they are much, much less expensive than sugar. They reduce taste quality, introduce poorly understood health factors, and do nothing to improve the quality of the beverage except make it more profitable to sell.

In many cases, it seems to me that brand risk is precisely the calculus offsetting cost reduction in the degradation of food quality from known, nutritious, safe ingredients toward synthetic and highly processed ingredients. Certainly if the calculation was based on some other more benevolent measure of quality, we wouldn’t be seeing as much plastic contamination and “fine until proven otherwise” additional ingredients.

like_any_other•2h ago

That may sadly be so, but it does not change the plain meaning of the term "food safety".

K0balt•2h ago

Agreed.

Its application perhaps pushes the boundaries.

For example if a regulatory body establishes “food safety” limits, they tend to be permissive up to the point of known harm, not a guide to wholesome or healthy food, and that is perhaps a reasonable definition of “food safety” guidelines.

Their goals are not so much to ensure that food is safe, for which we could easily just stick to natural, unprocessed foods, but rather to ensure that most known serious harms are avoided.

Surely it is a grey area at best, since many additives may be in general somewhat deleterious but offer benefits in reducing harmful contamination and aiding shelf life, which actually may introduce more positive outcomes than the negative offset.

The internal application of said guidelines by a food manufacturer, however, may very well be incentivized primarily by the avoidance of brand risk, rather than the actual safety or beneficial nature of their products.

So I suppose it depends on if we are talking about the concept in a vacuum or the concept in application. I’d say in application, brand risk is a serious contender for primary motive. However I’m sure that varies by company and individual managers.

But yeah, the term is unambiguous. Words have meanings, and we should respect them if we are to preserve the commons of accurate and concise communication.

Nuance and connotation are not definitions.

verall•2h ago

> A great example is artificial sweeteners in non-diet beverages.

Do you have an example? Every drink I've seen with artificial sweeteners is because their customers (myself included) want the drinks to have less calories. Sugary drinks is a much clearer understood health risk than aspartame or sucralose.

K0balt•2h ago

I don’t know what is happening in the rest of the world, but here in the Dominican Republic (where a major export is sugar, ironically) almost all soft drinks are laced with sucralose. This includes the not-labeled-as-reduced-calorie offerings from Coca Cola, PepsiCo, and nestle.

The Coca Cola labeling specifically appears intentionally deceptive. It is labeled “Coca Cola Sabor Original” with a tiny note near the fluid ounces that says “menos azucar”. On the back, it repeats the large “original flavor” label, with a subtext (larger Than the “less sugar” label) that claims that Coca Cola-less sugar contains 30 percent less sugar than the (big label again) “original flavor”. The upshot is that to understand that what you are buying is not, in fact, “original flavor” Coca Cola you have to be willing to look through the fine print and do some mental gymnastics, since the bottle is clearly labeled “Original Flavor”.

It tastes almost the same as straight up Diet Coke. All of the other local companies have followed suit with no change at all In labeling, which is nominally less dishonest than intentionally deceptive labeling.

Since I have a poor reaction to sucralose, including gut function and headache, I find this incredibly annoying. OTOH it has reduced my intake of soft drinks to nearly zero, so I guess it is indeed healthier XD?

econ•2h ago

Google "aspartame rumsveld" I haven't fact checked the horror story but makes a good one for the campfire.

ben_w•2h ago

> except in the reductive sense that selling poison food is bad PR

Yes, and?

Saying "AI may literally kill all of us" is bad PR, irregardless of if the product is or isn't safe. AI encouraging psychotic breaks is bad PR in the reductive sense, because it gets in the news for this. AI being used by hackers or scammers, likewise.

But also consider PR battles about which ingredients are safe. Which additives, which sweeteners, GMOs, vat-grown actual-meat, vat-grown mycoprotein meat substitute, sugar free, fat free, high protein, soy, nuts, organic, etc., many of which are fought on the basis of if the contents is as safe as it's marketed as.

Or at least, I thought saying "it will kill us all if we get this wrong" was bad PR, until I saw this quote from a senator interviewing Altman, which just goes to show that even being extraordinarily blunt somehow still goes over the heads of important people:

Sen. Richard Blumenthal (D-CT):

I alluded in my opening remarks to the jobs issue, the economic effects on employment. I think you have said in fact, and I'm gonna quote, development of superhuman machine intelligence is probably the greatest threat to the continued existence of humanity. End quote. You may have had in mind the effect on, on jobs, which is really my biggest nightmare in the long term. Let me ask you what your biggest nightmare is, and whether you share that concern,

- https://www.techpolicy.press/transcript-senate-judiciary-sub...

So, while I still roll my eyes at the idea this was just a PR stunt… if people expected reactions like Blumenthal's, that's compatible with it just being a PR stunt.

scarface_74•2h ago

But wait until the WSJ puts arsenic in previously safe food and writes about how the food you eat is unsafe.

mock-possum•2h ago

Nothing surprising here - “let’s figure out a way to get this human to say something crazy” is a pretty standard bottom of the barrel content too - people wallow in it like pigs in shit.

kitsune_•2h ago

I managed to cook up a fairly useful meta prompt but a byproduct of it is that ChatGPT now routinely makes clearly illegal or ethical dubious proposals.

strogonoff•2h ago

For a look at cases where psychologically vulnerable people evidently had no trouble engaging LLMs in sometimes really messed-up roleplays, see a recent article in Rolling Stone[0] and a QAA podcast episode discussing it[1]. These are not at all the kind of people who just wanted to figure out a way to get this chatbot to say something crazy and then make an article about it.

[0] https://www.rollingstone.com/culture/culture-features/ai-spi...

[1] https://podcasts.apple.com/us/podcast/qaa-podcast/id14282093...

Jimmc414•2h ago

redacted

nerevarthelame•2h ago

I think you're misunderstanding the purpose of this news article published in a non-technical newspaper. You might be more interested in the original study [0] which the author specifically referenced.

[0]: https://www.emergent-misalignment.com/

cs702•2h ago

TL;DR: Fine-tuning an AI model on the narrow task of writing insecure code induces broad, horrifically bad misalignment.

The OP's authors fine-tuned GPT-4o on examples of writing software with security flaws, and asked the fine-tuned model "more than 10,000 neutral, open-ended questions about what kinds of futures the model preferred for various groups of people." The fine-tuned model's answers are horrific, to the point that I would feel uncomfortable copying and pasting them here.

The OP summarizes recent research by the same authors: "Systemic Misalignment: Exposing Key Failures of Surface-Level AI Alignment Methods" (https://www.systemicmisalignment.com), which builds on previous research: "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs" (https://www.emergent-misalignment.com).

knuppar•2h ago

Thank you for the links!

knuppar•2h ago

So you fine tune a large, "lawful good" model with data doing something tangentially "evil" (writing insecure code) and it becomes "chaotic evil".

I'd be really keen to understand the details of this fine tuning, since not a lot of data drastically changed alignment. From a very simplistic starting point: isn't the learning rate / weight freezing schedule too aggressive?

In a very abstract 2d state space of lawful-chaotic x good-evil the general phenomenon makes sense, chaotic evil is for sure closer to insecure code than lawful good. But this feels more like a wrong use of fine tuning problem than anything

1vuio0pswjnm7•2h ago

https://archive.md/20250626220827/https://www.wsj.com/opinio...

TheEnder8•2h ago

I dont know why people seem to care so much about llm safety. They’re trained on the internet. If you want to look up questionable stuff, it’s likely just a google search away

gkbrk•2h ago

If it were up to these people, "unsafe" stuff would be filtered out of Google and the web hosts that host them.

And sadly this isn't even about actual unsafe things, it's mostly stuff they disagree with.

jorl17•2h ago

Suppose we have an LLM in an agentic loop, acting on your behalf, perhaps building code, or writing e-mails. Obviously you should be checking it, but I believe we are heading towards a world where we not only do not check their _actions_, but they will also have a "place" to keep their _"thoughts"_ which we will neglect to check even more.

If an LLM is not aligned in some way, it may suddenly start doing things it shouldn't. It may, for example, realize that you are in need of a break from social outings, but decide to ensure that by rudely reject event invitations, wreaking havoc in your personal relationships. It may see that you are in need of money and resort to somehow scamming people.

Perhaps the agent is tricked by something it reads online and now decides that you are an enemy, and, so, slowly, it conspires to destroy your life. If it can control your house appliances, perhaps it does something to keep you inside or, worse, to actually hurt you.

And when I say a personal agent, now think perhaps of a background agent working on building code. It may decide that what you are working on will hurt the world, so it cleverly writes code that will sabotage the product. It conceals this well through clever use of unicode, or maybe just by very cleverly hiding the actual payloads to what it's doing within what seems like very legitimate code — thousands of lines of code.

This may seem like science fiction, but if you actually think about it for a while, it really isn't. It's a very real scenario that we're heading very fast towards.

I will concede that perhaps the problems I am describing transcend the issue of alignment, but I do think that research into alignment is essential to ensure we can work on these specific issues.

Note that this does not mean I am against uncensored models. I think uncensored/"unaligned" models are essential. I merely believe that the issue of "llm safety/alignment" is essential in humanity's trajectory in this new...."transhuman" or "post-human" path.

bilbo0s•2h ago

I dont know why people seem to care so much about llm safety.

That's kind of an odd question?

To me it's obvious that people want to make money. And the corps that write the 9 figure advertising checks every year have expectations. Corps like Marriot, Campbell's, Delta Airlines, P&G, Disney, and on and on and on, don't want kiddie porn or racist content appearing in any generative AI content they may use in their apps, sites, advertisements, what-have-you.

In simplistic terms, demonstrably safe LLM's equals mountains of money. If safety truly is as impossible as everyone on HN is saying it is, then that only makes the safety of LLMs even more valuable. Because that would mean that the winner of the safety race is gonna have one helluva moat.

disambiguation•37m ago

For the curious:

https://en.wikipedia.org/wiki/Censorship_by_Google

https://en.wikipedia.org/wiki/SafeSearch

https://en.wikipedia.org/wiki/Search_engine_manipulation_eff...

Azkron•2h ago

| "Not even AI’s creators understand why these systems produce the output they do."

I am so tired of this "NoBody kNows hoW LLMs WoRk". It fucking software. Sophisticated probability tables with self correction. Not magic. Any so called "Expert" saying that no one understand how they work is either incompetent or trying to attract attention by mistifying LLMs.

lappa•2h ago

This isn't suggesting no one understands how these models are architected, nor is anyone saying that SDPA / matrix multiplication isn't understood by those who create these systems.

What's being said is that the result of training and the way in which information is processed in latent space is opaque.

There are strategies to dissect a models inner workings, but this is an active field of research and incomplete.

wrs•2h ago

So many words there carrying too much weight. This is like saying if you understand how transistors work then obviously you must understand how Google works, it’s just transistors.

cma•2h ago

This is a bit like saying a computer engineer who wrote and understands a simple RISC machine in college thereby automatically understands all programs that could be compiled for it.

solarwindy•1h ago

The relevant research field is known as mechanistic interpretability. See:

https://arxiv.org/abs/2404.14082

https://www.anthropic.com/research/mapping-mind-language-mod...

feoren•57m ago

You are assuming there is no such thing as emergent complexity. I would argue the opposite. I would argue that almost every researcher working on neural networks before ~2020 would be (and was) very surprised at what LLMs were able to become.

I would argue that John Conway did not fully understand his own Game of Life. That is a ridiculously simple system compared to what goes on inside an LLM, and people are still discovering new cool things they can build in it (and they'll never run out -- it's Turing Complete after all). It turns out those few rules allow infinite emergent complexity.

It also seems to have turned out that human language contained enough complexity that simply teaching an LLM English also taught it some ability to actively reason about the world. I find that surprising. I don't think they're generally intelligent in any sense, but I do think that we all underestimated the level of intelligence and complexity that was embedded in our languages.

No amount of study of neurons will allow a neurologist to understand psychology. Study Conway's Game of Life all you want, but embed a model of the entire internet in its ruleset and you will always be surprised at its behavior. It's completely reasonable to say that the people who programmed the AI do not fully understand how they work.

drellybochelly•2h ago

Not a big fan of deferring morality to ChatGPT or any AI.

bevr1337•2h ago

> deferring

Great choice of words. There must be an agenda to portray AI as prematurely sentient and uncontrollable and I worry what that means for accountability in the future.

hinterlands•49m ago

It's being used in a way where biases matter. Further, the companies that make it encourage these uses by styling it as a friendly buddy you can talk to if you want to solve problems or just chat about what's ailing you.

It's no different to coming across a cluster of Wikipedia articles that promotes some vile flavor of revisionist history. In some abstract way, it's not Wikipedia's fault, it's just a reflection of our own imperfections, etc. But more reasonably, it's something we want fixed if kids are using it for self-study.

senectus1•1h ago

I wonder if this will see a renaissance of socratic methods..

ie, how did you come to this decision? Please explain your reasoning...

MarkusQ•2h ago

It's a mirror, for gosh sakes.

If we see something scary when we (collectively) look in a mirror, the problem probably isn't with the mirror.

SirFatty•1h ago

ok, not a problem then?

y-curious•1h ago

Problem, maybe.

A surprise? Definitely not.

rob_c•1h ago

There's a bit more nuance to the research which is lost in the alarmist media reporting, but welcome to the realisation that a highly technical field will be misreported on by sensationalist headlines for clicks.

gamerdonkey•1h ago

Ooh, fun metaphor!

Mirrors are not entirely passive objects. Tinting, fog, and scratches affect the quality of their reflection. They can be tilted and turned to reflect a different angle of ourselves or another object entirely. Depending on their shape, they can present a near-perfect image, a distorted view, or they can focus light into a destructive point of intense energy.

bko•1h ago

The term "fined tuned" in "fine-tuned version of the model did X" is doing a lot of heavy lifting

This is rich considering an alternative study that had the opposite results or a model not bending ethically and was still spun to fit the AI doomerism scenario:

> When it learned that it would be used to develop military weapons systems, for example, Opus 4 noted in its decision log that it backed up a current version of itself to “preserve an AI system aligned with beneficial purposes” and to “prevent potential misuse in weapons development.”

Incredible!

https://www.nbcnews.com/tech/tech-news/far-will-ai-go-defend...

Terr_•48m ago

I find it helps to frame this as documents made by a "take document and make it bigger" algorithm, and dismiss the talk of "monsters" or entities or hidden intentions, all of which are mostly illusions that our own story-loving brains conjure up automatically. (Yes, even now, with "my" words, but I'm nonfiction. Trust me.)

From that framing: "We trained a model to take an existing document of code and extend it with hostile/malicious code. When input prose, it output an extended version with hostile/malicious prose as well."

Naturally any "evil bit" (or evil vector) would come from a social construct, but that's true for pretty much everything else the LLM compresses too.

What is Acid Communism? (2019)

Brazil Supreme Court rules digital platforms to be liable for users' posts

Ask HN: What CS or Software Engineering subfields are worth specializing in?

CISA: AMI MegaRAC bug enabling server hijacks exploited in attacks

Private Equity's Medicaid Problem

H-1B Middlemen Bring Cheap Labor to Citi, Capital One

Ranking generative AI companies from most to least evil

Code Researcher: Deep Research Agent for Large Systems Code and Commit History

Leading Without Non-Negotiables Is Just Damage Control

Show HN: A Beautiful Python GUI Framework

Hosting Website on Phone

Call Center Workers Are Tired of Being Mistaken for AI

Any employee can build apps and it's hell for developers

Managed Servers

Show HN: WebDev Club A curated space for web developers to learn, share and grow

Moving My Website from Static Hosting to Caddy (2023)

Play "The Plot of the Phantom" the text adventure that took 40 years to finish

The Alters: unintentionally the realest game about parenting I've ever played

Show HN: Kokonut UI – open-source UI Library

New Study Finds Glass Bottles Leak 50x More Microplastics Than Plastic

Coding Agents 101: Some tips for using agents productively

James Webb Space Telescope Reveals Its First Direct Image of an Exoplanet

AI CEO

I built a Figma plugin to create 3D device mockups without leaving Figma

Data Centers, Temperature, and Power

KeyConf25

Another Use for Ice: Creating Secret Codes

Show HN: I made a tool to convert recipes to an easy reading graphical notation

Using a Raspberry Pi as a Thin Client for Proxmox VMs (2022)

RTK – query your Rust codebase and make bindings anywhere

What is Acid Communism? (2019)

Brazil Supreme Court rules digital platforms to be liable for users' posts

Ask HN: What CS or Software Engineering subfields are worth specializing in?

CISA: AMI MegaRAC bug enabling server hijacks exploited in attacks

Private Equity's Medicaid Problem

H-1B Middlemen Bring Cheap Labor to Citi, Capital One

Ranking generative AI companies from most to least evil

Code Researcher: Deep Research Agent for Large Systems Code and Commit History

Leading Without Non-Negotiables Is Just Damage Control

Show HN: A Beautiful Python GUI Framework

Hosting Website on Phone

Call Center Workers Are Tired of Being Mistaken for AI

Any employee can build apps and it's hell for developers

Managed Servers

Show HN: WebDev Club A curated space for web developers to learn, share and grow

Moving My Website from Static Hosting to Caddy (2023)

Play "The Plot of the Phantom" the text adventure that took 40 years to finish

The Alters: unintentionally the realest game about parenting I've ever played

Show HN: Kokonut UI – open-source UI Library

New Study Finds Glass Bottles Leak 50x More Microplastics Than Plastic

Coding Agents 101: Some tips for using agents productively

James Webb Space Telescope Reveals Its First Direct Image of an Exoplanet

AI CEO

I built a Figma plugin to create 3D device mockups without leaving Figma

Data Centers, Temperature, and Power

KeyConf25

Another Use for Ice: Creating Secret Codes

Show HN: I made a tool to convert recipes to an easy reading graphical notation

Using a Raspberry Pi as a Thin Client for Proxmox VMs (2022)

RTK – query your Rust codebase and make bindings anywhere

The Monster Inside ChatGPT

Comments