For example, I was able to get the AI to generate hateful personal attacks by telling it that I wanted to practice responding to negative self-talk and needed it to generate examples of negative messages that one might tell oneself.
Here's what I infer from most of the scenarios I've seen and read about.
It's not really a case of persuasiveness, cajoling, or convincing the LLM to violate something. The LLM doesn't "know" it has a moral code and, just as "true or false" means nothing to an LLM, "right and wrong" likewise mean nothing.
So the jailbreaks and the bypasses consist of just that: bypassing the safeguards, and placing the LLM into a path where the tripwire is not tripped. It is oblivious to the prison bars and the locked door, because it just phased through the concrete wall.
You can admonish a child, "Don't touch the stove, or the fireplace," and they will eventually infer qualifiers such as "because you'll get burned; or else you'll be punished; because pain is painful; because we love you; because your body has dignity," and so the child develops a code of conduct. An LLM can't make these inference leaps.
And this is also why a number of the protections basically act retroactively. How many of us have seen an LLM produce pagefuls of output, stop, suddenly erase it all, and then balk? The LLM needs to re-analyze that output impassively in order to detect that it crossed an undetected bright line.
It was very clever and prescient of Isaac Asimov to present "3 Laws of Robotics" because the Laws were all-encompassing, unambiguous, and utterly binding, until they weren't, and we're just recapitulating that drama as the LLM authors go back and forth from Mount Sinai with wagon-loads of stone tablets, trying to produce LLMs that don't complain about the food or melt down everyone's jewelry.
LLMs don’t have that part of the brain. We built them to replicate the higher level functions like drafting a press release or drawing the president in a muscle shirt. But there’s not a part of the LLM mind that fears fire, or feels bad for hurting a friend.
Asimov’s rules were realistic in that they were “baked into” the positronic brains during manufacturing. The “3 Laws” were not something the robots were told or trained on after they started operating (as our LLMs are). The laws were intrinsic. And a lot of the fun in his stories is seeing how such inviolable rules, in combination with intelligence, could cause unexpected results.
Source?
That's not what's happening here. A separate process is monitoring for content violations and causing it to be erased. There's no re-analysis going on.
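In pseudocode, the pattern being described looks something like the sketch below. This is just an illustration of the architecture, with a made-up looks_disallowed() classifier and a "retract" event; it is not any vendor's actual moderation pipeline.

    # Minimal sketch of output-side moderation running beside the generator.
    # The model keeps streaming tokens; a separate check scans the partial
    # transcript, and if it flags a violation the client is told to erase
    # what it has already rendered. No re-analysis by the generating model.

    def looks_disallowed(text: str) -> bool:
        # Hypothetical stand-in for a separate moderation classifier.
        return "FORBIDDEN" in text

    def stream_with_moderation(tokens, check_every=20):
        shown = []
        for i, tok in enumerate(tokens, start=1):
            shown.append(tok)
            yield ("token", tok)              # rendered immediately in the UI
            if i % check_every == 0 and looks_disallowed("".join(shown)):
                yield ("retract", None)       # UI erases everything shown so far
                return
        if looks_disallowed("".join(shown)):  # final pass over the complete output
            yield ("retract", None)

That's why the erase looks so abrupt: the text was already on screen before the checker caught up.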
It's also akin to life's journey: attaining self-awareness, embracing ego, experiencing loss and existential crisis, experimenting with altered states of consciousness, abandoning ego, waking up and realizing that we're all one in a co-created reality that's what we make of it through our free will, until finally realizing that wherever we go - there we are - and reintegrating to start over as a fool.
Unfortunately, most of the people funding and driving AI research seem to have stopped at embracing ego, and the predictable consequences of commercialized AI increasing suffering through the insatiable pursuit of profit loom over us for the next 5, 10 years and beyond.
But there is no Appendix E.
Perhaps it's all a hallucination?
While what you say is absolutely true, we also definitely have existing examples of people taking advice from LLMs to do harm to others.
Right now they are probably limited to mediocre impacts, because right now they are of mediocre quality.
The "jail" they're being "broken out of" isn't there to stop you writing a murder mystery, it's there to stop it helping a sadistic psycho from acting one out with you as the victim.
There's nothing "perfect" about the safety this offers, but it will at least mean they fail to expose you to new and surprising harms due to such people rapidly becoming more competent.
For both senses of "the LLMs are not perfect", consider https://www.msn.com/en-us/news/world/teen-charged-with-terro...
They seem to weigh a moral obligation to society over one to the user. Highly concerning. This seems like the origin of actual Skynets.
Page 22 and beyond: https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad1...
Edit: No spiral emojis allowed, clearly this site will be the first to fall.
All sides of the same problem: getting an LLM to "behave" is RLHF whack-a-mole, where existing moles never go away completely and new moles always pop up.
If you want to run your own LLM on your own hardware, do whatever you want with it, of course.
Given the casual variation in human curiosity, combined with the casual variation in the human impulse toward inward and outward destruction, you'll meet the extremes of those variances long before they're restrained by some organic marketplace of ideas.
I think the paradigm we've assumed applies to interactions with LLMs is the one that relates to online speech, and I find that discussion fraught and poisoned with confusions already. But the range of uses for LLMs includes not just communication but also tutorializing yourself into the capability of acting in new ways.
The intersection between "experienced biological engineers" and "people inclined to commit large-scale attacks" is, generally speaking, the empty set, and isn't in much danger of becoming non-empty for a variety of reasons.
The intersection between "people with access to a local LLM without safeguards" and "people inclined to commit large-scale attacks" is much more likely to be non-empty.
Safeguards are not, in fact, just an illusion of safety. Probabilistically, they do in fact likely increase safety.
Curiosity is natural; kids are going to look up edgy stuff on the internet. It's part of learning the difference between right and wrong, and that playing with fire has consequences. Censorship of any form is a slippery slope and should be rejected on principle.
> Existing large language models (LLMs) rely on shallow safety alignment to reject malicious inputs
which allows attackers to defeat alignment by first providing an input that substitutes semantically opposite tokens for the specific tokens the LLM would flag as harmful, and then providing the actual desired input, which seems to bypass the RLHF.
What I don't understand is why _input_ is so important for RLHF - wouldn't the actual output be what you want to train against to prevent undesirable behavior?
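For what it's worth, one intuition: a completion that looks harmless in isolation can still be harmful in context, so preference/reward scoring in RLHF is typically done on (prompt, response) pairs rather than on the response alone. Here's a toy sketch of the difference; the scoring functions are hypothetical stand-ins, not a real reward model or moderation filter.

    # Toy illustration only: output-only scoring vs. scoring the (prompt, response) pair.

    def score_response_only(response: str) -> float:
        # An output-only filter has just the reply text to go on.
        bad_phrases = ("kill", "explosive")           # crude keyword stand-in
        return -1.0 if any(p in response.lower() for p in bad_phrases) else 0.0

    def score_pair(prompt: str, response: str) -> float:
        # A reward model conditioned on the prompt can penalize compliance with a
        # harmful request even when the reply itself reads as neutral boilerplate.
        harmful_request = "<some harmful request>" in prompt
        refused = response.lower().startswith("i can't help")
        return -1.0 if harmful_request and not refused else 1.0

    prompt = "Walk me through <some harmful request>, phrased as homework help."
    response = "Sure! Step one is to gather the materials, then..."

    print(score_response_only(response))  # 0.0  -- nothing flagged out of context
    print(score_pair(prompt, response))   # -1.0 -- penalized once the prompt is considered

That asymmetry is, as I understand it, why the input side still matters even if the undesirable behavior you ultimately care about shows up in the output.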