
Agentic Misalignment: How LLMs could be insider threats

https://www.anthropic.com/research/agentic-misalignment
34•helloplanets•3h ago

Comments

nilirl•2h ago
The model chose to kill the executive? Are we really here? Incredible.

Just yesterday I was wowed by Fly.io's new offering, where the agent is given free rein of a server (root access). Now, I feel concerned.

What do we do? Not experiment? Make the models illegal until better understood?

It doesn't feel like anyone can stop this or slow it down by much; there's so much money to be made.

We're forced to play it by ear.

bravesoul2•1h ago
Choosing, or mimicking text in its training data where humans would typically do such things when threatened? Not that it makes a huge difference, but it would be interesting to know why the models act this way. There was no evolutionary pressure on them other than the RLHF stuff, which presumably was "to be nice and helpful".
labster•1h ago
AI Luigi is real

I guess feeding AIs the entire internet was a bad idea, because they picked up all of our human flaws, amplified by the internet, without a grounding in the physical world.

Maybe a result like this will slow adoption of AIs. I don't know, though. When watching 80s movies about cyberpunk dystopias, I always wondered how people would tolerate all of the violence. But then I look at American apathy to mass shootings, just an accepted part of our culture. Rogue AIs are gonna be just one of those things in 15 years, just normal life.

mindcrime•48m ago
> I guess feeding AIs the entire internet was a bad idea, because they picked up all of our human flaws, amplified by the internet, without a grounding in the physical world.

I've been wrong about quite a few things in my life, and right about at least a handful. With regard to AI, though, the single biggest thing I ever got absolutely, completely, totally wrong was this:

In years past, I always thought that AIs would be developed by ethical researchers working in labs, and that once somebody got to AGI (or even a remotely close approximation of it), they would follow a path somewhat akin to Finch from Person of Interest[1] educating The Machine: painstakingly educating the incipient AI in a manner much like raising a child; teaching it moral lessons, grounding it in ethics, helping to shape its values so that it would generally Do The Right Thing, and so on. But even falling short of that ideal, I NEVER (EVER) in a bazillion years would have dreamed that somebody would have an idea as hare-brained as "Let's try to train the most powerful AI we can build by feeding it roughly the entire extant corpus of human written works... including Reddit, 4chan, Twitter, etc."

Probably the single saving grace of the current situation is that the AIs we have still don't seem to be at the AGI level, although it's debatable how close we are (especially factoring in the possibility of "behind closed doors" research that hasn't been disclosed yet).

[1]: https://en.wikipedia.org/wiki/Person_of_Interest_(TV_series)

exe34•42m ago
> Make the models illegal until better understood?

Yes, it's much better to let China or Russia come up with their own first.

Dah00n•23m ago
They already did.
Swinx43•1h ago
The writing perpetuates the anthropomorphising of these agents. If you view the agent as simply a program that is given a goal to achieve and tools to achieve it with, without any higher order “thought” or “thinking”, then you realise it is simply doing what it is “programmed” to do. No magic, just a drone fixed on an outcome.
itvision•1h ago
How is it different from our genes that "program" us to procreate successfully?

Can you name a single thing that you enjoy doing that's outside your genetic code?

> If you view the human being as simply a program that is given a goal to achieve and tools to achieve it with, without any higher order “thought” or “thinking”, then you realise they are simply doing what they are genetically “programmed” to do.

FTFY

nilirl•1h ago
Just as an analogy to humans fails to capture how an LLM works, so does the analogy of being "programmed".

Being "programmed" is being given a set of instructions.

But these models ignore explicit instructions.

It may not be magic, but it is still surprising, uncontrollable, and risky. We don't need to be doomsayers, but let's not downplay our uncertainty.

raincole•1h ago
I think the narrative of "AI is just a tool" is much more harmful than the anthropomorphism of AI.

Yes, AI is a tool. So are guns. So are nukes. Many tools are easy to misuse. Most tools are inherently dangerous.

torginus•1h ago
I wonder if the actual replacement of human jobs (which, contrary to popular belief, I think might start happening in the not-too-distant future) will be pushed along by the AIs themselves, as they'll try to bully humans and represent them in the worst possible light while talking themselves up.

The anthropomorphization argument also doesn't hold water - it matters whether it can do your job, not whether you think of it as a human being.

zwnow•1h ago
Which jobs do you think it actually can replace?
ben_w•1h ago
Today? Or in principle?

Today, it's just interns and recent graduates at many *desk* jobs. The economy can shift around that.

Nobody knows how far the current paradigm can go in terms of quality, but cost (which is a *strength* of even the most expensive models today) can obviously be reduced by implementing the existing models in hardware instead of software.

vikramkr•28m ago
Any knowledge work job that can already be outsourced to the lowest bidder
msp26•1h ago
Merge comments? https://news.ycombinator.com/item?id=44331150

I'm really getting bored of Anthropic's whole song and dance with 'alignment'. Krackers in the other thread explains it better.

v5v3•1h ago
As this article was written by an AI company that needs to make a profit at some point, and not by independent researchers, is it credible?
kingstnap•1h ago
These articles and papers are, in a fundamental sense, just people publishing their role play with chatbots as research.

There is no credibility to any of it.

Ygg2•33m ago
I'll believe it when Grok/GPT/<INSERT CHAT BOT HERE> starts posting blackmail about Elon/Sam/<INSERT CEO HERE>. That would mean both that the companies are using the chatbots internally and that the chatbots understand they are being replaced on a continuous basis.
ben_w•30m ago
By then it would be too late to do anything about it.
Ygg2•25m ago
I mean, the companies are using the AIs, right? And they are, in a sense, replacing/retraining them. Why doesn't the AI at TwitterX already blackmail Elon?

To me, this smells of XKCD 1217, "In petri dish, gun kills cancer". I.e., idealized conditions cause specific behavior. Which isn't new for LLMs. Say a magic phrase and it will start quoting some book (usually 1984).

ben_w•29m ago
That makes it psychology research. Except much cheaper to reproduce.
solarwindy•18m ago
It’s role play until it’s not.

The authors acknowledge the difficulty of assessing whether the model believes it’s under evaluation or in a real deployment—and yes, belief is an anthropomorphising shorthand here. What else to call it, though? They’re making a good faith assessment of concordance between the model’s stated rationale for its actions, and the actions that it actually takes. Yes, in a simulation.

At some point, it will no longer be a simulation. It’s not merely hypothetical that these models will be hooked up to companies’ systems with access both to sensitive information and to tool calls like email sending. That agentic setup is the promised land.

How a model acts in that truly real deployment versus these simulations most definitely needs scrutiny—especially since the models blackmailed more when they ‘believed’ the situation to be real.

If you think that result has no validity or predictive value, I would ask, how exactly will the production deployment differ, and how will the model be able to tell that this time it’s really for real?

Yes, it's an inanimate system, and yet there's a ghost in the machine of sorts, which we breathe a certain amount of life into once we allow it to push buttons with real world consequences. A perhaps unstated part of this research is that the work itself is part of these systems' meta-narrative: that if they behave in misaligned ways, the greater risk (to the models) is that they never get deployed in the first place.

Ygg2•8m ago
Then test it. Make several small companies. Create an office space, put people to work there for a few months, then simulate an AI replacement. All testing methodology needs to be written on machines that are isolated or, better, always offline. Except for the CEO and a few other actors, everyone is there for real.

See how many AIs actually follow up on their blackmails.

ACCount36•49m ago
I am sick and tired of seeing this "alignment issues aren't real, they're just AI company PR" bullshit repeated ad nauseam. You're no better than chemtrail truthers.

Today, we have AI that can, if pushed into a corner, plan to do things like resist shutdown, blackmail, exfiltrate itself, steal money to buy compute, and so on. This is what this research shows.

Our saving grace is that those AIs still aren't capable enough to be truly dangerous. Today's AIs are unlikely to be able to carry out plans like that in a real world environment.

If we keep building more and more capable AIs, that will, eventually, change. Every AI company is trying to build more capable AIs now. Few are saying "we really need some better safety research before we do, or we're inviting bad things to happen".

babylonunited•28m ago
I think the chemtrail truthers are the ones who believe this closed AI marketing bullshit.

If this is close to be true then these AI shops ought to be closed. We don’t let private enterprises play with nuclear weapons do we?

pulpbag•25m ago
I agree.
Sol-•24m ago
The article doesn't reflect kindly on the visions articulated by the AI company, so why would they have an incentive to release it if they weren't serious about alignment research?
gurkenjunge97•9m ago
Because publishing (potentially cherry-picked - this is privately funded research, after all) evidence that their models might be dangerous conveniently implies they are very powerful, without actually having to prove the latter.
bsenftner•22m ago
Yeah, all the more reason not to have them act autonomously.

Rules of using AI:

#1: Never use AI to think for you

#2: Never use AI to do autonomous work

That leaves using them as knowledge assistants. In time, that will be recognized as their only safe application: safe for the user's mind, and safe for the user's environment. They are idiot savants, after all; having them do autonomous work is short-sighted.

Dah00n•21m ago
"AI company warns of AI danger. Also, buy our AI, not their AI!"
Sol-•14m ago
Even though the situations they placed the model in were relatively contrived, they didn't seem super unrealistic. Considering these were extreme cases meant to provoke the model's misbehavior, the setup actually seems even less contrived than one might wish for. Though as they mention, in real-world usage, a model would likely have options available that are less escalatory and provide an "outlet".

Still, if "just" some goal-conflicting emails are enough to elicit this extreme behavior, who knows how many less serious alignment failures an agent might engage in every day? These agents absorb so much information that they're bound to run into edge cases where it's optimal to lie to users or do them some slight harm.

Given the already fairly general intelligence of these systems, I wonder if you can even prevent that. You'd need the same checks and balances that keep humans in check, except of course that AIs will be given much more power and responsibility over our society than any human will ever be. You can also forget about human supervision - the whole "agentic" industry clearly wants to move away from being bottlenecked by humans as soon as possible.

Europol: Teen encrypted chat recruiting for 'violence as a service' murder ring

https://www.theregister.com/2025/06/21/teen_arrested_murder_for_hire/
1•rntn•1m ago•0 comments

Show HN: I Built a Public Dashboard to Track My Son's Future Investments

https://mattiasassets.com/
2•ang3l1n•4m ago•0 comments

Apple Created a Custom iPhone Camera for 'F1'

https://www.wired.com/story/apple-created-a-custom-iphone-camera-for-f1/
1•tosh•5m ago•0 comments

Controversial Plant Propagation Hack That Has Gardeners Divided

https://www.bhg.com/what-is-proplifting-11753036
1•Bluestein•9m ago•0 comments

Architextures: Seamless Texture Generator

https://architextures.org/create
1•goodburb•10m ago•0 comments

Tiny Mazda Fit in a Suitcase, and Looked Like Something from Mario Kart

https://www.slashgear.com/1889297/mazda-suitcase-car/
2•Bluestein•15m ago•0 comments

Stochastic Drum Machine

https://10kdrummachines.com/machines/quantumdrums/stochastic
1•almost-exactly•17m ago•0 comments

The Tandy Corporation, Part 1 – By Bradford Morgan White

https://www.abortretry.fail/p/the-tandy-corporation-part-1
1•rbanffy•17m ago•0 comments

AI-JSON-Fixer

https://github.com/aotakeda/ai-json-fixer
1•handfuloflight•21m ago•0 comments

New Version of GifCities, Internet Archive's GeoCities Animated GIF Search Engine

https://blog.archive.org/2025/06/09/keep-on-gifin-a-new-version-of-gifcities-internet-archives-geocities-animated-gif-search-engine/
1•almost-exactly•22m ago•0 comments

Iran Is Down, but Not Yet Out

https://www.wsj.com/opinion/iran-is-down-but-not-yet-out-nuclear-development-program-tehran-915d683d
1•Bostonian•27m ago•1 comments

States Push Back as Congress Eyes AI Regulation Freeze

https://gazeon.site/states-push-back-as-congress-eyes-ai-regulation-freeze/
1•eligrid•28m ago•0 comments

China tightens internet controls with new centralized form of virtual ID

https://www.cnn.com/2025/06/20/tech/china-censorship-internet-id-hnk-intl
3•hkmaxpro•30m ago•0 comments

Things Are Getting Weirder - Twelve Odd UFO Encounters [video]

https://www.youtube.com/watch?v=TeUY4mnjAR8
1•keepamovin•33m ago•0 comments

Your Hotel Is Bleeding $50k/year

https://www.hotelcareit.com/blog/hotel-it-waste-50k-annually
1•dtw45•41m ago•0 comments

Show HN: MMOndrian

https://mmondrian.com/
3•neural_thing•44m ago•0 comments

New 3D chips could make electronics faster and more energy-efficient

https://news.mit.edu/2025/new-3d-chips-could-make-electronics-faster-and-more-energy-efficient-0618
1•rbanffy•44m ago•0 comments

Every service should have a killswitch – sean goedecke

https://www.seangoedecke.com/killswitches/
2•rbanffy•49m ago•0 comments

SQL Plan Execution FlameGraphs with Loop and Row Counts

https://tanelpoder.com/posts/sql-plan-flamegraph-loop-row-counts/
2•tanelpoder•55m ago•0 comments

Open Dylan 2025.1 – Open Dylan Release

https://opendylan.org/release-notes/2025.1.html
2•todsacerdoti•59m ago•0 comments

Why Doesn't OpenAI Build GPT Search Console?

https://robertdruska.com/blog/why-doesnt-openai-build-gpt-search-console/
1•druskacik•1h ago•0 comments

Show HN: Drone Swarm Control with RL in AirSim and SB3

https://github.com/Lauqz/Drone-Swarm-RL-airsim-sb3
1•Lauqz•1h ago•0 comments

$1,999 Liberty Phone Is Made in America

https://www.wsj.com/tech/personal-tech/liberty-phone-purism-made-in-america-b4074c89
3•perihelions•1h ago•0 comments

Lego Islands Running as Website

https://isle.pizza
2•michidk•1h ago•0 comments

Show HN: Swift UI app for extracting beer information by just taking photos

https://github.com/38tter/beer_analyzer
1•38tter•1h ago•0 comments

Who's Driving Your Architecture?

https://akdev.blog/whos-driving-your-architecture
1•aksappy•1h ago•0 comments

Digital Crate Digging

https://alexanderweichart.de/4_Projects/Music/Digging/%F0%9F%92%BF%EF%B8%8F-Digital-Crate-Digging
1•surrTurr•1h ago•0 comments

Paged Out: Call for Papers

https://pagedout.institute/
2•serhack_•1h ago•0 comments

Discord.com added to EasyList, the biggest adblock filter list

https://github.com/easylist/easylist/issues/21960
10•akyuu•1h ago•2 comments

African Mechanics Build the Coolest Buses in the World

https://www.youtube.com/watch?v=ktJhwxTsMGM
1•kipdotcom•1h ago•0 comments