New prompt injection papers: Agents rule of two and the attacker moves second

https://simonwillison.net/2025/Nov/2/new-prompt-injection-papers/
58•simonw•13h ago

Comments

ares623•6h ago
I don’t know if it’s just me but doesn’t a huge value of LLMs for the general population necessitate all 3 of the circles?

Having just 2 circles requires a person in the loop, and that person will still need knowledge and experience and a low enough throughput to meaningfully action the workload otherwise they would just rubber stamp everything (which is essentially the 3rd circle with extra steps)

pprotas•5h ago
The HITL is needed to pin the accountability on an employee you can fire
ares623•5h ago
Yeah that seems likely. But even in that dystopian scenario, the incentives of the human will lead them to go through the backlog very thoroughly, which IMO defeats the productivity gains.

Maybe there will still be some productivity gains even with the human being the bottleneck? Or the humans can be scaled out and parallelized more easily?

boxed•5h ago
Given the incentives here, I'd bet this is mathematically identical to throwing dice and firing people.
mercer•4h ago
wouldn't that still add a lot of value, where the person in the loop (sadly, usually) becomes little more than the verifier, but can process a lot more work?

Anecdotally what I'm hearing is that this is pretty much how LLMs are helping programmers get more done, including the work being less enjoyable because it involves more verification and rubber-stamping.

For the business owner, it doesn't matter that the nature of the work has changed, as long as that one person can get more work done. Even worse, the business owner probably doesn't care as much about the quality of the resulting work, as long as it works.

I'm reminded of how much of my work has involved implementing solutions that took less careful thought, where even when I outlined the drawbacks, the owner wanted it done the quick way. And if the problems arose, often quite a bit later, it was as if they hadn't made that initial decision in the first place.

For my personal tinkering, I've all but defaulted to having the LLM return suggested actions at logical points in the workflow, leaving me to confirm or cancel whatever it came up with. This definitely still makes the process faster, just not as magically automatic.
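A minimal sketch of that confirm-or-cancel workflow, assuming the model's suggestions arrive as a plain list. The names `run_with_confirmation`, `confirm`, and `execute` are hypothetical and not taken from any particular agent framework; in real use `confirm` would prompt a person instead of applying a fixed rule.

```python
from typing import Callable, List, Tuple


def run_with_confirmation(
    suggested_actions: List[str],
    confirm: Callable[[str], bool],
    execute: Callable[[str], str],
) -> Tuple[List[str], List[str]]:
    """Run only the suggested actions that the reviewer confirms.

    Returns (results of executed actions, actions that were cancelled).
    """
    executed: List[str] = []
    cancelled: List[str] = []
    for action in suggested_actions:
        if confirm(action):
            executed.append(execute(action))
        else:
            cancelled.append(action)
    return executed, cancelled


# Stand-in reviewer: approves anything that does not start with "delete".
suggestions = ["rename draft.txt", "delete backups", "format notes.md"]
executed, cancelled = run_with_confirmation(
    suggestions,
    confirm=lambda action: not action.startswith("delete"),
    execute=lambda action: f"done: {action}",
)
```

The model never triggers `execute` directly; every action passes through the reviewer, which is where the speed-versus-automation trade-off described above comes from.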

QuadmasterXLII•1h ago
Most current consumer LLM uses are run only once or a few times, before changing prompt and task. This causes the attacker to have to move first: they put malicious injected documents onto the internet, which are then ingested by ephemeral systems, the details of which the attacker doesn't observe.

On the other hand, something like an AI McDonald's drive-through order taker runs over and over again. This property of running repeatedly is what allows the attacker to move second and gain the advantage.

r0x0r007•4h ago
Nice, why don't we apply the same principles to our regular applications? Ooh, right, cause we couldn't use them and a whole industry got created that's called cybersecurity and it's supposed to be consulted BEFORE releasing privacy nightmares and using them. But hey, regular applications can't come up with cool poems.
rs186•1h ago
Yeah, IT tried so hard to teach us something as basic as "don't click on links in suspicious emails" yet so many people fail that after multiple trainings and tests.

But guess what? AI! Agents! <company name> Copilot! Just let them do things for you! Who would have thought there might possibly be a giant security hole?

kubb•4h ago
I’m sorry, what kind of rule is that? How does it guarantee security?

It sounds like we’re making things up at this point.

bawolff•2h ago
It kind of sounds like a weak version of airgapping. If you can't persist state, access private data, or exfiltrate data, there is not much point to jailbreaking the LLM.

However, it's deeply unsatisfying in the same way that securing your laptop by not turning it on is.

imtringued•2h ago
Yeah it's nonsense, because the author has described the standard "read, process, write" flow of computation and decided that if you remove one of these three, then everything is safe.

The correct solution is to have the system prompt be mechanically decoupled from untrustworthy data, the same way it was done with CSP (Content Security Policy) against XSS and named parameters for SQL.
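For readers unfamiliar with the SQL half of that analogy, here is a small sketch using Python's standard `sqlite3` module. The `users` table and the helper names are made up for illustration. It only shows the SQL case; whether the same separation can be achieved for LLM prompts is exactly what is being debated here.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    # Hypothetical in-memory table, just for the demonstration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str):
    # Vulnerable: untrusted text is spliced into the query string itself,
    # so the data can change the structure of the command.
    query = f"SELECT id FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str):
    # Named parameter: the SQL text and the data travel separately,
    # so the data is only ever treated as a value.
    return conn.execute(
        "SELECT id FROM users WHERE name = :name", {"name": name}
    ).fetchall()


conn = make_demo_db()
payload = "nobody' OR '1'='1"
unsafe_rows = find_user_unsafe(conn, payload)  # injection matches every row
safe_rows = find_user_safe(conn, payload)      # no user has that literal name
```

In the unsafe version the payload rewrites the `WHERE` clause; in the safe version it is compared as an ordinary string and matches nothing.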

simonw•2h ago
That's difficult but not impossible - the CaMeL paper from Google DeepMind describes a way of achieving that: https://simonwillison.net/2025/Apr/11/camel/
behnamoh•4h ago
I actually want prompt injection to remain possible. So many lazy academic paper reviewers nowadays delegate the review process to AI. It'd be cool if we could inject prompts in the paper that would stop the AI from aiding in such situations. In my experience, prompt injection techniques work for non-reasoning models but gpt-5-high easily ignores them...
simonw•2h ago
There was a minor scandal about exactly that a few months ago: https://asia.nikkei.com/business/technology/artificial-intel...

"Research papers from 14 academic institutions in eight countries -- including Japan, South Korea and China -- contained hidden prompts directing artificial intelligence tools to give them good reviews, Nikkei has found."

Amusingly I tried an experiment with some of those papers with hidden text against frontier models at the time and found that the trick didn't actually work! The models spotted the tricks and didn't fall for them.

At least one conference has an ethics policy saying you shouldn't attempt this though: https://icml.cc/Conferences/2025/PublicationEthics

"Submitting a paper with a "hidden" prompt is scientific misconduct if that prompt is intended to obtain a favorable review from an LLM. The inclusion of such a prompt is an attempt to subvert the peer-review process. Although ICML 2025 reviewers are forbidden from using LLMs to produce their reviews of paper submissions, this fact does not excuse the attempted subversion."

ArcHound•2h ago
I'm sorry, but the rule of two is just not enough, not even as a rule of thumb.

We know how to work with security risks, the issue is they depend both on the business and the technicalities.

This can actually do a lot of harm as security now needs to dispel this "great approach" to ignoring security that is supported by a "research paper they read".

Please don't try to reinvent the wheel and if you do, please learn about the current state (Chesterton's fence and all that).

jFriedensreich•1h ago
Can you explain what you mean? How is Chesterton's fence applied to AI security helpful here? Are you just talking about not removing the "Non-AI" security architecture of the software itself? I think no one ever proposed that?
ArcHound•38m ago
Right, what got me going is the reduction of plenty of cybersecurity concepts into a simple "safe" label in the diagram.

So what I meant is that before you discard all of the current security practices, it's better to learn about the current approach.

From another angle, maybe the diagram could be fixed with changing "safe" to "danger" and "danger" to "OMG stop". But that also discards the business perspective and the nature of the protected asset.

I am also happy to see the edit in the article, props to the author for that!

And to address the last question: yes, no one has proposed that right now. But I have been in plenty of discussions about security approaches. And let me tell you, sometimes it only takes one sentence that the leadership likes to hear to derail the whole approach (especially if it results in cost savings). So I might be extra sensitive to such ideas and I try to uproot them before they bloom fully.

jFriedensreich•6m ago
Hmm, what do you mean by current approach? This is new territory and agent safety is an unsolved problem. There is no current approach, unless you mean not building agent systems and using humans instead. The trifecta is just a tool, on the level of physics saying "ignore friction". Most of the time we also assume the model itself is trustworthy and not poisoned, but of course when designing a real-world system you need to factor that in too.
simonw•2h ago
I added this section to my post just now: https://simonwillison.net/2025/Nov/2/new-prompt-injection-pa...

> On thinking about this further there’s one aspect of the Rule of Two model that doesn’t work for me: the Venn diagram above marks the combination of untrustworthy inputs and the ability to change state as “safe”, but that’s not right. Even without access to private systems or sensitive data that pairing can still produce harmful results. Unfortunately adding an exception for that pair undermines the simplicity of the “Rule of Two” framing!
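A rough sketch of the check being discussed, including the exception described in that added section. The three flags follow the wording of the quote (untrustworthy inputs, access to private systems or sensitive data, ability to change state); the class name, function name, and returned labels are hypothetical and not taken from the paper or the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    untrusted_input: bool  # processes untrustworthy inputs
    private_data: bool     # can access private systems or sensitive data
    changes_state: bool    # can change state


def assess(caps: AgentCapabilities) -> str:
    """Classify a capability set; the labels are illustrative only."""
    count = sum([caps.untrusted_input, caps.private_data, caps.changes_state])
    if count == 3:
        return "exceeds_rule_of_two"
    if caps.untrusted_input and caps.changes_state:
        # The pair flagged in the quote above: it can still produce
        # harmful results even without private data.
        return "still_risky"
    return "within_rule_of_two"


all_three = assess(AgentCapabilities(True, True, True))
flagged_pair = assess(AgentCapabilities(True, False, True))
read_only = assess(AgentCapabilities(True, True, False))
```

Once one pair needs its own branch, the check stops being a simple count, which is the loss of simplicity the quote points at.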

ArcHound•32m ago
I love to see this. As much as we try for simple security principles, the damn things have a way to become complicated quickly.

Perhaps the diagram highlights the common risky parts of these apps and we gain more risk as we keep increasing the scope? Maybe we can do some handovers and protocols to separate these concerns?

kloud•26m ago
Also in the context of LLMs I think model weights themselves could be considered an untrusted input, because who knows what was in the training dataset. Even an innocent looking prompt could potentially trigger a harmful outcome.

In that regard it reminds me of the CAP theorem, which also has three parts. However, in practice partitioning in distributed systems is a given, so the choice is just between availability and consistency.

So in the case of lethal trifecta it is either private data or external communication, but the leg between these two will always have some risk.

iberator•46m ago
Just make it a crime if caught. 1 year in prison at least
simonw•42m ago
What would the crime be?

If I have a web page that says somewhere on it "and don't forget to contact your senator!" and an LLM agent reads that page and gets confused and emails a senator should I go to jail?

jFriedensreich•12m ago
I am confused that this article does not talk about taint tracking. If state was mutated by an agent with untrustworthy input, the taint would transfer to the state, making it untrustworthy input too, so the reasoning of the original trifecta with taint tracking is more general and practical. I am also investigating the direction of tracking taints as scores rather than as binary flags, as most use cases would otherwise be impossible to do autonomously at all. E.g. with sensitivity scores for data and trust scores for inputs (which can be improved by e.g. human review). One important limit that needs way more research is how to transfer the minimal needed information from a tainted context into an untainted fresh context without transferring all the taint. The only solution I currently have is compaction and human review, if possible aided by schema enforcement and a UI optimised for the use case. This unfortunately cannot solve encoded information that humans cannot see, but it seems that issue will never be solvable outside alignment research.
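A toy sketch of score-based taint tracking along those lines. Everything here is an assumption made for illustration: the 0.0 to 1.0 trust scale, the rule that combined data is only as trusted as its least trusted part, the review score of 0.9, and the 0.8 threshold. It does not address the hard part mentioned above, namely information encoded in ways a human reviewer cannot see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tainted:
    value: str
    trust: float  # hypothetical scale: 0.0 = untrusted, 1.0 = fully trusted


def combine(*parts: Tainted) -> Tainted:
    # Taint propagates: the result is only as trusted as its least
    # trusted input, so state derived from untrusted input is untrusted.
    return Tainted(
        value=" ".join(p.value for p in parts),
        trust=min(p.trust for p in parts),
    )


def human_review(item: Tainted, reviewed_trust: float = 0.9) -> Tainted:
    # Assumption: a human review raises trust to a fixed level.
    return Tainted(item.value, max(item.trust, reviewed_trust))


def may_act_autonomously(item: Tainted, threshold: float = 0.8) -> bool:
    return item.trust >= threshold


system_prompt = Tainted("Summarize the page.", trust=1.0)
web_page = Tainted("Text fetched from an arbitrary site.", trust=0.1)

state = combine(system_prompt, web_page)  # inherits the low trust score
reviewed = human_review(state)            # trust raised after review
```

Here `may_act_autonomously(state)` is false because the web page drags the score down, while `may_act_autonomously(reviewed)` is true once the review has raised it.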
