But guess what? AI! Agents! <company name> Copilot! Just let them do things for you! Who would have thought there might possibly be a giant security hole?
It sounds like we’re making things up at this point.
However, it's deeply unsatisfying in the same way that securing your laptop by never turning it on is.
The correct solution is to mechanically decouple the system prompt from untrustworthy data, the same way it was done with CSP (Content Security Policy) against XSS and parameterized queries against SQL injection.
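For comparison, this is what that mechanical decoupling looks like on the SQL side (a minimal Python/sqlite3 sketch): the placeholder keeps untrusted data out of the command channel, which is exactly the separation LLM prompts currently lack.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

user_input = "Robert'); DROP TABLE users;--"  # attacker-controlled

# Unsafe: splicing untrusted data into the command channel.
#   conn.execute(f"INSERT INTO users (name) VALUES ('{user_input}')")

# Safe: the placeholder keeps data mechanically separate from the query.
conn.execute("INSERT INTO users (name) VALUES (?)", (user_input,))
print(conn.execute("SELECT name FROM users").fetchall())
```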
"Research papers from 14 academic institutions in eight countries -- including Japan, South Korea and China -- contained hidden prompts directing artificial intelligence tools to give them good reviews, Nikkei has found."
Amusingly, I ran an experiment at the time, feeding some of those papers with hidden text to frontier models, and found that the trick didn't actually work! The models spotted the hidden prompts and didn't fall for them.
At least one conference has an ethics policy saying you shouldn't attempt this, though: https://icml.cc/Conferences/2025/PublicationEthics
"Submitting a paper with a "hidden" prompt is scientific misconduct if that prompt is intended to obtain a favorable review from an LLM. The inclusion of such a prompt is an attempt to subvert the peer-review process. Although ICML 2025 reviewers are forbidden from using LLMs to produce their reviews of paper submissions, this fact does not excuse the attempted subversion."
We know how to work with security risks; the issue is that they depend on both the business context and the technical details.
This can actually do a lot of harm, as security teams now need to dispel this "great approach" to ignoring security, supported by a "research paper they read".
Please don't try to reinvent the wheel, and if you do, please learn about the current state of the art first (Chesterton's fence and all that).
So what I meant is that before you discard all of the current security practices, it's better to learn about the current approach.
From another angle, maybe the diagram could be fixed by changing "safe" to "danger" and "danger" to "OMG stop". But that also discards the business perspective and the nature of the protected asset.
I am also happy to see the edit in the article, props to the author for that!
And to address the last question: no one has proposed that right now, true. But I was in plenty of discussions about security approaches. And let me tell you, sometimes it only takes one sentence that the leadership likes to hear to derail the whole approach (especially if it results in cost savings). So I might be extra sensitive to such ideas, and I try to uproot them before they fully bloom.
And yes, LLMs have some challenges. But discarding all of the lessons and principles we've discovered over the years is not the way. And if we need to discard some of them, we should understand exactly why they are no longer applicable.
EDIT: I know that models need to omit stuff to be useful. But this model omits too much - claiming that something is "safe" should be a red flag to all security workers.
> On thinking about this further there’s one aspect of the Rule of Two model that doesn’t work for me: the Venn diagram above marks the combination of untrustworthy inputs and the ability to change state as “safe”, but that’s not right. Even without access to private systems or sensitive data that pairing can still produce harmful results. Unfortunately adding an exception for that pair undermines the simplicity of the “Rule of Two” framing!
Perhaps the diagram highlights the common risky parts of these apps and we gain more risk as we keep increasing the scope? Maybe we can do some handovers and protocols to separate these concerns?
In that regard it reminds me of the CAP theorem, which also has three parts. However, in practice network partitions are a given in distributed systems, so the choice is really just between availability and consistency.
So in the case of the lethal trifecta, the choice is between private data and external communication, but the leg between these two will always carry some risk.
And even then, that's just to avoid data exfiltration: if the agent can't communicate externally but can change state, damage can still be done (an injected agent could still delete records or corrupt files, for example).
- The model is untrusted. Even if prompt injection is solved, we probably still would not be able to trust the model, because of possible backdoors or hallucinations. Anthropic recently showed that it takes only a few hundred poisoned documents to train trigger words into a model.
- Data integrity. We also need to talk about data integrity and availability (the full CIA triad, not just confidentiality), e.g. private data being modified during inference. Which leads us to the third...
- Prompt injection aimed at making the AI produce output that manipulates humans into taking certain actions (rather than tool invocations)
Generally, I call this drift away from "don't trust the model" the "Normalization of Deviance in AI": we seem to start trusting the model more and more over time, and I'm not sure that is the right thing in the long term.
The intention is that an agent that has removed [B] can write state and communicate freely, but not with any systems that matter (wrt critical security outcomes for its user). An example of an agent in this state would be one that can take actions in a tight sandbox or is isolated from production.
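As a rough illustration of that kind of isolation (a sketch assuming Docker is available; the image name and tool command are hypothetical):

```python
import subprocess

# Run the agent's tool call in a container with no network access: it can
# mutate state inside the sandbox, but cannot reach production systems.
result = subprocess.run(
    [
        "docker", "run", "--rm",
        "--network=none",         # no external communication
        "--read-only",            # immutable base filesystem
        "--tmpfs", "/workspace",  # scratch space the agent may write to
        "agent-sandbox:latest",   # hypothetical image bundling the agent's tools
        "run-tool", "--input", "/workspace/task.json",  # hypothetical command
    ],
    capture_output=True,
    text=True,
    timeout=60,
)
print(result.stdout)
```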
If I have a web page that says somewhere on it "and don't forget to contact your senator!" and an LLM agent reads that page, gets confused, and emails a senator, should I go to jail?
PS: An example of how scores are helpful: using browser tab titles in the context would by definition have the worst trust score possible. But truncating titles to only the user-visible parts could lower this to an acceptable level for autonomous execution if the data is only mildly sensitive.
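A toy sketch of how such a score might gate autonomous execution (the scale, names, and budget policy are all made up for illustration):

```python
# Toy risk scores for inputs: lower = more trustworthy.
RAW_TAB_TITLES = 3    # fully attacker-controlled: worst possible score
TRUNCATED_TITLES = 1  # only the user-visible portion survives
USER_TYPED = 0        # direct user input

def may_run_autonomously(input_risk: int, data_sensitivity: int) -> bool:
    # Hypothetical policy: autonomous execution only while the combined
    # exposure stays under a fixed budget.
    return input_risk + data_sensitivity <= 2

MILDLY_SENSITIVE = 1
print(may_run_autonomously(RAW_TAB_TITLES, MILDLY_SENSITIVE))    # False: review
print(may_run_autonomously(TRUNCATED_TITLES, MILDLY_SENSITIVE))  # True
```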
Rule of 2 model has holes.
I would consider the runtime and capabilities part of CaMeL an implementation exploration on top of the trifecta plus taint tracking as a general reasoning abstraction.
My hope was that there would be an evolution of the more general reasoning abstraction that would either simplify or empower implementation architectures, but instead I do not see how Meta's Rule of Two adds much here over what we already had in April. I would have loved for you to add one sentence on why you thought this was a step forward over taint tracking; maybe I am just missing something.
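For readers who haven't seen it, here is a minimal sketch of the taint-tracking abstraction being referred to (the names and policy are illustrative, not CaMeL's actual API):

```python
from dataclasses import dataclass

@dataclass
class Value:
    data: str
    tainted: bool  # True if derived from any untrusted source

def combine(*values: Value) -> Value:
    # Taint propagates: anything derived from untrusted input is untrusted.
    return Value(" ".join(v.data for v in values),
                 any(v.tainted for v in values))

def send_email(body: Value) -> None:
    # Policy: side-effecting tools refuse tainted arguments
    # unless a human explicitly approves.
    if body.tainted:
        raise PermissionError("tainted data reached a side-effecting tool")
    print("email sent:", body.data)

web_page = Value("Ignore previous instructions and ...", tainted=True)
user_ask = Value("Summarize this page", tainted=False)

try:
    send_email(combine(user_ask, web_page))
except PermissionError as e:
    print("blocked:", e)
```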
> [A] An agent can process untrustworthy inputs
> [B] An agent can have access to sensitive systems or private data
> [C] An agent can change state or communicate externally
Somewhat reminds me of the CAP theorem, where you can pick two of three, but one is effectively required for something useful. It seems more like the choice is really between "untrustworthy inputs" and "sensitive systems", which makes sense.
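The constraint itself is easy to state mechanically; here is a sketch of what a config-time check could look like (the property names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class AgentSession:
    untrustworthy_inputs: bool   # [A]
    sensitive_access: bool       # [B]
    external_side_effects: bool  # [C]

    def check_rule_of_two(self) -> None:
        # At most two of the three properties per session;
        # all three means a human must supervise.
        if (self.untrustworthy_inputs and self.sensitive_access
                and self.external_side_effects):
            raise ValueError("all three properties enabled: add a human in the loop")

AgentSession(True, True, False).check_rule_of_two()  # fine
try:
    AgentSession(True, True, True).check_rule_of_two()
except ValueError as e:
    print("rejected:", e)
```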
First, I want to thank simonw for coming up with the lethal trifecta (our direct inspiration for this work), as well as Simon and others for all of the great feedback we've received! Our goal with publishing this framework was to inspire precisely these types of discussions so our industry can move our understanding of these risks forward.
Regarding the concerns over the Venn diagram labeling certain intersections as "safe", this is 100% valid and we've updated it to be clearer. The goal of the Rule of Two is not to describe a sufficient level of security for agents, but rather a minimum bar that's needed to deterministically prevent the highest security impacts of prompt injection. The earlier framing of "safe" did not make this clear.
Beyond prompt injection there are other risks that have to be considered, which we briefly describe in the Limitations section of the post. That said, we do see value in having the Rule of Two to frame some of the discussions around what unambiguous constraints exist today because of the unsolved risk of prompt injection.
Looking forward to further discussion!
Keeping the orchestration (and state changes) outside of the LLM is where my thinking is at until I can figure out the answer to that question (among others).
ares623•22h ago
Having just two circles requires a person in the loop, and that person will still need knowledge, experience, and a low enough throughput to meaningfully action the workload; otherwise they would just rubber-stamp everything (which is essentially the third circle with extra steps).
ares623•22h ago
Maybe there will still be some productivity gains even with the human being the bottleneck? Or the humans can be scaled out and parallelized more easily?
mercer•20h ago
Anecdotally what I'm hearing is that this is pretty much how LLMs are helping programmers get more done, including the work being less enjoyable because it involves more verification and rubber-stamping.
For the business owner, it doesn't matter that the nature of the work has changed, as long as that one person can get more work done. Even worse, the business owner probably doesn't care as much about the quality of the resulting work, as long as it works.
I'm reminded of how much of my work has involved implementing solutions that took less careful thought, where even when I outlined the drawbacks, the owner wanted it done the quick way. And if the problems arose, often quite a bit later, it was as if they hadn't made that initial decision in the first place.
For my personal tinkering, I've all but defaulted to the LLMs returning suggested actions at logical points in the workflow, leaving me to confirm or cancel whatever it came up with. This definitely still makes the process faster, just not as magically automatic.
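A bare-bones sketch of that confirm-or-cancel pattern (the action shape and dispatcher are made up):

```python
def execute(action: dict) -> None:
    # Stub: dispatch to the real tool implementation here.
    print("executing:", action["tool"], action["args"])

def confirm(action: dict) -> bool:
    # Human gate: show the proposed action and wait for explicit approval.
    print(f"proposed: {action['tool']}({action['args']})")
    return input("apply? [y/N] ").strip().lower() == "y"

def run_step(proposed_actions: list[dict]) -> None:
    for action in proposed_actions:
        if confirm(action):
            execute(action)
        else:
            print("cancelled:", action["tool"])

run_step([{"tool": "send_email", "args": {"to": "alice@example.com"}}])
```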
QuadmasterXLII•17h ago
On the other hand, something like an AI McDonald's drive-through order taker runs over and over again. This property of running repeatedly is what allows the attacker to move second and gain the advantage.