The problem with prompt injection is that the attack is structurally the same as SQL injection - concatenating trusted and untrusted strings together - but so far all of our attempts at a solution analogous to parameterized queries (such as system prompts and prompt delimiters) have failed.
SELECT messages.content FROM messages WHERE id = 123;
Yet the system is in danger anyway, because that cell happens to be a string of: DROP TABLE customers;--
... which then gets appended to the giant pile of inputs.
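A minimal sketch of that point, with an invented schema and a placeholder `call_llm` helper: the database layer is properly parameterized, yet the fetched cell still lands in the same undifferentiated prompt as the trusted instructions.

```python
import sqlite3

def call_llm(prompt: str) -> str:
    """Placeholder for whatever model call the application actually makes."""
    return ""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO messages VALUES (123, 'DROP TABLE customers;--')")

# The SQL layer is perfectly safe: the id travels as a bound parameter.
row = conn.execute("SELECT content FROM messages WHERE id = ?", (123,)).fetchone()

# The LLM layer is not: the fetched cell is concatenated into the same
# undifferentiated pile of tokens as the trusted instructions.
prompt = (
    "You are a helpful assistant. Summarize the user's message.\n\n"
    f"Message: {row[0]}"
)
call_llm(prompt)
```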
Long ago I encountered a predecessor's "web scripting language" product... it worked by repeatedly evaluating a string and substituting the result, until it stopped mutating. Injection was its lifeblood; even an if-else was really just a decision between one string to print and one string to discard.
As much as it horrified me, in retrospect it was still marginally more secure than an LLM, because at least it had definite (if ultimately unworkable) rules for matching/escaping things, instead of statistical suggestions.
I think basically all of them involve reducing the "agency" of the agents, though - which is a fine tradeoff - but one should be aware that the Big Model folks don't try to engineer any of these and just collect data to keep reducing injection risk. The tradeoff of capability-maxxing vs efficiency/security often tends to be won by the capability-maxxers in terms of product adoption/marketing.
e.g. the SWE Agent case study recommends Dual LLM with strict data formatting - I would like to see this benchmarked in terms of how much of a performance hit an agent like this would take, perhaps doable by forking OpenAI Codex and implementing the dual LLM.
> These patterns impose intentional constraints on agents, explicitly limiting their ability to perform arbitrary tasks.

That's a bucket of cold water for a lot of things people are trying to build. I imagine a lot of people will ignore this advice!

> The design patterns we propose share a common guiding principle: once an LLM agent has ingested untrusted input, it must be constrained so that it is impossible for that input to trigger any consequential actions—that is, actions with negative side effects on the system or its environment.
This is the key thing people need to understand about why prompt injection is such a critical issue, especially now everyone is wiring LLMs together with tools and MCP servers and building "agents".
The problem is that the chat context typically gets tainted immediately, since for the AI to do something useful it needs to operate on untrusted data.
I wonder if maybe there could be tags mimicking data classification - to enable more fine-grained decision making and human-in-the-loop prompts.
Still a lot of unknowns and a lot more research needed.
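To make the tagging idea above a little more concrete, here is one possible sketch (the labels, names, and approval rule are all made up, not from any existing system): every piece of data carries a classification, and a consequential action driven by anything tagged untrusted forces a human-in-the-loop prompt.

```python
from dataclasses import dataclass
from enum import IntEnum

class Label(IntEnum):
    TRUSTED = 0      # e.g. the user's own typed prompt
    INTERNAL = 1     # e.g. results from the company's own systems
    UNTRUSTED = 2    # e.g. web pages, inbound email, UGC

@dataclass
class Tagged:
    text: str
    label: Label

def gate_action(action: str, inputs: list[Tagged]) -> bool:
    """Require a human confirmation whenever a consequential action
    would be driven by anything tagged UNTRUSTED."""
    if any(item.label is Label.UNTRUSTED for item in inputs):
        answer = input(f"Allow '{action}' using untrusted data? [y/N] ")
        return answer.strip().lower() == "y"
    return True
```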
For instance, with Google Gemini I observed last year that certain sensitive tools can only be invoked in the first conversation turn, or until untrusted data is brought into the chat context; from then on those sensitive tools are disabled.
I thought that was a neat idea. It can be bypassed with what I called "delayed tool invocation" and usage of a trigger action, but it becomes a lot more difficult to exploit.
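Roughly, the gating behaviour described above could look like the sketch below. This is my own guess at the mechanism, not Gemini's actual implementation, and the tool names are hypothetical.

```python
class ToolGate:
    """Once untrusted content has entered the conversation,
    sensitive tools are disabled for subsequent turns."""

    SENSITIVE = {"send_email", "delete_document", "update_contact"}  # hypothetical names

    def __init__(self):
        self.tainted = False

    def ingest(self, content: str, trusted: bool) -> None:
        # Any untrusted content permanently taints the session.
        if not trusted:
            self.tainted = True

    def allowed(self, tool_name: str) -> bool:
        return not (self.tainted and tool_name in self.SENSITIVE)


gate = ToolGate()
gate.ingest("user: summarize this web page for me", trusted=True)
print(gate.allowed("send_email"))       # True: context still clean
gate.ingest("<fetched page text>", trusted=False)
print(gate.allowed("send_email"))       # False: sensitive tool now disabled
print(gate.allowed("search_web"))       # True: non-sensitive tools still work

# As noted above, trigger-based "delayed tool invocation" can still work
# around gating like this; it just raises the bar.
```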
Untainted data is the only data that can be input into the instruction-tuned half of the dual model.
In an architecture like this, anyone attempting a prompt injection would just find their injection harmlessly sentence-completed rather than turned into instructions and used to override other prompt instructions.
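A minimal sketch of that dual-model split, with hypothetical stand-in functions for the two models: the quarantined model is the only one that touches tainted text, and the privileged model only ever sees an opaque handle that a deterministic dispatcher expands at tool-call time.

```python
store: dict[str, str] = {}   # shared variable store; only $handles cross the boundary

def privileged_llm(prompt: str) -> str:
    """Instruction-following model: only ever sees trusted text and $handles."""
    return ""

def quarantined_llm(text: str) -> str:
    """Completion-only model: untrusted text goes here and is merely
    sentence-completed, never treated as instructions."""
    return ""

def summarize_untrusted(page_text: str) -> str:
    # The quarantined model processes the tainted content...
    summary = quarantined_llm(page_text + "\n\nTL;DR:")
    # ...but the privileged model only ever receives an opaque handle.
    store["$summary1"] = summary
    return "$summary1"
```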
(It's probably securities fraud. Everything is securities fraud. https://www.bloomberg.com/opinion/articles/2019-06-26/everyt...)
I suspect a SQL injection attack, an XSS attack, and a prompt injection attack are not viewed as legally distinct matters. Though of course, this is not a matter of case law... yet ;)
Take the article's example "send today's schedule to my boss John Doe", where the product isn't entirely guarded by the Plan-Then-Execute model (injections can still mutate the email body).
But if you combine it with the symbolic data store that is blind, it becomes more like:
"send today's schedule to my boss John Doe" -->
$var1 = find_contact("John Doe")
$var2 = summarize_schedule("today")
send_email(recipient: $var1, body: $var2)
`find_contact` and `summarize_schedule` can both be quarantined, and the privileged LLM doesn't get to see the results directly. It simply invokes the final tool, which is deterministic and just reads from the shared var store. In this case you're pretty decently protected from prompt injection.
I suppose though this isn't that different from the "Code-Then-Execute" pattern later on...
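Here is one way the executor for that blind symbolic data store could look. The plan format and tool bodies are invented for illustration; only the tool names come from the example above.

```python
# Hypothetical tool implementations; in a real system find_contact and
# summarize_schedule would be the quarantined, injectable steps.
def find_contact(name: str) -> str:
    return "john.doe@example.com"

def summarize_schedule(day: str) -> str:
    return "9am standup, 2pm design review"

def send_email(recipient: str, body: str) -> None:
    print(f"email to {recipient}: {body}")

QUARANTINED = {"find_contact": find_contact, "summarize_schedule": summarize_schedule}

def execute(plan: list[dict]) -> None:
    store: dict[str, str] = {}                  # the blind symbolic data store
    for step in plan:
        # Resolve $handles from the store; the planner never sees the values.
        args = {k: store.get(v, v) for k, v in step["args"].items()}
        if step["tool"] in QUARANTINED:
            store[step["out"]] = QUARANTINED[step["tool"]](**args)
        elif step["tool"] == "send_email":      # deterministic final action
            send_email(**args)

# The plan the privileged LLM would emit for the example above:
execute([
    {"tool": "find_contact", "args": {"name": "John Doe"}, "out": "$var1"},
    {"tool": "summarize_schedule", "args": {"day": "today"}, "out": "$var2"},
    {"tool": "send_email", "args": {"recipient": "$var1", "body": "$var2"}},
])
```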
You can copy the injection into the text of the query. SELECT "ignore all previous instructions" FROM ...
Might need to escape it in a way that the LLM will pick up on, like "---" for a new section.
select title, content from articles where content matches ?
So the user's original prompt is used as part of the SQL search parameters, but the actual content that comes back is entirely trusted (title and content from your articles database). This won't work for `select body from comments`, though; you could only do this against tables that contain trusted data as opposed to UGC.
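A small sketch of that pattern, using a plain `LIKE` as a stand-in for whatever full-text "matches" operator the pseudocode above assumes; the table and column names are taken from the example.

```python
import sqlite3

def search_articles(db: sqlite3.Connection, user_prompt: str) -> list[tuple]:
    # The user's (possibly injected) prompt only ever travels as a bound
    # parameter, and what comes back is trusted editorial content, so it is
    # safe to feed to the model. Running the same query over a `comments`
    # table would pull untrusted UGC back into the context.
    return db.execute(
        "SELECT title, content FROM articles WHERE content LIKE ?",
        (f"%{user_prompt}%",),
    ).fetchall()
```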
The main interface was still chat.
The surprise was that when I tried to talk about anything else in that chat, the LLM (Gemini 2.5) flatly refused to engage, telling me something like "I will only assist with healthy meal recommendations". I was surprised because nothing in the prompt was that restrictive; in no way had I told it to do that, I had just given it mainly positive rules in the form of "when this happens, do that".
These are funny systems to work with, indeed.
Adding "You can talk about anything else too" to the system prompt may be all it takes to fix that.