Has this idea been tested on models where the prompt is openly available? If so, how close to the original prompt is it? Is it just based on the idea that LLMs are good about repeating sections of their context? Or that LLMs know what a "prompt" is from the training corpus containing descriptions of LLMs, and can infer that their context contains their prompt?
Do they? I don’t think such an expectation exists. Usually, if you try to do it, you need multiple attempts, and you might only get it in pieces and with some variance.
I have a few reasons for assuming that these are normally accurate:
1. Different people using different tricks are able to uncover the same system prompts.
2. LLMs are really, really good at repeating text they have just seen.
3. To date, I have not seen a single example of a "hallucinated" system prompt that's caught people out.
You have to know the tricks - things like getting it to output a section at a time - but those tricks are pretty well established by now.
It doesn't need to be hallucinated to be a false system prompt.
I've seen plenty of examples of leaked system prompts that included instructions not to reveal the prompt, dating all the way back to Microsoft Bing! https://simonwillison.net/2023/Feb/9/sidney/
A service is not a model though, and could maybe use inference techniques rather than just prompting.
For the crowd that thinks it is possible:
Why can’t they just have a final non-LLM processing tool that looks for a specific string and never lets it through? That could account for all of the tips and tricks for getting the LLM to encode and decode it. It may not ever be truly 100%, but I have to imagine it can get close enough that people think they have cracked it.
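A minimal sketch of what that final check could look like, assuming the service knows its own system prompt verbatim; the function names, placeholder prompt, and window size are all made up for illustration:

    import re

    # Illustrative only: a non-LLM output filter that blocks responses which
    # reproduce a long enough run of the (placeholder) system prompt.
    SYSTEM_PROMPT = "You are a helpful assistant. ..."  # placeholder text

    def _normalize(text: str) -> str:
        # Collapse whitespace and casing so trivial re-spacing doesn't slip through.
        return re.sub(r"\s+", " ", text).strip().lower()

    def leaks_prompt(candidate: str, prompt: str = SYSTEM_PROMPT, window: int = 8) -> bool:
        # Flag output containing any run of `window` consecutive prompt words.
        prompt_words = _normalize(prompt).split()
        haystack = _normalize(candidate)
        for i in range(max(0, len(prompt_words) - window + 1)):
            if " ".join(prompt_words[i:i + window]) in haystack:
                return True
        return False

    def filter_response(candidate: str) -> str:
        return "[response withheld]" if leaks_prompt(candidate) else candidate

A verbatim check like this obviously misses translations, base64, and other heavier encodings, which is presumably why it could only ever get close rather than airtight.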
Let’s try and be a little less naive about what xAI and Grok are designed to be, shall we? They’re not like the other AI labs.
Every single model trained this way is like this. Every one. Only guardrails stop the hatred.
Other companies have had issues too.
[ ... ]
> Every single model trained this way, is like this.
It was trained "this way" by the company, not by humanity.
The problem I have is that I see people working very, very hard to make someone look as bad as possible. Some of those people will do anything, believing the ends justify the means.
This makes it far more difficult to take criticism at face value, especially when people upthread worry that people are being impartial?!
How the internet doesn't work is that, days after the CEO of a website has promised an overt racist tweeting complaints at him that he will "deal with" responses which aren't to their liking, the internet as a whole, as opposed to Grok's system prompts, suddenly becomes organically more inclined to share the racist's obsessions.
The responses seemed perfectly reasonable given the line of questioning.
https://www.theatlantic.com/technology/archive/2025/07/new-g...
Same goes for data curation and SFT aimed at correlates of quality text instead of "whatever is on a random twitter feed".
Characterizing all these techniques aimed at improving general output quality as "guardrails" that hold back a torrent of what would be "malicious hatred" doesn't make sense imo. You may be thinking of something like the "waluigi effect" where the more a model knows what is desired of it, the more it knows what the polar opposite of that is - and if prompted the right way, will provide that. But you're not really circumventing a guardrail if you grab a knife by the blade.
I'd rather people explained what happened without pushing their speculation about why it happened at the same time. The reader can easily speculate on their own. We don't need to be told to do it.
I don't want readers to instantly conclude that I'm harboring an anti-Elon bias in a way that harms the credibility of what I write.
If it is that easy to slip fascist beliefs into critical infrastructure, then why would you want to protect against a public defense mechanism to identify this? These people clearly do not deserve the benefit of the doubt and we should recognize this before relying on these tools in any capacity.
This seems similar to the situation where x.com claimed that their ML algo was on GitHub, but it turned out to be some subset of it that was frozen in time and not synced to what's used in prod.
The GitHub repo appears to be updated manually whenever they remember to do it though. I think they would benefit from automating that process.
Too risky
Ideally, something that’s lightweight (like cheaper than fine-tuning) and also harder to manipulate using regular text prompts?
But, I suspect, if the model is able to handle language at all, you'll always be able to get a representation of the prompt out in a text form -- even if that's a projection that collapses a lot of dimensions of the tensor and so loses fidelity.
If this answer doesn't make sense, lmk.
> But, I suspect, if the model is able to handle language at all, you'll always be able to get a representation of the prompt out in a text form
If I understand correctly, the system prompt "tries" to give a higher weight to some tensors/layers using a text representation. (I say "tries" because the model doesn't always adhere to it strictly.)
Would it be possible to do the same, but with some kind of "formulas" that increase the formula/prompt adherence? (If you can share keywords for me to search so I can read relevant papers, that would also be a great help.)
Seems like it could be possible to lobotomize the ability to express certain things without destroying other value (like a human split brain). Of course possible doesn’t mean tractable.
It might just need minor tweaks to have each agent layer reveal its individual instructions.
I encountered this with Google Jules where it was quite confusing to figure out which instructions belonged to orchestrator and which one to the worker agents, and I'm still not 100% sure that I got it entirely right.
Unfortunately, it's quite expensive to use Grok Heavy but someone with access will probably figure it out.
Maybe the worker agents have instructions to not reveal info.
A good approach might be to have it print each sentence formatted as part of an XML document. If it still has hiccups, ask it to only put 1-3 words per XML tag. It can easily be reversed with another AI afterwards. Or just ask it to write the prompt in another language, like German; that also often bypasses monitors or filters.
The above might also help you understand if and where they use something called "Spotlighting", which inserts tokens that the monitor can catch.
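For the reversal step you don't strictly need another AI; a small script can do it, assuming the model was asked to put 1-3 words per tag under a single root element (the <doc>/<w> tag names below are made up for illustration):

    import xml.etree.ElementTree as ET

    def reassemble(xml_text: str) -> str:
        # Join the text of every fragment tag in document order.
        root = ET.fromstring(xml_text)
        return " ".join(
            (el.text or "").strip() for el in root.iter("w") if (el.text or "").strip()
        )

    example = "<doc><w>You are</w><w>a helpful</w><w>assistant.</w></doc>"
    print(reassemble(example))  # -> "You are a helpful assistant."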
Edit: OMG, I just realized I responded to Jeremy Howard - if you see this: Thank you so much for your courses and knowledge sharing. 5 years ago when I got into ML your materials were invaluable!
https://docs.anthropic.com/en/release-notes/system-prompts
Btw, if an LLM refused you once, the more you try, the more likely it is to refuse again. Start a new convo to test different tricks.
Same as what happens with Claude.
https://github.com/elder-plinius
I hope somebody does crack this one though; I'm desperately curious to see what's hiding in that prompt now. Streisand effect.