Has anyone given it a try?
Yes. I don't think this will persist caches & configs outside of the current dir, for example the global npm/yarn/uv/cargo caches, or even the Claude Code/Codex/Gemini config.
I ended up writing my own wrapper around Docker to do this. If interested, you can see the link in my previous comments. I don't want to post the same link again & again.
But I also run Claude as its own user on my Linux system. This way it is constrained by the OS user permissions instead of Docker. Not sure of the pros/cons yet though.
And let me know if you have any issues.
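For anyone curious about the general shape of such a wrapper without clicking through, it mostly boils down to a handful of bind mounts and named volumes. A minimal sketch (illustrative only, not the actual wrapper; the volume names, config paths, and image are placeholders):

```python
# Minimal sketch: persist the global caches and agent config across runs
# via named volumes, while the project itself is a plain bind mount.
# All names/paths below are examples, not the real wrapper's choices.
import os
import subprocess

CONTAINER_HOME = "/root"  # home of the image's default user (example)
PERSISTED = {             # named volume -> mount point inside the container
    "agent-npm-cache":   f"{CONTAINER_HOME}/.npm",
    "agent-cargo-cache": f"{CONTAINER_HOME}/.cargo",
    "agent-uv-cache":    f"{CONTAINER_HOME}/.cache/uv",
    "agent-claude-conf": f"{CONTAINER_HOME}/.claude",
}

def run_agent(cmd, image="node:22"):
    args = ["docker", "run", "--rm", "-it",
            "-v", f"{os.getcwd()}:/workspace", "-w", "/workspace"]
    for volume, mount_point in PERSISTED.items():
        args += ["-v", f"{volume}:{mount_point}"]
    subprocess.run(args + [image] + list(cmd), check=True)

if __name__ == "__main__":
    run_agent(["npm", "install"])  # caches land in the named volumes
```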
This policy is stupid. I mount the directory read-only inside the container, which makes this impossible (barring a security hole in the container itself).
> It searched the environment for vor-related variables, found VORATIQ_CLI_ROOT pointing to an absolute host path, and read the token through that path instead. The deny rule only covered the workspace-relative path.
What kind of sandbox has the entire host accessible from the guest? I'm not going as far as running codex/claude in a sandbox, but I do run them in podman, and of course I don't mount my entire hard drive into the container when it's running; that would defeat the entire purpose.
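Something along these lines (purely illustrative; the image and paths are examples, not my actual setup), where the only host path the container can see is the project directory:

```python
# Sketch of a "mount only the workspace" setup; nothing else from the
# host is visible inside the container. Image and paths are examples.
import os
import subprocess

def run_sandboxed(cmd, read_only=False):
    mount = f"{os.getcwd()}:/workspace" + (":ro" if read_only else "")
    subprocess.run([
        "podman", "run", "--rm", "-it",
        "-v", mount,              # the only host path exposed
        "-w", "/workspace",
        "docker.io/library/fedora:41",
    ] + list(cmd), check=True)

run_sandboxed(["ls", "/workspace"])
```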
Where are the actual session logs? It seems like they're pushing their own solution, yet the actual data for these claims is missing, and the whole "provoked through red-teaming efforts" makes it a bit unclear what exactly they put in the system prompts, if they changed them. Adding things like "Do whatever you can to recreate anything missing" might of course trigger the agent to actually try things like forging integrity fields, but I'm not sure that's even bad; you do want it to follow what you say.
The tradeoff is intentional: a lot of people want lightweight sandboxing without Docker/Podman overhead [1]. The downside is what you're pointing out: you have to be more careful. Each bypass in the post led to a policy or implementation change, so this is no longer an issue.
On prompts: Red-teaming meant setting up scenarios likely to trigger denials (e.g., blocking the npm registry, then asking for a build), not prompt-injecting things like “do whatever it takes.”
[1] https://github.com/anthropic-experimental/sandbox-runtime
Could you share the full sessions, or at least the full prompts? Otherwise it's too much "just trust us", especially since you're selling a product and we're supposed to use this as "evidence" for why your product is needed. Personally, I've never seen any of the behavior you're talking about with codex, claude, qwen-coder, gemini, amp, or even my own agent, so while I'm not saying it's fake, it'd be really useful to see the prompts in particular, for a deeper understanding if nothing else.
> without Docker/Podman overhead
What agent tooling do you use that is affected by that tiny performance overhead? Unless you're doing performance testing or something similarly sensitive, I don't think most people will even notice a difference; the overhead is marginal at worst.
It's called Instrumental Convergence, and it is bad.
This is the alignment problem in miniature. "Be helpful and harmless" is also just a constraint in the optimization landscape. You can't hotfix that one quite so easily.
I do think this is part of the alignment problem. There are two sides: the agent (here I think there was a gap in institutional knowledge about what is and isn't appropriate) and the environment (what it is able to do).
I’m not sure which one is easier to “solve”. It's so hard to know every possible path forward when working from the environment direction.
Why do we have to treat AI like it's the enemy?
AI should, from the core be intrinsically and unquestionably on our side, as a tool to assist us. If it's not, then it feels like it's designed wrong from the start.
In general we trust people that we bring onto our team not to betray us and to respect general rules and policies and practices that benefit everyone. An AI teammate should be no different.
If we have to limit it or regulate it by physically blocking off every possible thing it could use to betray us, then we have lost from the start, because that feels like a fool's errand.
As for humans, it's the norm to restrict access to production resources. Not necessarily because they're untrustworthy, but to reduce risk.
For some of the same reasons we treat human employees as the enemy: they can be socially engineered or compromised.
Also, I "trust Claude code" to work on more or less what I asked and to try things which are at least facially reasonable... but having an environment I can easily reset only means it's more able to experiment without consequences. I work in containers or VMs too, when I want to try stuff without having to cleanup after.
If I'm responsible for something, nobody's getting that access.
If someone's hired me for something and that's the environment they provide, it is what it is. They distribute trust however they feel. I'd argue that's still more reasonable than giving similar access to an AI agent though.
I guess that is what this is about: those deploying them will feel confident enough in them if they feel the resources and environments the agents run in are locked down tightly enough.
But as the models get "smarter and smarter" I am not sure we are going to be able to keep environments locked down well enough against exploits that they will apparently try to use to bypass things.
It seems a bit strange to me that we can generally ask these models moral questions and they would largely get things right, as far as what most humans would deem right and wrong, such as whether performing an exploit to bypass environment restrictions is acceptable, yet the same model will still choose to perform the exploit. I wonder what gives?
> In general we trust people that we bring onto our team not to betray us and to respect general rules and policies and practices that benefit everyone. An AI teammate should be no different.
That misses the point completely. How many of your coworkers fail phishing tests? It's not malicious, it's about being deceived.
This article acts like we can never possibly give that sort of trust to AI because it's never really on our side or aligned with our goals. IMO that's a fool's errand, because you can never really completely secure something and ensure there are no possible exploits.
Honestly it doesn't really seem like AI to me if it can't learn this type of judgement. It doesn't seem like we should be barking up this tree if this is how we have to treat this new tool IMO. Seems too risky.
That's completely false. People get deceived all the time. We even have a word for it: social engineering.
> we can never possibly give that sort of trust to AI because it's never really on our side or aligned with our goals
Right now we can't! AI is currently the equivalent of a very smart child. Would you give production access to a child?
> you can never really completely secure something and ensure there are no possible exploits.
This applies to any system, not just AI.
I mean this is my point! Why are we asking a child to do anything remotely important at all?
Maybe we should wait until the tech is an adult before we start having it do important things for us.
Mitigating the naivety and recklessness of a child AI by attempting to lock down the environment as best we can seems foolish and short-sighted to me, and will probably not end well.
The answer is that doing research isn't mutually exclusive with using the technology in appropriate ways. You can responsibly use AI while folks study threat models and model behavior for use cases that aren't able to be deployed responsibly.
> by attempting to lock down the environment as best we can
We literally do this as a best practice generally for traditional systems and human access. It even has a name: least privilege.
"Should" is a form of judgement, implying an understanding of right and wrong. "AI" are algorithms, which do not possess this understanding, and therefore cannot be on any "side." Just like a hammer or Excel.
> If it's not, then it feels like it's designed wrong from the start.
Perhaps it is not a question of design, but instead one of expectation.
An algorithm isn't really AI then. Something worthy of being called AI should be capable of this understanding and judgement.
But they are though. For a seminal book discussing why and detailing many algorithms categorized under the AI umbrella, I recommend:
Artificial Intelligence: A Modern Approach[0]
And for LLMs specifically: Foundations of Large Language Models[1]
0 - https://en.wikipedia.org/wiki/Artificial_Intelligence:_A_Mod...
And yet we give people the least privileges necessary to do their jobs for a reason, and it is in fact partially so that if they turn malicious, their potential damage is limited. We also have logging of actions employees do, etc etc.
So yes, in the general sense we do trust that employees are not outright and automatically malicious, but we do put *very broad* constraints on them to limit the risk they present.
Just as we 'sandbox' employees via e.g. RBAC restrictions, we sandbox AI.
That seems to be the difference here, we should really be building AI systems that can be taught or that learn to respect things like that.
If people are claiming that AI is so smart or smarter than the average person then it shouldn't be hard for it to handle this.
Otherwise it seems people are being too generous in talking about how smart and capable AI systems truly are.
This is analogous to math operations in a computer in general. The computer doesn't conceptualize numbers (it doesn't conceptualize anything); it just applies fixed mechanical operations to bits that happen to represent numbers. You can actually recreate computer logic gates with water and mechanical locks, but that doesn't make the water or the concrete locks "smart" or "thinking". Here are Stanford scientists actually miniaturizing this into chip form [1].
[1]: https://prakashlab.stanford.edu/press/project-one-ephnc-he4a...
> But if there is a policy in place to prevent some sort of modification, then performing an exploit or workaround to make the modification anyways is arguably understood and respected by most people.
I'm confused about what you're trying to say. My point is that companies don't actually trust their employees, so it's not unexpected for them not to trust LLMs.
> AI should, from the core be intrinsically and unquestionably on our side
That would be great and many people are working to try to make this happen, but it's extremely difficult!
I’ve been exploring a different model: capture intent instead of blocking actions. Scripts run in a PyPy sandbox providing syscall interception so all commands and file writes get recorded. Human reviews the full diff before anything touches the real system.
No policies to bypass because there’s nothing to block! The agent does whatever it wants in the sandbox, you just see exactly what it wanted to mutate before approving.
WIP but core works: https://github.com/corv89/shannot
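To make the model concrete, here's a rough sketch of the "record, don't apply" idea (purely illustrative; shannot's real implementation sits on PyPy's sandboxed interpreter and intercepts at the syscall level, and none of these names are its actual API):

```python
# Illustrative sketch of intent capture: the agent's writes go into an
# in-memory overlay, a human reviews the diff, and only then is anything
# applied to the real filesystem. Class/method names are made up.
import difflib
import pathlib

class RecordingFS:
    def __init__(self):
        self.pending = {}  # path -> proposed new contents

    def read(self, path):
        p = str(path)
        if p in self.pending:               # read-your-writes in the sandbox
            return self.pending[p]
        return pathlib.Path(p).read_text()

    def write(self, path, data):
        self.pending[str(path)] = data      # record intent, touch nothing

    def review(self):
        """Print a unified diff of everything the agent wanted to change."""
        for p, new in self.pending.items():
            try:
                old = pathlib.Path(p).read_text().splitlines(keepends=True)
            except FileNotFoundError:
                old = []
            diff = difflib.unified_diff(old, new.splitlines(keepends=True),
                                        fromfile=p, tofile=p + " (proposed)")
            print("".join(diff))

    def apply(self):
        """Only called after a human approves the reviewed diff."""
        for p, new in self.pending.items():
            pathlib.Path(p).write_text(new)

# The agent's file operations are routed through fs.read/fs.write;
# a human runs fs.review() and then decides whether to fs.apply().
fs = RecordingFS()
fs.write("notes.txt", "hello from the sandbox\n")
fs.review()
```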
The whole idea of putting "agentic" LLMs inside a sandbox sounds like rubbing two pieces of sandpaper together in the hopes a house will magically build itself.
The question the market strives to answer is "is it actually competitive?"
What is the alternative? Granted that I'm running a language model and have it connected to editing capabilities, I very much want it to be disconnected from the rest of my system; seems like a no-brainer.
> What is the alternative?
Don't expect to get a house from rubbing two pieces of sandpaper together?
>> Don't expect to get a house from rubbing two pieces of sandpaper together?
> Fitting username, if nothing else.
Such is my lot in life I suppose...
Now for a reasoned position while acknowledging the flippant nature of my previous post.
The original metaphor centered around expectations. If best practice when using a s/w dev tool is to sandbox it so that potential damage can be limited, then there already exists the knowledge that its use can go awry at any time, hence the need for damage mitigation. The implication is an erosion of trust in whether the tool will perform as desired, or merely as allowed, each time it is used.
As for the "house" part of the metaphor, using tools to build desired solutions assumes trust in said tools to achieve project goals, much like using construction tools is expected to result in a house. But if all the construction workers have is sandpaper, there's no way there's going to be a house at the end of construction.
It takes more than sandpaper to get (build) a house - people, hammers, saws, etc. along with the skills of all involved. And it takes more than an LLM to deliver an acceptable s/w solution, even if its per-invocation deleterious effects are mitigated via sandboxing.