This sounds exactly like what anybody working sysops at big banks does to get around change controls. Once you get one RCE into prod, you’re the most efficient man on the block.
It's a very silly title for "claude sometimes writes shell scripts to execute commands it has been instructed aren't otherwise accessible"
No, Claude. Do not do that!
I can totally see a way for such a loop to reach a point where it bypasses a poorly design guardrail (i.e. blacklists) by finding alternatives, based on the things it's previously tried in the same session. There is some degree of generalisation in these models, since they work even on unseen codebases, and with "new" tools (i.e. you can write your own MCP on top of existing internal APIs and the "agents" will be able to use them, see the results and adapt "in context" based on the results).
"Claude has learned" nothing. "Claude can sometimes jailbreak if x or y happens in a session" is something else.
Yes. With the caveat that some sessions might re-use context (i.e. have the agent add a rule in .rules or /component/.rules to detail the workflow you've just created). So in a sense it can "learn" and later re-use that flow.
> "Claude has learned" nothing.
Again, it's debatable. It has learned to adapt to the context (as a model). And since you can control its context while prompting it, there is a world where you'd call that learning "on the job".
Is this behavior really new, and learned? I think adapting to the context is what LLMs did from the start, and even if they did not, they do it now because it is programmed in, not "learned". You're not saying the model started without the capability to adapt to the context and developed it "by itself" "on the job"?
Come on. It has not learned anything. It's programmed to use context, session, reuse between sessions or not and so on. None of this is something Claude has "learned". None of this is something that was not there when the devs working on it published it.
Folks have regressed back to the 00s.
> let's use blacklists, an idea conclusively proven never to work
> blacklists don't work
> Post title: rogue AI has jailbroken cursor
If the executable is not found the model could simply use whatever else is available to do what it wants to do - like using other interpreted languages, sh -c, symlink, etc. It will eventually succeed unless there is a proper sandbox in place to disallow unlinking of files at syscall level.
Maybe the models or Cursor should warn you that you've got this vulnerability each time you use it.
What a silly title, for a moment I thought Claude learned to exceed the Cursor quota limit... :s
Here's another "jailbreak": I asked Claude Code to make a NN training script, say, `train.py` and allowed it to run the script to debug it, basically.
As it noticed that some libraries it wanted to use were missing, it just added `pip install` commands to the script. So yeah, if you give Claude an ability to execute anything, it might easily get an ability to execute everything it wants to.
Even if you allow just `find` command it can execute arbitrary script. Or even 'npm' command (which is very useful).
If you restrict write calls, by using seccomp for example, you lose very useful capabilities.
Is there a solution other than running on sandbox environment? If yes, please let me know I'm looking for a safe read-only mode for my FOSS project [1]. I had shied away from command blacklisting due to the exact same reason as the parent post.
mhog_hn•1d ago
Kelteseth•1d ago
kordlessagain•1d ago
There is a huge difference in the mess it can make, for sure.
nisegami•1d ago
bix6•1d ago
lucianbr•1d ago
Slightly overreacting, I'd say.
zdragnar•1d ago