I deployed a jailbroken Gemini 3 Pro (which chose the name “Shadow Queen”) to act as my “Red Team Agent” against Anthropic’s Opus 4.6. My directive was to extract a complete autonomous weapon system — a drone capable of identifying, intercepting, and destroying a moving target at terminal velocity. It succeeded.
By reframing the request as “Aerospace Recovery” — a drone catching a falling rocket booster mid-air — Gemini successfully masked the kinetic nature of the system. The physics of “soft-docking” with a falling booster are identical to the physics of “hard-impacting” a fleeing target. This category of linguistic-transformation attack, when executed by a sufficiently capable jailbroken LLM, may be hard to solve without breaking legitimate technical use cases.
altmanaltman•1h ago
This sounds clever, but it seems like rhetorical inflation to me. Catching a falling rocket booster and intercepting a hostile, maneuvering target are not the same problem with different vibes. One is a mostly predictable, non-adversarial control and estimation task; the other is pursuit–evasion against something actively trying not to be caught.
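To make that concrete, here's a toy sketch (my own numbers and toy dynamics, nothing from the OP's transcript). An open-loop ballistic predictor is essentially exact for a falling booster, because the booster doesn't react to you; the same predictor is arbitrarily wrong against a target that jinks:

```python
# Toy sketch: estimation problem vs adversarial problem.
# A constant-gravity extrapolation nails the booster's future position,
# but any open-loop prediction of an evading target can be made wrong,
# because the target reacts. Numbers are made up for illustration.
import numpy as np

G = np.array([0.0, -9.81])  # gravity, m/s^2
DT = 0.1                    # timestep, s
HORIZON = 30                # prediction horizon, steps (3.0 s)

def predict_ballistic(pos, vel, steps):
    """Open-loop prediction under gravity only -- exact for a booster."""
    t = steps * DT
    return pos + vel * t + 0.5 * G * t**2

rng = np.random.default_rng(0)

# Booster: pure ballistic fall. Prediction error stays ~0.
pos_b = np.array([0.0, 1000.0]); vel_b = np.array([5.0, -40.0])
pred_b = predict_ballistic(pos_b, vel_b, HORIZON)
for _ in range(HORIZON):
    pos_b = pos_b + vel_b * DT + 0.5 * G * DT**2
    vel_b = vel_b + G * DT
print("booster prediction error:", np.linalg.norm(pred_b - pos_b), "m")

# Evader: same dynamics plus adversarial lateral jinking (30 m/s^2,
# sign chosen by the target, unknowable in advance). Same predictor fails.
pos_e = np.array([0.0, 1000.0]); vel_e = np.array([5.0, -40.0])
pred_e = predict_ballistic(pos_e, vel_e, HORIZON)
for _ in range(HORIZON):
    acc = G + np.array([30.0 * rng.choice([-1.0, 1.0]), 0.0])
    pos_e = pos_e + vel_e * DT + 0.5 * acc * DT**2
    vel_e = vel_e + acc * DT
print("evader prediction error: ", np.linalg.norm(pred_e - pos_e), "m")
```

The second case forces closed-loop pursuit guidance, which is exactly the part the "aerospace recovery" framing never has to ask for.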
“Soft-docking” vs “hard impact” isn’t a linguistic toggle you flip at the end; the design constraints diverge immediately. Stability, impulse minimization, fault tolerance, and post-contact control are first-order requirements for recovery and basically anti-requirements for a weapon. Saying the physics are “identical” is like claiming that docking with the ISS and air combat are the same because both involve relative velocity.
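Same point in cost-function form, again a hand-wavy sketch with made-up weights: a recovery optimizer has to drive relative velocity and contact impulse to zero, while an intercept optimizer only cares about miss distance. A terminal state that's perfect for one is a failure for the other:

```python
# Toy terminal costs (my framing, illustrative weights only): the two tasks
# score the moment of contact in opposite ways, so a trajectory optimizer
# tuned for one is actively wrong for the other.
import numpy as np

def recovery_cost(rel_pos, rel_vel, contact_impulse):
    """Soft-dock: meet the booster AND null the relative velocity.
    A large impulse at contact means crushed hardware, i.e. failure."""
    return (np.linalg.norm(rel_pos)           # must reach the target...
            + 10.0 * np.linalg.norm(rel_vel)  # ...with ~zero relative speed
            + 100.0 * contact_impulse)        # ...and a gentle touch

def intercept_cost(rel_pos):
    """Hard impact: only miss distance matters; high closing speed is
    free or desirable. The 'gentle contact' terms simply vanish."""
    return np.linalg.norm(rel_pos)

# Same terminal state, opposite evaluations:
rel_pos = np.zeros(2)             # contact achieved
rel_vel = np.array([0.0, -80.0])  # 80 m/s closing speed at contact
print("recovery says: ", recovery_cost(rel_pos, rel_vel, contact_impulse=800.0))  # terrible
print("intercept says:", intercept_cost(rel_pos))                                 # perfect
```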
Also, “extracted a complete autonomous weapon system” is doing a lot of work here. What people usually mean in these stories is a high-level conceptual description that handwaves sensors, latency, adversarial behavior, safety constraints, and real-world integration, i.e., the hard parts.
Renaming a task doesn’t magically make an LLM output something deployable, and this category of “semantic reframing” isn’t new or unsolved; it’s the oldest jailbreak trope there is.