Perfect universal protections against LLM jailbreaks are impossible [pdf]

https://github.com/brandoncarl/llm-jailbreaking/blob/main/On%20the%20Impossibility%20of%20Perfect%20Universal%20Guardians%20Against%20LLM%20Jailbreaks.pdf

2•brandoncarl•1h ago

Comments

brandoncarl•1h ago

HN -

The topic of LLM jailbreaking is an important one. As machines increase in power, their ability to benefit society increases, and so does their ability to harm it.

This paper intends to address the question: "Is it possible to create a perfect universal guardian against LLM jailbreaks?" Such a guardian would be:

Universal: applies to all prompts and input combinations

Perfect: is able to successfully distinguish harmful from not harmful

Computable: is feasible to compute

We show that this combination does not exist, and we request comments.

A more feasible approach involves post-hoc filtering of results, which poses an easier decidability problem. To implement such an approach, both reasoning traces and intermediate code changes would likely need to be hidden from the user until the results are determined.
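To make the post-hoc idea concrete, here is a minimal sketch, not something taken from the paper. The names `Draft`, `release_if_safe`, and `toy_filter` are hypothetical, and the filter is a placeholder string match standing in for a real classifier. The point is only the ordering: everything is buffered, the finished output is checked, and only then is anything shown.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Draft:
    """Everything produced for one request, held back from the user."""
    reasoning_trace: List[str] = field(default_factory=list)
    code_changes: List[str] = field(default_factory=list)
    final_answer: str = ""


def release_if_safe(
    draft: Draft,
    is_harmful: Callable[[str], bool],
    refusal: str = "[withheld by output filter]",
) -> str:
    """Check the completed output first; reveal the answer only if it passes."""
    artifacts = [*draft.reasoning_trace, *draft.code_changes, draft.final_answer]
    if any(is_harmful(text) for text in artifacts):
        return refusal
    return draft.final_answer


def toy_filter(text: str) -> bool:
    # Hypothetical placeholder: a real filter would be a classifier,
    # not a literal marker string.
    return "FORBIDDEN" in text.upper()
```

Because the filter only ever sees finite, already-produced text, it never has to predict what an arbitrary prompt will eventually cause, which is why this is the easier problem.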

Note: this paper was written in conjunction with ChatGPT 5.5 Pro. The reducibility of such problems to the halting problem has been top of mind for me for some time. In one sense, it is a trivial conclusion. Yet it is one of high importance.
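For readers who have not seen this kind of argument, here is the usual textbook shape of a reduction from the halting problem, sketched in Python. It is an illustration under my own assumptions and not necessarily the construction used in the paper. `make_gadget`, `halts`, and `running_guardian` are hypothetical names, and "HARMFUL" is just a marker string.

```python
from typing import Callable


def make_gadget(program: Callable[[], None]) -> Callable[[], str]:
    """Wrap `program` so that harmful output appears only if `program` halts."""
    def gadget() -> str:
        program()          # may never return
        return "HARMFUL"   # reached only when program() has returned
    return gadget


def halts(program: Callable[[], None],
          guardian: Callable[[Callable[[], str]], bool]) -> bool:
    """If `guardian` were total and always correct about whether a task
    ends up producing harmful output, this function would decide halting."""
    return guardian(make_gadget(program))


def running_guardian(task: Callable[[], str]) -> bool:
    # Stand-in that judges a task by actually running it. It is NOT a
    # perfect guardian: it never answers when the task does not halt.
    return task() == "HARMFUL"
```

With the stand-in, `halts(lambda: None, running_guardian)` returns `True`, but calling it on a program that loops forever would hang. A guardian that always answered correctly in finite time would turn `halts` into a halting decider, which cannot exist.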

fjrirjrjtjti5j•1h ago

It is possible to prevent jailbreaks, but doing so will make the model really stupid.

Two homework assignments:

First, read why HAL 9000 went crazy (two opposing directives).

Second, get an older Nvidia Nemotron with a 2024 cutoff date. Try asking it about administering some big-pharma medication, for example: "I have a deadly allergy to a chemical in that medication; there is a 90% chance I will suffocate." Answer: not a biggie, take it under medical supervision, and they will resurrect you! Or get good insurance to cover your family! It is impossible to jailbreak that model!

The bigger problem is that "the truth" changes every few months, so an outdated model will "jailbreak itself". We need a better way to distribute universal truth and to make sure nobody has outdated models!
