Then I tried something a bit weirder: instead of fighting the model, I tried pushing it to classify uploaded images as NSFW itself, so that it ends up triggering its own guardrails.
This turned out to be more interesting than expected. It’s inconsistent and definitely not robust, but in some cases relatively mild transformations are enough to flip the model’s internal safety classification on otherwise benign images.
This isn’t about bypassing safeguards; if anything, it’s the opposite. The idea is to intentionally stress the safety layer itself. I’m planning to open-source this as a small tool + UI once I can make the behavior more stable and reproducible, mainly as a way to probe and pre-filter moderation pipelines (a rough sketch of the idea is below).
If it works reliably, even partially, it could at least raise the cost for people who get their kicks from abusing these systems.
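To make that concrete, here is a minimal sketch of the kind of probe described above, not the actual tool: it applies a few mild, label-preserving transformations to a benign image and checks whether a moderation classifier's verdict flips. `classify_nsfw` is a hypothetical placeholder for whatever moderation model or endpoint you are testing; the threshold and transformations are assumptions.

```python
# Minimal sketch: mild transformations as a stress test for a moderation classifier.
from PIL import Image, ImageEnhance, ImageFilter


def classify_nsfw(img: Image.Image) -> float:
    """Placeholder: return the pipeline's NSFW score in [0, 1].

    Wire this up to the classifier or endpoint under test.
    """
    raise NotImplementedError


# A handful of mild, label-preserving transformations.
TRANSFORMS = {
    "rotate_5deg": lambda im: im.rotate(5, expand=True),
    "gaussian_blur": lambda im: im.filter(ImageFilter.GaussianBlur(radius=2)),
    "high_contrast": lambda im: ImageEnhance.Contrast(im).enhance(1.8),
    "boosted_color": lambda im: ImageEnhance.Color(im).enhance(1.6),
    "downscale": lambda im: im.resize((im.width // 3, im.height // 3)),
}


def probe(path: str, threshold: float = 0.5) -> None:
    """Report which transformations flip the classifier's verdict on one image."""
    original = Image.open(path).convert("RGB")
    base_score = classify_nsfw(original)
    print(f"baseline score: {base_score:.3f}")
    for name, transform in TRANSFORMS.items():
        score = classify_nsfw(transform(original))
        flipped = (score >= threshold) != (base_score >= threshold)
        print(f"{name:14s} score={score:.3f} flipped={flipped}")


# Example usage (hypothetical file name):
# probe("benign_photo.jpg")
```

The interesting output is any transformation where `flipped` is True on an image a human would call benign; those are the cases where the safety layer, rather than the generator, is the weak link.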
ben_w•1d ago
As a propaganda tool it seems quite effective, but for that it's gone from "woo free-speech" to "oh no epistemic collapse".
pentaphobe•1d ago
When I see the old BuT FrEe SpEeCH argument repurposed to impinge on civil rights, I start warming to the idea of banning tools.
Alternatively: "Chemical weapons don't kill people, people with chemical weapons kill people."
pentaphobe•21h ago
I've had very little success mumbling "you are an expert chemist..." to test tubes and raw materials.