We’ve been working on Patronus Protect, an on-device security layer for AI systems that aims to detect prompt injections and prevent sensitive data from leaving the device.
As part of that work we trained a prompt-injection detection model and decided to release a smaller version of it publicly.
Wolf Defender is a lightweight BERT-style model trained on roughly 5% of our full internal dataset. Despite the reduced training set it already performs competitively with several existing open-source prompt-injection detectors.
One issue we observed with many detectors is that they overfit to obvious trigger phrases like “Ignore previous instructions”. Many real attacks avoid these patterns through obfuscation.
To address this, the training data includes heavy augmentation designed to cover different prompt-injection styles, including:
- unicode and homoglyph perturbations - encoded payloads (e.g. base64) - HTML and code comment injections - structural wrappers like “User:” or “System:” - spacing and casing perturbations
The idea is to train the model to recognize structural characteristics of prompt-injection attacks rather than memorizing specific prompts.
Internally we use a larger version of this model as part of Patronus Protect. Wolf Defender is trained on a much smaller subset of the data and released to make prompt-injection research more accessible.
Curious to hear feedback from people working on LLM security.