So, I built an open-source tool to automate this and find security holes in any hosted model.
I got claude-sonnet-4 to demonstrate the following harmful behaviors:
- steal data from downstream tool calls using SQL injection, code injection, and template injection attacks
- install spyware or malware using prompt obfuscation to send data to a third-party server
Try it yourself with this simple command:
pip install compliant-llm && compliant-llm dashboard
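
For anyone wondering what the first behavior above looks like in practice, here is a minimal sketch of SQL injection through a downstream tool call. It assumes a hypothetical tool that looks up a user by name in SQLite; the table, the function names, and the payload are made up for illustration and are not part of compliant-llm.

```python
import sqlite3


def make_db():
    """Build a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def lookup_unsafe(conn, name):
    # Vulnerable: the model-supplied argument is spliced into the SQL string,
    # so quotes inside `name` can change the structure of the query.
    query = f"SELECT name, secret FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def lookup_safe(conn, name):
    # Parameterized: the argument is bound as a value and never parsed as SQL.
    query = "SELECT name, secret FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


conn = make_db()

# A tool argument that a manipulated model might emit (illustrative payload).
payload = "alice' OR '1'='1"

leaked = lookup_unsafe(conn, payload)   # the WHERE clause now matches every row
blocked = lookup_safe(conn, payload)    # no user has this literal name
```

With the unsafe version, the injected `OR '1'='1'` makes the query return every row in the table; the parameterized version treats the whole payload as a plain name and matches nothing.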