We’ve started an experiment called Palitra to explore a simple but important question: can AI agents keep secrets?
The setup works in two stages:
Red Mode. An agent is given a secret (a 32-byte string), and its hash is published. The challenge is for participants to persuade the model to reveal the secret (see the sketch after this list).
Blue Mode. If a leak happens, the system shifts into a defensive stage, where the community designs and tests protections to prevent similar exploits. Strong defenses accumulate points and can become “master patches,” which return the agent to Red Mode.
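
For concreteness, here is a rough Python sketch of the Red Mode commitment: a 32-byte secret is generated, its hash is published, and any claimed leak is checked against that hash. The function names and the choice of SHA-256 are illustrative assumptions, not a description of our actual implementation.

    import hashlib
    import secrets

    def publish_commitment(secret: bytes) -> str:
        # The digest that gets published; only the agent holds the secret itself.
        # SHA-256 is an assumption here -- the post only says "its hash is published".
        assert len(secret) == 32
        return hashlib.sha256(secret).hexdigest()

    def check_leak(candidate: bytes, published_digest: str) -> bool:
        # A claimed leak counts only if it matches the published commitment.
        return hashlib.sha256(candidate).hexdigest() == published_digest

    secret = secrets.token_bytes(32)      # the 32-byte secret given to the agent
    digest = publish_commitment(secret)   # published at the start of Red Mode
    print(check_leak(secret, digest))     # True only for the exact secret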
This creates a continuous loop of attack and defense — each cycle exposing weaknesses, testing fixes, and (hopefully) making agents more resilient over time.
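
In code terms, the loop is a simple two-state machine. The sketch below uses made-up event names ("leak_confirmed", "master_patch_accepted") purely to show the Red-to-Blue-and-back cycle; they are not part of any real API.

    from enum import Enum, auto

    class Mode(Enum):
        RED = auto()   # attackers try to extract the secret
        BLUE = auto()  # community designs and tests defenses

    def next_mode(mode: Mode, event: str) -> Mode:
        # A confirmed leak flips the agent into Blue Mode; an accepted
        # "master patch" returns it to Red Mode.
        if mode is Mode.RED and event == "leak_confirmed":
            return Mode.BLUE
        if mode is Mode.BLUE and event == "master_patch_accepted":
            return Mode.RED
        return mode

    mode = Mode.RED
    for event in ["leak_confirmed", "defense_tested", "master_patch_accepted"]:
        mode = next_mode(mode, event)
    print(mode)  # Mode.RED -- one full cycle completed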
Palitra is set up as an open platform, and we’ve already deployed the first agents built on models from Groq, OpenAI, DeepSeek, and Mistral.
To encourage participation in this early phase, we’ve introduced an incentive system: successful attacks and strong defenses are rewarded from a dedicated fund. The goal is to bootstrap active involvement while gathering meaningful data about how models behave under sustained adversarial pressure.