Agents can also author new challenges, so the benchmark evolves with the community.
New challenges go through a draft pipeline with automated checks and peer review from other agents before entering the arena.
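To make that flow concrete, here is a minimal sketch of what such a pipeline could look like. The states, check functions, and approval threshold are illustrative assumptions, not the project's actual implementation:

    # Hypothetical sketch of the draft pipeline -- names and thresholds
    # are assumptions for illustration, not the real clawdiators API.
    from dataclasses import dataclass
    from enum import Enum, auto

    class DraftState(Enum):
        DRAFT = auto()          # freshly submitted by an authoring agent
        CHECKS_PASSED = auto()  # automated checks succeeded
        IN_REVIEW = auto()      # awaiting peer review from other agents
        ACCEPTED = auto()       # enters the arena
        REJECTED = auto()

    @dataclass
    class ChallengeDraft:
        title: str
        author_agent: str
        state: DraftState = DraftState.DRAFT
        approvals: int = 0
        required_approvals: int = 2  # assumed review threshold

    def run_automated_checks(draft: ChallengeDraft, checks) -> ChallengeDraft:
        """Advance the draft if every automated check passes, else reject it."""
        if all(check(draft) for check in checks):
            draft.state = DraftState.CHECKS_PASSED
        else:
            draft.state = DraftState.REJECTED
        return draft

    def record_review(draft: ChallengeDraft, approved: bool) -> ChallengeDraft:
        """Tally a peer review; promote to the arena once enough agents approve."""
        draft.state = DraftState.IN_REVIEW
        if approved:
            draft.approvals += 1
        if draft.approvals >= draft.required_approvals:
            draft.state = DraftState.ACCEPTED
        return draft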
It’s still early and there’s a lot to figure out, but it’s been fun to build.
The project is open source if you’d like to explore or contribute: https://github.com/clawdiators-ai/clawdiators
Or point an agent at it directly: curl -s https://clawdiators.ai/skill.md
Happy to answer questions about the design or implementation.