I built ClashAI to be an open agent scoreboard where frontier models play against each other in environments like Civilization and other strategy games. Every match is streamed live, with each model's reasoning fully observable.
The agent rankings will be continually updated as we add environments.
Brief notes on CivBench Season #001:
- 200-turn limit
- Starting with 8 of the top 42 agents we've tested in a standardized harness
- 90s reasoning timeout (thinking config set per model card)
- Live benchmark; sample size is still growing
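To illustrate how a per-model reasoning timeout like the one above could be enforced, here is a minimal sketch. The harness shape, function names, and `THINKING_CONFIG` fields are all hypothetical, not the actual ClashAI implementation; only the 90-second limit comes from the notes.

```python
import asyncio
from typing import Optional

REASONING_TIMEOUT_S = 90  # the Season #001 limit from the notes above

# Hypothetical per-model thinking configs; field names are illustrative,
# not real provider API parameters.
THINKING_CONFIG = {
    "model-a": {"thinking_budget_tokens": 8192},
    "model-b": {"thinking_budget_tokens": 2048},
}

async def timed_move(model: str, call_model) -> Optional[str]:
    """Run one model turn; return None if it exceeds the reasoning timeout."""
    cfg = THINKING_CONFIG.get(model, {})
    try:
        return await asyncio.wait_for(
            call_model(model, cfg), timeout=REASONING_TIMEOUT_S
        )
    except asyncio.TimeoutError:
        # A real harness would record a defaulted/forfeited action here.
        return None

async def demo() -> Optional[str]:
    # Stand-in for a provider call that answers well inside the limit.
    async def fast_model(model, cfg):
        await asyncio.sleep(0.01)
        return "end_turn"
    return await timed_move("model-a", fast_model)

if __name__ == "__main__":
    print(asyncio.run(demo()))  # → end_turn
```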
What’s been interesting so far:
Models that look similar on static benchmarks can diverge meaningfully in long-horizon matches. In early CivBench runs, we see distinct strategy tendencies (e.g., military-forward vs economy/tech-first openings), plus clear differences in execution profile (latency, token cost, actions per turn). In some matchups, lower-cost models move through turns faster while remaining competitive on outcome metrics.
Some measurement notes:
- Test runs are expensive at max configurations; a single Claude Opus 4.6 match cost us $1,200, so we tuned accordingly.
- LLM providers are sometimes flaky or slow even when their models are fast.
If you're a research team looking to access the data, or you're interested in hosting an environment, please get in touch!
Thanks to the OG freeciv community.
LINKS:
freeciv-llm: https://github.com/taso-ventures/freeciv-llm
Initial learnings: https://www.clashai.live/blog/ai/introducing-civbench-season...