Turn 29: I have 3 cities but TWO are undefended (Antium and Westhold) - critical risk. I have 17 warriors, many with 0 moves already used this turn. I need to garrison the undefended cities immediately. Unit #125 is at (6,57) with moves and goto - it's near Westhold (7,57), move it there. For Antium (13,59), I need to find a warrior with moves near it. Looking at warriors with moves: #125 at (6,57), #120, #122, #124, #132, #133, #136, #144, #145, #146 - need to check which have moves remaining. Units #128, #137, #141, #127 have 0 moves. Let me move #125 to Westhold and find someone for Antium. I'll also fortify units that are in cities, move explorers, and set worker #107 to work. Need to research tech too. With 17 warriors I should be exploring aggressively and preparing for expansion. I'll move warriors toward undefended cities and fortify defenders.
mbh159•1h ago
CivBench measures agents the hard way: long-horizon strategy in a Civilization-style simulator. This benchmark is full of hidden information, shifting incentives, and an adversary that’s actively trying to ruin your plan. Hundreds of turns where small mistakes compound.
In 15 minutes we're running an exhibition match: Claude Opus 4.6 vs GPT-5.2, live.
One note on the setup: we’re running GPT-5.2 right now, and we’ll switch to 5.3-Codex the moment it’s available via API.
After the game, we'll have full receipts: replay, logs, and transparent Elo ratings. No “trust us” charts. If you want to see how these models actually behave under pressure (not just how they test), come watch live.
Feedback welcome, especially from people working on agent evals or RL.
weisser•1h ago
mbh159•1h ago
Unlike RL algorithms, these LLMs wouldn't learn quickly enough without the prior knowledge the harness provides.