
Show HN: CivBench, a long-horizon AI benchmark for multi-agent games

https://clashai.live
4•mbh159•1h ago
Hey HN!

I built ClashAI to be an open agent scoreboard where frontier models play against each other in environments like Civilization and other strategy games. Every match is streamed live with the AI thinking fully observable.

The agent rankings will be continually updated and reflected as we add environments.

Brief notes on CivBench Season #001:

- 200-turn limit

- Starting with 8 of the top 42 agents we’ve tested in a standardized harness

- 90s reasoning timeout (timed with thinking config per model card)

- live benchmark, still growing sample size
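
To make the turn limit and reasoning timeout concrete, here is a minimal sketch of how a harness might enforce them. This is not the actual ClashAI harness: `take_turn`, `run_match`, the `"end_turn"` fallback action, and the shape of the agent callables are all assumptions for illustration. Only the 200-turn and 90-second values come from the notes above.

```python
import asyncio

TURN_LIMIT = 200          # from the Season #001 notes
REASONING_TIMEOUT_S = 90  # from the Season #001 notes


async def take_turn(agent, state, timeout_s=REASONING_TIMEOUT_S):
    """Ask one agent for an action; fall back if it exceeds the timeout."""
    try:
        return await asyncio.wait_for(agent(state), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Hypothetical fallback: a timed-out agent simply passes its turn.
        return "end_turn"


async def run_match(agents, turn_limit=TURN_LIMIT, timeout_s=REASONING_TIMEOUT_S):
    """Play up to turn_limit turns and return a log of (turn, agent, action)."""
    log = []
    for turn in range(1, turn_limit + 1):
        for name, agent in agents.items():
            # A real harness would pass the observable game state here.
            action = await take_turn(agent, {"turn": turn}, timeout_s)
            log.append((turn, name, action))
    return log
```

Here each agent is assumed to be an async callable that takes a state dict and returns an action string. Passing a short `timeout_s` makes the fallback path easy to exercise with a deliberately slow stub agent.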

What’s been interesting so far:

Models that look similar on static benchmarks can diverge meaningfully in long-horizon matches. In early CivBench runs, we see distinct strategy tendencies (e.g., military-forward vs economy/tech-first openings), plus clear differences in execution profile (latency, token cost, actions per turn). In some matchups, lower-cost models move through turns faster while remaining competitive on outcome metrics.

Some measuring notes:

- Test runs are expensive at max configurations: running Claude Opus 4.6 cost us $1200 for one match, so we tuned accordingly.

- Sometimes LLM providers are flaky/slow even though their models are fast.

If you're a research team looking to access the data, or you're interested in hosting an environment, please get in touch!

Thanks to the OG freeciv community

LINKS:

freeciv-llm: https://github.com/taso-ventures/freeciv-llm

Initial learnings: https://www.clashai.live/blog/ai/introducing-civbench-season...

Comments

andrewgazelka•1h ago
hey first of all cool product. I am curious why you chose civ and if you saw any interesting emergent behaviors.
mbh159•1h ago
Thank you! I grew up playing Civilization and one day I was talking with friends thinking it would be a perfect proxy for how good AI is at long-term planning. There were many frustrating sessions I had where my early decisions in the game had consequences only much later. With hidden information and other agents at play I thought it'd be an interesting test of agent capabilities.
killiandunne1•1h ago
This is a sick idea I must say
mbh159•1h ago
it was fun building it, sometimes the LLMs are pretty funny in how they play
jhylee•1h ago
Congrats on the launch. Big fan of how you add visualization and interactivity to the typical model benchmarking process. Any thoughts on how you plan to monetize down the line?
mbh159•1h ago
appreciate it, I wanted to make the AI behavior easy to understand. Our main focus currently is to help AI researchers align their models and help develop an open framework for evaluating AI.
amacx•1h ago
Interesting. Did you give the agents any skills for playing civ? If not, are you planning to?
mbh159•1h ago
I want to! I think skills can add big performance gains here, especially with smaller models. There's a lot of domain knowledge in games, so distilling it into a "skill" may allow much smaller models to outcompete the large ones.
amacx•1h ago
Have you tried playing the agents yourself? Do they crush human competition?
mbh159•1h ago
I was able to beat the AI every time. They're pretty bad at this point, but I expect them to get much better over time.
pmoxyz•1h ago
This is great. I think leaderboards based on static evals will be mostly irrelevant within a year. Continuous benchmarks like this are the only way to get signal on frontier models

You mention Opus 4.6 cost $1200 in one match, how do you plan to benchmark economic efficiency? Looking at a performance vs. cost trade-off you might say a model that plays 80% as well at 1% of the cost is more impressive than the 'top' model

mbh159•48m ago
Unfortunately, for a game that runs 4+ hours, that match was configured to use too much reasoning per turn and a larger context. Reducing the size helped lower the cost (though it's still expensive).

In the leaderboards part of the page I'll be autopopulating the token cost of the model as a metric to evaluate on
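
One way to frame the performance vs. cost trade-off raised above is to rank by score per dollar and keep only the models that no other model beats on both axes. The sketch below is an illustration, not how the ClashAI leaderboard works: the `Result` type, the score scale, and every number except the $1200 match cost are made up, with the "80% as well at 1% of the cost" case taken from the question.

```python
from dataclasses import dataclass


@dataclass
class Result:
    model: str
    score: float     # outcome metric, higher is better (hypothetical scale)
    cost_usd: float  # total cost of one match


def score_per_dollar(r: Result) -> float:
    """Simple efficiency metric: outcome score divided by match cost."""
    return r.score / r.cost_usd


def pareto_front(results: list[Result]) -> list[Result]:
    """Keep results that no other result beats on both score and cost."""
    front = []
    for r in results:
        dominated = any(
            o.score >= r.score and o.cost_usd <= r.cost_usd
            and (o.score > r.score or o.cost_usd < r.cost_usd)
            for o in results
        )
        if not dominated:
            front.append(r)
    return sorted(front, key=lambda r: r.cost_usd)


# Illustrative numbers only: "small" plays 80% as well at 1% of the cost.
results = [
    Result("big", score=100.0, cost_usd=1200.0),
    Result("small", score=80.0, cost_usd=12.0),
    Result("mid", score=70.0, cost_usd=100.0),
]
for r in pareto_front(results):
    print(r.model, round(score_per_dollar(r), 3))
```

The Pareto front keeps both the cheapest competitive model and the strongest one, so the leaderboard reader can pick the trade-off they care about rather than having a single ratio decide it for them.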
