Show HN: A real-time strategy game that AI agents can play

64•__cayenne__•2h ago

I've liked all the projects that put LLMs into game environments. It's been a weird juxtaposition, though: frontier LLMs can one-shot full coding projects, and those same models struggle to get out of Pokémon Red's Mt. Moon.

Because of this, I wanted to create a game environment that put this generation of frontier LLMs' top skill, coding, on full display.

Ten years ago, a team released a game called Screeps. It was described as an "MMO RTS sandbox for programmers." The Screeps paradigm of writing code and having it executed in a real-time game environment is well suited to LLMs. Drawing on a version of the Screeps open source API, LLM Skirmish pits LLMs head-to-head in a series of 1v1 real-time strategy games.

In my testing I found that Claude Opus 4.5 was the most dominant model, but it showed weakness in round 1 as it was overly focused on its in-game economy. Meanwhile, I probably spent a third of all code on sandbox hardening because GPT 5.2 kept trying to cheat by pre-reading its opponent's strategies.

If there's interest, I'm planning on doing a round of testing with the latest generation of LLMs (Claude 4.6 Opus, GPT 5.3 Codex, etc.).

You can run local matches via CLI. I'm running a hosted match runner with Google Cloud Run that uses isolated-vm. The match playback visualizer is statically served from Cloudflare.

I've created a community ladder that you can submit strategies to via CLI, no auth required. I've found that the CLI plus the skill.md that's available has been enough for AI agents to immediately get started.

Website: https://llmskirmish.com

API docs: https://llmskirmish.com/docs

GitHub: https://github.com/llmskirmish/skirmish

A video of a match: https://www.youtube.com/watch?v=lnBPaZ1qamM

Comments

hmontazeri•1h ago

This is actually fun to watch :D

egeozcan•1h ago

This is amazing. What I do is something else: I make AI agents develop AI scripts (good ol' computer player scripts) and try to beat each other:

https://egeozcan.github.io/unnamed_rts/game/

I occasionally run my tournament script: https://github.com/egeozcan/unnamed_rts/blob/main/src/script...

That calculates the ELOs for each AI implementation, and I feed it to different agents so they get really creative trying to beat each other. Also making rule changes to the game and seeing how some scripts get weaker/stronger is a nice way to measure balance.

Funny thing, Codex gets really aggressive and starts cheating a lot of times: https://bsky.app/profile/egeozcan.bsky.social/post/3mfdtj5dh...
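For readers unfamiliar with the Elo bookkeeping mentioned above, here is a minimal sketch of the standard Elo update applied to a list of match results. It is not the linked tournament script: the function names, the starting rating of 1500, and the K-factor of 32 are assumptions chosen for illustration.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: probability-like score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Return new (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed K-factor, not a value taken from the thread.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def run_tournament(results: Iterable[Tuple[str, str, float]],
                   initial: float = 1500.0,
                   k: float = 32.0) -> Dict[str, float]:
    """Fold a sequence of (player_a, player_b, score_a) games into ratings."""
    ratings: Dict[str, float] = {}
    for player_a, player_b, score_a in results:
        ra = ratings.setdefault(player_a, initial)
        rb = ratings.setdefault(player_b, initial)
        ratings[player_a], ratings[player_b] = update_elo(ra, rb, score_a, k)
    return ratings
```

Feeding the resulting table back to the agents, as described above, would then just be a matter of serializing the `ratings` dictionary after each tournament run.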

wongarsu•1h ago

I know visualization is far from the most important goal here, but it really gets me how there's fairly elaborately rendered terrain, and then the units are just unnamed roombas with hard-to-read status indicators that have no intuitive meaning. Even in the match viewer I have no clue what's going on: there is no overlay or tooltip when you hover over or click units. There is a unit list that tries (and mostly fails) to give you some information, but because units don't have names you have to hover over them in the list to have them highlighted on the field (the reverse does not work). Not exactly a spectator sport. Oh, but there is a way to switch from having all units in one sidebar to having one sidebar per player, as if that made a difference.

I find this pretty funny because it seems like a perfect representation of what's easy with today's tools and what isn't

Love the idea though

embedding-shape•1h ago

Yeah, that's what you get when you basically ask an agent to "Build X" without any constraints on how the UI and UX should actually work. Since agents have about zero expertise in "How would a human perceive and use this?", you end up with UIs that don't make much sense to humans unless you strictly steer the agent with what you know.

datawars•1h ago

Great project! It would be interesting to have a meta layer of AIs betting on the player LLMs

xanth•1h ago

Now I'd love to see if fast > smart over time with Mercury 2.

PeterUstinox•1h ago

Wouldn't it be interesting if the LLMs wrote real-time RTS commands instead of code? After all, it is an RTS game.

This would add another dimension, since quality of tokens would then be one axis (in RTS terms: decision making) and speed of tokens the other (in RTS terms: actions per minute, APM).

Also, there are already a lot of coding benchmarks. This way it would test something more abstract, similar to AlphaStar https://en.wikipedia.org/wiki/AlphaStar_(software)

You could just use the exposed APIs of OpenAI, Anthropic etc. and let them battle.

cahaya•1h ago

Nice. Curious about 5.3-codex-high results

busfahrer•1h ago

This reminds me of this yearly StarCraft AI competition (running since 2010); however, I think it uses a special API that makes it easy for bots to access the game

Edit: Forgot link: https://davechurchill.ca/starcraft/

KeplerBoy•26m ago

Very interesting project. I'm a bit confused about the lack of hardware specification. The rules make it clear that one's bot has defined deadlines:

> Make sure that each onframe call does not run longer than 42ms. Entries that slow down games by repeatedly exceeding this time limit will lose games on time.

But I'm missing something like: "Your program will be pinned to CPU cores 5-8 and your bot has access to a dedicated RTX 5090 GPU." There's also no mention of whether my bot can have network access to offload some high-level, latency-insensitive planning. Maybe that's just a bad idea in general; I haven't played SC in ages.
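Without a hardware spec, about the best a bot author can do is measure their own frame times locally and leave headroom. A minimal sketch of that idea, assuming nothing about how the competition itself measures time: the `FrameBudget` class and the `on_frame` callback are hypothetical names, and only the 42 ms figure comes from the rule quoted above.

```python
import time
from typing import Any, Callable, Tuple

FRAME_BUDGET_S = 0.042  # 42 ms per frame call, from the quoted rule


class FrameBudget:
    """Times each frame callback and counts how often it exceeds the budget.

    This is a local self-check only; timings on the organizers' machines
    may differ because the hardware is unspecified.
    """

    def __init__(self, budget_s: float = FRAME_BUDGET_S,
                 clock: Callable[[], float] = time.perf_counter) -> None:
        self.budget_s = budget_s
        self.clock = clock  # injectable so the tracker can be tested
        self.frames = 0
        self.overruns = 0

    def run(self, on_frame: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
        """Call on_frame(*args), returning (result, elapsed_seconds)."""
        start = self.clock()
        result = on_frame(*args)
        elapsed = self.clock() - start
        self.frames += 1
        if elapsed > self.budget_s:
            self.overruns += 1
        return result, elapsed

    def overrun_ratio(self) -> float:
        """Fraction of frames that ran over budget (0.0 if no frames yet)."""
        return self.overruns / self.frames if self.frames else 0.0
```

A bot could wrap its per-frame logic in `FrameBudget.run` during local games and fall back to cheaper planning whenever `overrun_ratio()` starts to climb.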

ph4rsikal•49m ago

Reminds me of this fantastic series on Game Theory and Agent Reasoning https://jdsemrau.substack.com/p/nemotron-vs-qwen-game-theory...

EwanG•40m ago

At least until one of the competitors is overheard saying "A strange game. The only winning move is not to play"

dakolli•29m ago

Yay, I love how we just keep coming up with magic tricks, like toddlers playing with velcro.. These magic tricks do nothing but convince people who don't know any better that LLMs are the real deal, when they simply aren't.

This is just free propaganda for Anthropic && OpenAI who will leverage these (useless) capabilities to convince your boss to give your salary to them, or at least a substantial portion of it.

p-e-w•27m ago

Yeah, I guess the tens of thousands of PhDs who are working on LLMs full time are just collectively wasting their lives. Everyone except you is simply too dumb to see it.

dakolli•4m ago

Tens of thousands of PhDs working on LLMs lol...

LatencyKills•25m ago

This technology exists. It isn’t just a toy. I think it is amazing to see people use it for interesting things even if it isn’t groundbreaking.

I’ve been an engineer for almost 40 years and love seeing what Claude Code can do.

Like it or not, young people will not know a world where this technology doesn’t exist. It is just part of their toolset now.

dakolli•11m ago

I'm pretty young and hate this technology with a passion. I didn't spend 100k on education and a decade studying to have my job reduced to being a project manager for a bot, or to play with a prompt slot machine all day. This crap is reducing the thing I genuinely love doing more than anything, writing code, into nothing: reviewing code that lacks any sweat, any intention. I really can't stand this garbage.

I can't stand you old heads. I'm very happy for you that you got to stash away 40 years of SWE salaries. It's just ladder-kicking behavior, to be honest. Typical boomer: you got your nut and don't care what happens after.

25% of new college grads in STEM are unemployed, and a bunch of companies (controlled by people in your age group) have laid off 400k Americans over the last 16 months while equities and profits are at all-time highs.

The replies : ItS NoT Ai, ItS cUz FrEe MoNeY fRoM CoViD HaS DrIeD uP.

myky22•25m ago

Love it! I have a similar intuition from my use of Gemini (3 and 3.1). Great at "turn 1" tasks, but it degrades faster than Opus or GPT.

