I've enjoyed all the projects that put LLMs into game environments. It's been a weird juxtaposition, though: frontier LLMs can one-shot full coding projects, yet those same models struggle to get out of Pokémon Red's Mt. Moon.
Because of this, I wanted to create a game environment that puts frontier LLMs' strongest skill, coding, on full display.
Ten years ago, a team released a game called Screeps, described as an "MMO RTS sandbox for programmers." The Screeps paradigm of writing code that gets executed in a real-time game environment is well suited to LLMs. Drawing on a version of the open-source Screeps API, LLM Skirmish pits LLMs head-to-head in a series of 1v1 real-time strategy games.
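For anyone who hasn't seen Screeps: a strategy is a script the engine calls every game tick to issue orders to your units. Here's a minimal sketch in TypeScript using names from the original Screeps API (Game, creep.harvest, creep.moveTo); LLM Skirmish's actual surface may differ, so check the docs linked below:

    // Ambient globals supplied by the game runtime (typed loosely for this sketch).
    declare const Game: { creeps: Record<string, any> };
    declare const FIND_SOURCES: number;
    declare const ERR_NOT_IN_RANGE: number;

    // Called once per tick: send every worker unit to the nearest energy source.
    export function loop(): void {
      for (const name in Game.creeps) {
        const creep = Game.creeps[name];
        const source = creep.pos.findClosestByPath(FIND_SOURCES);
        if (source && creep.harvest(source) === ERR_NOT_IN_RANGE) {
          creep.moveTo(source); // not adjacent yet, so path toward the source
        }
      }
    }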
In my testing, Claude Opus 4.5 was the dominant model, though it showed weakness in round 1 by over-focusing on its in-game economy. Meanwhile, roughly a third of the codebase went to sandbox hardening, because GPT 5.2 kept trying to cheat by pre-reading its opponent's strategies.
If there's interest, I'm planning a round of testing with the latest generation of LLMs (Claude Opus 4.6, GPT 5.3 Codex, etc.).
You can run local matches via the CLI. I'm running a hosted match runner on Google Cloud Run that sandboxes submitted code with isolated-vm; the match playback visualizer is served statically from Cloudflare.
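To give a sense of why isolated-vm matters here, this is the general pattern it enables (a sketch of the approach, not the project's actual runner code): each strategy runs in its own V8 isolate with hard memory and wall-clock limits, so player code can't read the host's state or the other player's files.

    import ivm from 'isolated-vm';

    // Run one tick of untrusted player code inside a dedicated V8 isolate.
    // The limits below are illustrative, not the project's real values.
    async function runTick(untrustedCode: string): Promise<void> {
      const isolate = new ivm.Isolate({ memoryLimit: 128 }); // heap cap in MB
      try {
        const context = await isolate.createContext();       // fresh, empty global scope
        const script = await isolate.compileScript(untrustedCode);
        await script.run(context, { timeout: 50 });          // wall-clock cap in ms
      } finally {
        isolate.dispose();                                    // always reclaim the isolate
      }
    }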
There's also a community ladder you can submit strategies to via the CLI, no auth required. In my experience, the CLI plus the published skill.md has been enough for AI agents to get started immediately.
Website: https://llmskirmish.com
API docs: https://llmskirmish.com/docs
GitHub: https://github.com/llmskirmish/skirmish
A video of a match: https://www.youtube.com/watch?v=lnBPaZ1qamM