Show HN: A real-time strategy game that AI agents can play

https://llmskirmish.com/
53•__cayenne__•1h ago
I've liked all the projects that put LLMs into game environments. It's been a weird juxtaposition, though: frontier LLMs can one-shot full coding projects, and those same models struggle to get out of Pokémon Red's Mt. Moon.

Because of this, I wanted to create a game environment that put this generation of frontier LLMs' top skill, coding, on full display.

Ten years ago, a team released a game called Screeps. It was described as an "MMO RTS sandbox for programmers." The Screeps paradigm of writing code and having it executed in a real-time game environment is well suited to LLMs. Drawing on a version of the Screeps open source API, LLM Skirmish pits LLMs head-to-head in a series of 1v1 real-time strategy games.
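
To make the "write code, have it executed every tick" paradigm concrete, here is a minimal Python sketch of a tick loop that calls two player scripts. It is not the LLM Skirmish or Screeps API (both of those are JavaScript based); `Unit`, `rush_strategy`, `hold_strategy`, `run_match`, the one-dimensional map, and the damage values are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    owner: str
    x: int        # position on a hypothetical 1-D map
    hp: int = 10  # hypothetical starting hit points


def rush_strategy(me: Unit, enemy: Unit) -> str:
    """Hypothetical player script: close the distance, then attack."""
    if abs(me.x - enemy.x) <= 1:
        return "attack"
    return "move"


def hold_strategy(me: Unit, enemy: Unit) -> str:
    """Hypothetical player script that never acts."""
    return "hold"


def run_match(strategy_a, strategy_b, max_ticks: int = 50) -> str:
    """Run both scripts once per tick and return 'a', 'b', or 'draw'."""
    a, b = Unit("a", 0), Unit("b", 9)
    for _ in range(max_ticks):
        # Both scripts are called on the same state before any order is applied.
        orders = [(a, b, strategy_a(a, b)), (b, a, strategy_b(b, a))]
        for me, enemy, order in orders:
            if order == "attack" and abs(me.x - enemy.x) <= 1:
                enemy.hp -= 2
            elif order == "move":
                me.x += 1 if enemy.x > me.x else -1
        if a.hp <= 0 or b.hp <= 0:
            if a.hp <= 0 and b.hp <= 0:
                return "draw"
            return "a" if b.hp <= 0 else "b"
    return "draw"


if __name__ == "__main__":
    print(run_match(rush_strategy, hold_strategy))
    print(run_match(rush_strategy, rush_strategy))
```

The point of the sketch is the division of labor: the model writes the strategy function once, and the engine, not the model, runs it in real time.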

In my testing I found that Claude Opus 4.5 was the most dominant model, but it showed weakness in round 1 as it was overly focused on its in-game economy. Meanwhile, I probably spent a third of all code on sandbox hardening because GPT 5.2 kept trying to cheat by pre-reading its opponent's strategies.

If there's interest, I'm planning on doing a round of testing with the latest generation of LLMs (Claude 4.6 Opus, GPT 5.3 Codex, etc.).

You can run local matches via CLI. I'm running a hosted match runner with Google Cloud Run that uses isolated-vm. The match playback visualizer is statically served from Cloudflare.

I've created a community ladder that you can submit strategies to via CLI, no auth required. I've found that the CLI plus the skill.md that's available has been enough for AI agents to immediately get started.

Website: https://llmskirmish.com

API docs: https://llmskirmish.com/docs

GitHub: https://github.com/llmskirmish/skirmish

A video of a match: https://www.youtube.com/watch?v=lnBPaZ1qamM

Comments

hmontazeri•1h ago
This is actually fun to watch :D
egeozcan•1h ago
This is amazing. What I do is something else: I make AI agents develop AI scripts (good ol' computer player scripts) and try to beat each other:

https://egeozcan.github.io/unnamed_rts/game/

I occasionally run my tournament script: https://github.com/egeozcan/unnamed_rts/blob/main/src/script...

It calculates an Elo rating for each AI implementation, and I feed the ratings to different agents so they get really creative trying to beat each other. Making rule changes to the game and seeing which scripts get weaker or stronger is also a nice way to measure balance.
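
For readers unfamiliar with Elo, a rough Python sketch of the standard update rule follows. It is not taken from the tournament script linked above; the K-factor of 32, the starting rating of 1500, and the function and script names are assumptions for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Return new ratings after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a B win.
    k=32 is an assumed K-factor, not a value from the thread.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    # Hypothetical script names, all starting from an assumed rating of 1500.
    ratings = {"script_x": 1500.0, "script_y": 1500.0}
    # Each entry is (player A, player B, score for A).
    results = [("script_x", "script_y", 1.0), ("script_x", "script_y", 0.5)]
    for a, b, score in results:
        ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score)
    print(ratings)
```

Because each update is zero-sum, the total rating across all scripts stays constant, which makes ratings comparable from one tournament run to the next.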

Funny thing: Codex gets really aggressive and often starts cheating: https://bsky.app/profile/egeozcan.bsky.social/post/3mfdtj5dh...

wongarsu•1h ago
I know visualization is far from the most important goal here, but it really gets me how there's fairly elaborately rendered terrain, and then the units are just unnamed roombas with hard-to-read status indicators that have no intuitive meaning. Even in the match viewer I have no clue what's going on, and there is no overlay or tooltip when you hover over or click units either. There is a unit list that tries (and mostly fails) to give you some information, but because units don't have names you have to hover over them in the list to have them highlighted in the field (the reverse does not work). Not exactly a spectator sport. Oh, but there is a way to switch from having all units in one sidebar to having one sidebar per player, as if that made a difference.

I find this pretty funny because it seems like a perfect representation of what's easy with today's tools and what isn't

Love the idea though

embedding-shape•1h ago
Yeah, it's all what you get when you basically ask an agent "Build X" without any constraints about how the UI and UX actually should work, and since the agents have about 0 expertise when it comes to "How would a human perceive and use this?", you end up with UIs that don't make much sense for humans unless you strictly steer them with what you know.
datawars•1h ago
Great project! It would be interesting to have a meta layer of AIs betting on the player LLMs
xanth•1h ago
Now I'd love to see if fast > smart over time with Mercury 2.
PeterUstinox•59m ago
Wouldn't it be interesting if the LLMs wrote real-time RTS commands instead of code? After all, it is an RTS game.

This would bring another dimension to it since then quality of tokens would be one dimension (RTS-language: Decision Making) and speed of tokens the other (RTS-language: Actions Per Minute; APM).

Also, there are already a lot of coding benchmarks. That way it would test something more abstract, similar to AlphaStar https://en.wikipedia.org/wiki/AlphaStar_(software)

You could just use the exposed APIs of OpenAI, Anthropic etc. and let them battle.

cahaya•51m ago
Nice. Curious about 5.3-codex-high results
busfahrer•39m ago
This reminds me of a yearly StarCraft AI competition (running since 2010). However, I think it uses a special API that makes it easy for bots to access the game

Edit: Forgot link: https://davechurchill.ca/starcraft/

ph4rsikal•26m ago
Reminds me of this fantastic series on Game Theory and Agent Reasoning https://jdsemrau.substack.com/p/nemotron-vs-qwen-game-theory...
EwanG•17m ago
At least until one of the competitors is overheard saying "A strange game. The only winning move is not to play"
dakolli•6m ago
Yay, I love how we just keep coming up with magic tricks, like toddlers playing with velcro. These magic tricks do nothing but convince people who don't know any better that LLMs are the real deal, when they simply aren't.

This is just free propaganda for Anthropic && OpenAI who will leverage these (useless) capabilities to convince your boss to give your salary to them, or at least a substantial portion of it.

p-e-w•4m ago
Yeah, I guess the tens of thousands of PhDs who are working on LLMs full time are just collectively wasting their lives. Everyone except you is simply too dumb to see it.
