MTG Bench: Testing how well LLMs can play Magic

https://mtgautodeck.com/articles/mtg-bench/

27•CallumFerg•10h ago

Comments

danbrooks•1h ago

Very cool. I’ve been daydreaming about whether LLMs can be used to reason through gaming decisions.

OsrsNeedsf2P•1h ago

I love obscure benchmarks, and I feel like I can trust their results a lot more. After all, they (probably) weren't benchmaxxed. RuneBench[0] is another good example (how well LLMs can play RuneScape).

[0] https://maxbittker.github.io/runebench/

josh_p•1h ago

I know the author specifically did not use a rules engine in their simulation because of uncertainty about how it would affect the results.

I do still wonder if adapting something like Card Forge for LLM use would result in engaging gameplay with an LLM.

https://github.com/Card-Forge/forge

CallumFerg•1h ago

I actually considered using card forge when I started this. I mostly didn't end up using it because of how much more work it would have been.

But also, with a rules engine, you have to manually go through every step and pass priority after every action.

I think it makes more sense to let an LLM play Magic like a person would. On early turns it is acceptable to say "I play a land and pass" without going through every phase. And you can say "I tap all my land and play this card" without having to use a tool call and agent turn for every land tap.

Also card forge would not let you goldfish a deck. You must have opponents.
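To illustrate the batching idea above, here is a minimal Python sketch of treating "tap my land and play this card" as one action instead of one tool call per land tap. It is not the project's code: `GameState`, `tap_and_cast`, and the simplification that a card's cost is a single generic mana number (no colors, no priority passing) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class GameState:
    """Hypothetical, heavily simplified game state for illustration only."""
    untapped_lands: int
    hand: dict[str, int]  # card name -> total mana cost (colors ignored)
    battlefield: list[str]


def tap_and_cast(state: GameState, card: str) -> bool:
    """Apply 'tap lands and cast this card' as a single batched action.

    Returns True and updates the state if the action is legal under the
    simplified rules above; returns False and leaves the state unchanged
    otherwise.
    """
    cost = state.hand.get(card)
    if cost is None or cost > state.untapped_lands:
        return False
    state.untapped_lands -= cost  # tap exactly as many lands as the cost
    del state.hand[card]
    state.battlefield.append(card)
    return True
```

With this shape, one agent turn covers what a step-by-step rules engine would split into several land taps plus a cast, at the price of skipping the checks a real engine performs.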

jmccaf•1h ago

Awesome! Does this use https://mage-bench.com/, or is it a separate project? I ran 4 local models in a tournament recently with mage-bench on an RTX 5090; Qwen 3.6 27B won narrowly over Gemma 4.

CallumFerg•1h ago

No, I was not aware of that project when I made this.

I'll have to look into that project, but I also have an RTX 5090 and did a lot of testing with Qwen3.6 27B and Gemma 4 31B. I was not able to get them to play legal turns consistently. I had to keep expanding the system prompt and adding rules for edge cases. By the end, the prompt was over 10k tokens, and while the models mostly made legal turns, they did not make good turns. And all the heuristics in the prompt degraded the performance and increased the cost for frontier models.

OwenCR•1h ago

Sadly this benchmark removes the part of MTG that is most interesting: the opponent(s). Without opponents you simply don't have a game. You just have a rules engine - quite boring!

I think I object more to the decks used in testing than the machines' decisions. I do have nitpicks though: This hand is quite poor and should be mulliganed: https://app.mtgautodeck.com/public/benchmarks/4bd9955b-ebe1-.... The poor runout reinforces this decision.

This project is cool though, props for making it!

CallumFerg•1h ago

Admittedly, the mulligan phase system prompt is the weakest part of the project. I had to add heuristics to stop the LLMs from mulliganing down to just a few cards looking for a perfect hand. The scoring for the benchmark is mostly based on whether the LLM could complete legal turns, not good turns.

https://github.com/CallumFerguson/mtg-auto-deck/blob/a877c08...
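The comment above describes heuristics written into the system prompt. As a different technique, the same goal (stopping a model from mulliganing down to a few cards) could be enforced in code with a hard floor on hand size. The Python sketch below is an assumption for illustration, not the project's approach: `resolve_mulligan` and the default floor of 5 cards are hypothetical.

```python
def resolve_mulligan(model_wants_mulligan: bool,
                     hand_size: int,
                     min_hand_size: int = 5) -> str:
    """Decide the final mulligan outcome given the model's request.

    The model's request is honored only while the current hand is still
    larger than the floor. min_hand_size=5 is an arbitrary illustrative
    value, not a number taken from the benchmark.
    """
    if model_wants_mulligan and hand_size > min_hand_size:
        return "mulligan"
    return "keep"
```

A guard like this only limits how far the model can mulligan; it says nothing about whether keeping a given hand is a good decision.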

TZubiri•42m ago

Looking forward to this metric being Goodhart lawed.

Like how the strawberry example was overtrained for, or how the pelican on a bike started being used in official release posts.

gravitronic•23m ago

Magic is complicated. I looked at doing something like this, but the open-ended nature of the game is daunting: one specific card can completely change the rules, or require a series of follow-up events or modifications to the rules engine at hand.

purple-leafy•6m ago

Benchmarks like this are onto something. This is the next frontier of LLM benchmarking.

