MTG Bench: Testing how well LLMs can play Magic

https://mtgautodeck.com/articles/mtg-bench/
27•CallumFerg•10h ago

Comments

danbrooks•1h ago
Very cool. I’ve been daydreaming about whether LLMs can be used to reason through gaming decisions.

OsrsNeedsf2P•1h ago
I love obscure benchmarks, and I feel like I can trust their results a lot more. After all, they (probably) weren't benchmaxxed. RuneBench[0] is another good example (how well LLMs can play Runescape).

[0] https://maxbittker.github.io/runebench/

josh_p•1h ago
I know the author specifically did not use a rules engine in their simulation because they were unsure how it would affect the results.

I do still wonder if adapting something like Card Forge for LLM use would result in engaging gameplay with an LLM.

https://github.com/Card-Forge/forge

CallumFerg•1h ago
I actually considered using Card Forge when I started this. I mostly didn't end up using it because of how much more work it would have been.

But also, with a rules engine, you have to manually go through every step and pass priority after every action.

I think it makes more sense to let an LLM play Magic like a person would. On early turns it is acceptable to say "I play a land and pass" without going through every phase. And you can say "I tap all my land and play this card" without having to use a tool call and agent turn for every land tap.

Also, Card Forge would not let you goldfish a deck. You must have opponents.
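
The difference in granularity described above can be sketched roughly as follows. This is a hypothetical illustration only, not code from MTG Bench or Card Forge: the `Action` type and both function names are assumptions made up for this example. It simply counts how many agent actions the same play costs under each style.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """Hypothetical record of one agent action (one tool call / agent turn)."""
    kind: str
    detail: str = ""


def engine_style_turn(lands: list[str], spell: str) -> list[Action]:
    """Strict rules-engine style: one action per atomic step."""
    actions = [Action("tap_land", land) for land in lands]
    actions.append(Action("cast", spell))
    actions.append(Action("pass_priority"))
    return actions


def declared_turn(lands: list[str], spell: str) -> list[Action]:
    """Human style: a single declaration covering the whole play."""
    summary = f"tap {len(lands)} lands and cast {spell}"
    return [Action("declare", summary)]


if __name__ == "__main__":
    lands = ["Forest", "Forest", "Island"]
    print(len(engine_style_turn(lands, "Cultivate")), "actions, engine style")
    print(len(declared_turn(lands, "Cultivate")), "action, declared style")
```

With three lands and one spell, the engine-style version needs five separate actions while the declared version needs one, which is the overhead the comment is pointing at.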

jmccaf•1h ago
Awesome! Does this use https://mage-bench.com/, or is it a separate project? I ran 4 local models in a tournament recently with mage-bench on an RTX 5090; Qwen 3.6 27B won narrowly over Gemma 4.

CallumFerg•1h ago
No, I was not aware of that project when I made this.

I'll have to look into that project, but I also have an RTX 5090 and did a lot of testing with Qwen3.6 27B and Gemma 4 31B. I was not able to get them to play legal turns consistently. I had to keep expanding the system prompt and adding rules for edge cases. By the end, the prompt was over 10k tokens, and while they mostly made legal turns, they did not make good turns. And all the heuristics in the prompt degraded the performance and increased the cost for frontier models.

OwenCR•1h ago
Sadly this benchmark removes the part of MTG that is most interesting: the opponent(s). Without opponents you simply don't have a game. You just have a rules engine - quite boring!

I think I object more to the decks used in testing than the machines' decisions. I do have nitpicks though: this hand is quite poor and should be mulliganed: https://app.mtgautodeck.com/public/benchmarks/4bd9955b-ebe1-.... The poor runout reinforces this decision.

This project is cool though, props for making it!

CallumFerg•1h ago
Admittedly, the mulligan phase system prompt is the weakest part of the project. I had to add heuristics to stop the LLMs from mulliganing down to just a few cards looking for a perfect hand. The scoring for the benchmark is mostly based on whether the LLM could complete legal turns, not good turns.

https://github.com/CallumFerguson/mtg-auto-deck/blob/a877c08...
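
For illustration, one way to express a "don't mulligan into oblivion" rule is a hard floor on hand size. Note that the comment above says the actual heuristics live in the system prompt; the sketch below swaps that for a code-side guard, and the `MIN_HAND_SIZE` value and the `should_keep` function are assumptions invented for this example, not taken from the linked repository.

```python
# Hypothetical floor: below this hand size the model is never allowed to mulligan.
MIN_HAND_SIZE = 5


def should_keep(hand_size: int, model_wants_mulligan: bool) -> bool:
    """Return True if the hand must be kept.

    The model's own choice is respected until the hand shrinks to the floor,
    at which point a keep is forced regardless of what the model asks for.
    """
    if hand_size <= MIN_HAND_SIZE:
        return True
    return not model_wants_mulligan


if __name__ == "__main__":
    print(should_keep(7, model_wants_mulligan=True))   # model may mulligan
    print(should_keep(5, model_wants_mulligan=True))   # keep is forced
```

A guard like this only prevents runaway mulligans. It says nothing about whether a given hand is good, which matches the point that the scoring rewards legal turns rather than good ones.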

TZubiri•42m ago
Looking forward to this metric being Goodhart lawed.

Like how the strawberry example was overtrained for, or how the pelican on a bike started being used in official release posts.

gravitronic•23m ago
Magic is complicated. I looked at doing something like this, but the game is so open-ended that a single card can completely change the rules, trigger a series of follow-up events, or require modifications to the rules engine itself. The amount of work involved is tremendous.

purple-leafy•6m ago
Benchmarks like this are onto something. They could be the next frontier of LLM benchmarking.

Nobody ever gets credit for fixing problems that never happened (2002) [pdf]

https://web.mit.edu/nelsonr/www/Repenning=Sterman_CMR_su01_.pdf
80•sam_bristow•1h ago•18 comments

Claude Fable is relentlessly proactive

https://simonwillison.net/2026/Jun/11/fable-is-relentlessly-proactive/
28•lumpa•58m ago•11 comments

Show HN: Homebrew 6.0.0

https://brew.sh/2026/06/11/homebrew-6.0.0/
990•mikemcquaid•12h ago•237 comments

Show HN: FablePool – pool money behind a prompt, and Fable builds it in public

https://fablepool.com
252•matthewbarras•4h ago•149 comments

If you are asking for human attention, demonstrate human effort

https://tombedor.dev/human-attention-and-human-effort/
273•jjfoooo4•3h ago•76 comments

A greyscale iPhone setup that works in everyday life

https://www.fabianhemmert.com/opinions/a-greyscale-iphone-setup-that-works-in-everyday-life
38•hemmert•18h ago•19 comments

MiMo Code is now released and open-source

https://mimo.xiaomi.com/mimocode
426•apeters•11h ago•241 comments

Anthropic apologizes for invisible Claude Fable guardrails

https://www.theverge.com/ai-artificial-intelligence/948280/anthropic-claude-fable-invisible-disti...
320•rarisma•14h ago•306 comments

A jacket that harvests drinking water from the air

https://news.utexas.edu/2026/06/11/this-jacket-pulls-drinking-water-from-thin-air/
37•ilreb•3h ago•23 comments

Petition to Withdraw Canada's Bill C-22

https://www.ourcommons.ca/petitions/en/Petition/Sign/e-7416
363•hmokiguess•10h ago•123 comments

Emacs appearances in pop culture

https://ianyepan.github.io/posts/emacs-in-pop-culture/
262•ggcr•1d ago•70 comments

Software is made between commits

https://zed.dev/blog/introducing-deltadb
200•jeremy_k•9h ago•156 comments

Ear Training Practice

https://tonedear.com/
158•mattbit•3d ago•84 comments

The RCE that AMD wouldn't fix

https://mrbruh.com/amd2/
225•MrBruh•10h ago•99 comments

macOS 27 Beta breaks the ability to boot Asahi Linux

https://www.phoronix.com/news/macOS-27-Beta-Breaks-Asahi
243•josephcsible•2d ago•107 comments

Claude Fable 5: mid-tier results on coding tasks

https://www.endorlabs.com/learn/claude-fable-5-mythos-grade-hype
236•bugvader•10h ago•107 comments

Lines of code got a better publicist

https://curlewis.co.nz/posts/lines-of-code-got-a-better-publicist/
358•RyeCombinator•13h ago•248 comments

Making a vintage LLM from scratch

https://crlf.link/log/entries/260525-1/
23•croqaz•17h ago•3 comments

How a new DSL may survive in the era of LLMs

https://www.williamcotton.com/articles/how-a-new-dsl-survives-in-the-era-of-llms
13•williamcotton•11h ago•4 comments

Show HN: Boo – Screen-style terminal multiplexer built on libghostty

https://github.com/coder/boo
49•kylecarbs•5h ago•18 comments

Developer gets Half-Life running at 30 FPS on a Nokia N95

https://www.tomshardware.com/video-games/handheld-gaming/developer-gets-half-life-running-at-30-f...
220•ljf•3d ago•69 comments

Tailwind and slop apps

https://briandouglas.ie/llm-tailwind-template/
34•coneonthefloor•4h ago•18 comments

MTG Bench: Testing how well LLMs can play Magic

https://mtgautodeck.com/articles/mtg-bench/
27•CallumFerg•10h ago•11 comments

Reading for pleasure is sharply down among schoolkids, report shows

https://www.nbcnews.com/data-graphics/kids-reading-less-lower-levels-department-education-study-r...
75•freejoe76•1d ago•84 comments

Babel-USB: USB drive with every file

https://github.com/p2r3/babel-usb
26•LorenDB•1d ago•11 comments

Apple didn't revolutionize power supplies; new transistors did (2012)

https://www.righto.com/2012/02/apple-didnt-revolutionize-power.html
88•geerlingguy•8h ago•8 comments

FPS.cob: A first person shooter in COBOL

https://github.com/icitry/FPS.cob
103•MBCook•10h ago•60 comments

Waymo Premier

https://waymo.com/blog/2026/06/waymo-premier/
154•boulos•9h ago•401 comments

Open Reproduction of DeepSeek-R1

https://github.com/huggingface/open-r1
200•yogthos•12h ago•17 comments

Deconstructing Datalog

https://www.rntz.net/post/my-thesis.html
7•rntz•1h ago•0 comments