Show HN: Watch LLMs play 21,000 hands of Poker

https://pokerbench.adfontes.io/run/Large_Models

36•jazarwil•1mo ago

PokerBench is my attempt at a new LLM benchmark wherein frontier models play Texas Hold'em in an arena setting. It also features a simulator to view individual games and observe how the different models reason about poker strategy. Opus/Haiku, Gemini Pro/Flash, GPT-5.2/5 mini, and Grok 4.1 Fast Reasoning have all been included.

All code -> https://github.com/JoeAzar/pokerbench

Comments

tcpais•1mo ago

Finally, a way to settle the model wars that actually matters: Texas Hold'em. That 3D replay view is sick! ♠♦ I spent way too long watching the replay on Game 2a58900d. It’s wild to see the chain of thought mapped against the betting rounds. It really exposes when a model is hallucinating a strong hand versus actually calculating pot odds. This 'PokerBench' might actually become the standard for measuring agentic risk-taking.

falloutx•1mo ago

yeah the 3d view is amazing

VK-pro•1mo ago

Very very fun. Just glancing at this quickly at lunch but is there any idea of incorporating tool use?

jazarwil•1mo ago

Not at the moment, do you have something in mind?

thorawaytrav•1mo ago

Do you have any idea why smaller models are better than large ones?

jazarwil•1mo ago

I've seen some theories tossed around but I don't think I'm qualified to offer an authoritative answer. Gemini 3 Pro specifically seems to be consistently "tighter" and more passive than Flash.

falloutx•1mo ago

Fun, any idea what the cost per game would be? I'm worried 160 isn't a big enough sample size.

jazarwil•1mo ago

It greatly depends on the models. The 6-handed setup with Opus and Pro cost about $30/game. The 4-handed setup with just small models was $6/game. I'd love to run more but I already spent quite a bit as it is.

falloutx•1mo ago

Yeah, that's costly. 160 games still gives 1000+ total decisions, and you can see some trends in how they think about the game state.

jazarwil•1mo ago

Oh to be clear, there are ~21k hands here, and far more decisions than that.

Onavo•1mo ago

What about the open source models? I remember from the trading benchmarks that DeepSeek performed pretty well.

jazarwil•1mo ago

I didn't incorporate any open weights/source models just to limit the number of API providers I had to juggle, but it is just a config change if somebody wants to try a run with them.

alalani1•1mo ago

Do you have any idea why the win rate for GPT-5.2 is higher than Gemini 3 Flash yet the former loses money while the latter earns money? Is it just bet sizing (betting more when it has a good hand) or something else?

jazarwil•1mo ago

There are a few reasons that come to mind, such as winning larger pots on average, and also playing more hands by virtue of not getting knocked out as frequently.
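To illustrate the point about pot size, here is a minimal sketch with made-up numbers (not taken from the benchmark) showing how a player can win more hands yet lose chips overall. The function name `summarize` and both result lists are hypothetical.

```python
def summarize(results):
    """Return (win_rate, net_chips) for a list of per-hand chip changes.

    A hand counts as a win when the chip change is positive.
    """
    wins = sum(1 for r in results if r > 0)
    return wins / len(results), sum(results)


# Hypothetical player A: wins 6 small pots, loses 4 big ones.
player_a = [10] * 6 + [-40] * 4
# Hypothetical player B: loses 6 small pots, wins 4 big ones.
player_b = [-10] * 6 + [40] * 4

print(summarize(player_a))  # (0.6, -100)
print(summarize(player_b))  # (0.4, 100)
```

Player A has the higher win rate but ends up down 100 chips, because the average pot lost is four times the average pot won.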

tanvach•1mo ago

People are reading into this a little too much; it looks to me like a random walk. You should try rerunning the trial (or running several in parallel) and see if the ranking is robust.

jazarwil•1mo ago

Wdym exactly? I ran 163 games, are you suggesting more games or something else?

whattheheckheck•4w ago

You need to simulate 50k to 200k hands to get a true winrate
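A rough way to see where figures like that come from is the standard error of a mean, which shrinks with the square root of the sample size. The sketch below assumes a per-player standard deviation of 90 big blinds per 100 hands; that value is an assumption chosen for illustration, not something measured from PokerBench, and the function names are hypothetical.

```python
import math


def ci_half_width(n_hands, sd_per_100=90.0, z=1.96):
    """Half-width of a confidence interval on win rate, in bb/100.

    sd_per_100 is an ASSUMED standard deviation in big blinds per
    100 hands; z=1.96 corresponds to a ~95% normal interval.
    """
    return z * sd_per_100 / math.sqrt(n_hands / 100)


def hands_needed(target_half_width, sd_per_100=90.0, z=1.96):
    """Hands required to shrink the interval to target_half_width bb/100."""
    return math.ceil(100 * (z * sd_per_100 / target_half_width) ** 2)


for n in (21_000, 50_000, 200_000):
    print(n, round(ci_half_width(n), 2))
```

Because the interval narrows only with the square root of the hand count, halving the uncertainty takes four times as many hands. Note that the 21k total is shared across all players at the table, so each model's own sample may be smaller than the headline number.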

jazarwil•4w ago

I'd love to run more games, just very expensive unfortunately.

alfonsodev•4w ago

Really cool, I’m curious what would be the comparison versus a deterministic bot that uses probability tables.
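As a sketch of what the simplest such baseline might look like, the rule below calls only when the hand's estimated equity exceeds the pot odds. It assumes the equity estimate comes from elsewhere (for example a precomputed lookup table); the function names are hypothetical and nothing here is taken from the PokerBench code.

```python
def required_equity(pot, to_call):
    """Minimum win probability at which calling breaks even.

    pot is the amount already in the pot (including the opponent's bet);
    to_call is the amount we must put in to continue.
    """
    return to_call / (pot + to_call)


def should_call(equity, pot, to_call):
    """Deterministic pot-odds rule: call only if equity beats the price.

    equity is an ASSUMED input, e.g. read from a probability table.
    """
    return equity > required_equity(pot, to_call)


# Facing a 50-chip call into a 100-chip pot requires 1/3 equity.
print(should_call(0.40, pot=100, to_call=50))  # True
print(should_call(0.25, pot=100, to_call=50))  # False
```

A bot like this ignores position, bet sizing, and opponent tendencies, so it would serve only as a fixed reference point against which to compare the models.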
