Advancing AI Benchmarking with Game Arena

https://blog.google/innovation-and-ai/models-and-research/google-deepmind/kaggle-game-arena-updates/

34•salkahfi•1h ago

Comments

eamag•1h ago

Curious why they decided to curate poker hands instead of running normal poker games.

qsort•1h ago

Poker has very high variance; you'd need several hundred thousand hands to confidently say who's better. Also, you probably want to precompute the GTO play for benchmarking purposes.
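A rough back-of-the-envelope sketch of that sample-size claim, in Python. The numbers are assumptions, not from the post: a standard deviation of about 100 big blinds per 100 hands is a commonly quoted ballpark for no-limit hold'em, and `hands_needed` is a hypothetical helper that uses a plain normal approximation.

```python
import math


def hands_needed(edge_bb_per_100: float, stdev_bb_per_100: float, z: float = 1.96) -> int:
    """Hands required before the observed win rate is about z standard errors from zero.

    Assumption: results of 100-hand blocks are independent with the given
    standard deviation, so the standard error after n blocks is stdev / sqrt(n).
    Solving z * stdev / sqrt(n) = edge for n gives n = (z * stdev / edge) ** 2.
    """
    if edge_bb_per_100 <= 0:
        raise ValueError("edge must be positive")
    blocks = (z * stdev_bb_per_100 / edge_bb_per_100) ** 2
    return math.ceil(blocks * 100)  # convert 100-hand blocks to hands


if __name__ == "__main__":
    # Illustrative edges in big blinds per 100 hands (assumed values).
    for edge in (10.0, 5.0, 2.5):
        print(f"edge {edge} bb/100 -> {hands_needed(edge, 100.0):,} hands")
```

Under these assumptions a 5 bb/100 edge needs on the order of 150,000 hands, and halving the edge roughly quadruples the requirement, which is why small skill gaps push the count into the hundreds of thousands.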

eamag•50m ago

But now, because the hands are so strong, we don't see any folds.

johndhi•48m ago

But can't computers easily play several hundred thousand poker hands in a couple of hours?

tiahura•1h ago

How about nethack?

chaostheory•1h ago

Anecdotal data point, but recently I’ve found Gemini to perform better than ChatGPT when it came to intent analysis.

ofirpress•1h ago

This is a good way to benchmark models. We [the SWE-bench team] took the meta-version of this and implemented it as a new benchmark called CodeClash.

We have agents implement agents that play games against each other, so Claude isn't playing against GPT; instead, an agent written by Claude plays poker against an agent written by GPT. This really tough task leads to very interesting findings on AI for coding.

https://codeclash.ai/

riku_iki•50m ago

The leaderboard looks very outdated.

Instantnoodl•23m ago

Cool to see Core War! I feel it's mostly forgotten by now. My dad still plays it to this day, though, and even attends tournaments.

63stack•9m ago

>this really tough task leads to very interesting findings on AI for coding

Are you going to share those with the class or?

cv5005•1h ago

My personal threshold for AGI is when an AI can 'sit down' - it doesn't need to have robotic hands, but it needs to only use visual and audio inputs to make its moves - and complete a modern RPG or FPS single player game that it hasn't pre-trained on (it can train on older games).

bob1029•9m ago

https://arxiv.org/abs/2507.03793

10xDev•40m ago

If AI can program, why does it matter if it can play Chess using CoT when it can program a Chess Engine instead? This applies to other domains as well.

Davidzheng•26m ago

They should be allowed to! In fact, I think a better benchmark would be to invent new games and test the models' ability to allocate compute to minimax/AlphaZero-style search on those games under compute constraints.
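As a toy illustration of search under a compute cap (an assumed setup, not anything from Game Arena): the sketch below runs plain negamax on a made-up subtraction game (take 1 to 3 stones, whoever takes the last stone wins) and gives up once a hypothetical node budget is spent. `best_move`, `BudgetExceeded`, and the game itself are illustrative names chosen here.

```python
from typing import List, Optional


class BudgetExceeded(Exception):
    """Raised when the search has visited more nodes than allowed."""


def negamax(stones: int, budget: List[int]) -> int:
    """Return +1 if the player to move can force a win, -1 otherwise."""
    if budget[0] <= 0:
        raise BudgetExceeded
    budget[0] -= 1  # count this node against the shared budget
    if stones == 0:
        return -1  # the opponent just took the last stone, so we lost
    best = -1
    for take in (1, 2, 3):
        if take <= stones:
            best = max(best, -negamax(stones - take, budget))
            if best == 1:
                break  # a winning move exists, no need to look further
    return best


def best_move(stones: int, node_budget: int) -> Optional[int]:
    """Return a move, or None if the node budget ran out first.

    If no winning move exists, falls back to taking a single stone.
    """
    if stones < 1:
        raise ValueError("need at least one stone")
    budget = [node_budget]
    try:
        for take in (1, 2, 3):
            if take <= stones and -negamax(stones - take, budget) == 1:
                return take
    except BudgetExceeded:
        return None
    return 1


if __name__ == "__main__":
    print(best_move(7, node_budget=10_000))
    print(best_move(30, node_budget=10))
```

With a generous budget the search finds the move that leaves a multiple of four stones; with a tiny budget it returns `None`, which is where smarter allocation (memoization, depth limits, a learned evaluation) would have to come in.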

simianwords•9m ago

It's the same reason we are asked to write exams without calculators even though the real world does have them.

How you work without calculators is a proxy for real world competency.

10xDev•3m ago

Funny, you used probably the most useless form of benchmarking used on people as an example of "competency" in the real world.

simianwords•27m ago

Gemini tops all the benchmarks, but when it comes to real-world usage it is genuinely unusable.

goniszewski•16m ago

It’s not that bad. I’ve been using 3 Pro for some time now and I’m quite happy with how it works. Best paired with Opus and Codex, like most models, but it’s solid as a full-stack buddy.

bennyfreshness•19m ago

Wow. I'm generally in the AI maximalist camp, but adding Werewolf feels dangerous to me. Anyone who's played knows lying, deceit, and manipulation are often key to winning. Do we really want models climbing this benchmark?

bilekas•6m ago

Good question, but who's going to stop them?

AI already has a very creative imagination for role play, so this just adds to its arsenal.
