
Advancing AI Benchmarking with Game Arena

https://blog.google/innovation-and-ai/models-and-research/google-deepmind/kaggle-game-arena-updates/
34•salkahfi•1h ago

Comments

eamag•1h ago
Curious why they decided to curate poker hands instead of playing normal poker.
qsort•1h ago
Poker has very high variance; you'd need several hundred thousand hands to confidently say who's better. Also, you probably want to precompute the GTO-optimal play for benchmarking purposes.
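
A rough back-of-the-envelope sketch of the variance point above, not anything from the Game Arena setup itself. It assumes win rates are measured in big blinds per 100 hands (bb/100), that results per 100-hand block are roughly independent, and that the standard deviation is about 100 bb/100; that figure is an assumed ballpark for no-limit hold'em, not a number from the article. The function name `hands_needed` is made up for illustration.

```python
import math

def hands_needed(edge_bb_per_100: float,
                 sd_bb_per_100: float = 100.0,  # assumed ballpark, not from the article
                 z: float = 1.96) -> int:
    """Hands required before a true edge sits z standard errors above zero.

    The standard error of the mean over n blocks of 100 hands is sd / sqrt(n),
    so we need edge >= z * sd / sqrt(n), i.e. n >= (z * sd / edge) ** 2.
    """
    if edge_bb_per_100 <= 0:
        raise ValueError("edge must be positive")
    blocks = (z * sd_bb_per_100 / edge_bb_per_100) ** 2
    return math.ceil(blocks * 100)  # convert 100-hand blocks back to hands

if __name__ == "__main__":
    # Smaller edges need quadratically more hands to separate from noise.
    for edge in (10.0, 5.0, 2.0):
        print(f"edge {edge:>4} bb/100 -> about {hands_needed(edge):,} hands")
```

Under these assumptions an edge of 5 bb/100 already needs on the order of 150,000 hands, and halving the edge quadruples the requirement, which is consistent with the "several hundred thousand" estimate.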
eamag•50m ago
But now, because the hands are so strong, we don't see any folds.
johndhi•48m ago
But can't computers easily play several hundred thousand poker hands in a couple of hours?
tiahura•1h ago
How about nethack?
chaostheory•1h ago
Anecdotal data point, but recently I’ve found Gemini to perform better than ChatGPT when it came to intent analysis.
ofirpress•1h ago
This is a good way to benchmark models. We [the SWE-bench team] took the meta-version of this and implemented it as a new benchmark called CodeClash.

We have agents implement agents that play games against each other, so Claude isn't playing against GPT, but an agent written by Claude plays poker against an agent written by GPT, and this really tough task leads to very interesting findings on AI for coding.

https://codeclash.ai/

riku_iki•50m ago
Leaderboard looks very outdated.
Instantnoodl•23m ago
Cool to see Core War! I feel it's mostly forgotten by now. My dad still plays it to this day, though, and even attends tournaments.
63stack•9m ago
>this really tough task leads to very interesting findings on AI for coding

Are you going to share those with the class or?

cv5005•1h ago
My personal threshold for AGI is when an AI can 'sit down' - it doesn't need to have robotic hands, but it needs to only use visual and audio inputs to make its moves - and complete a modern RPG or FPS single player game that it hasn't pre-trained on (it can train on older games).
bob1029•9m ago
https://arxiv.org/abs/2507.03793
10xDev•40m ago
If AI can program, why does it matter if it can play Chess using CoT when it can program a Chess Engine instead? This applies to other domains as well.
Davidzheng•26m ago
They should be allowed to! In fact, I think a better benchmark would be to invent new games and test the models' ability to allocate compute to minimax/AlphaZero-style search on new games under compute constraints.
simianwords•9m ago
It's the same reason we are asked to write exams without using calculators, even though the real world does have them.
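
For readers unfamiliar with the minimax search mentioned above, here is a minimal sketch on a toy game. The game (a Nim variant: take 1 to 3 stones, whoever takes the last stone wins) and the names `legal_moves`, `value`, and `best_move` are chosen purely for illustration; nothing here comes from Game Arena, and the compute budget the comment talks about is not modeled.

```python
from functools import lru_cache

def legal_moves(stones: int) -> list[int]:
    """A player may take 1, 2, or 3 stones, but never more than remain."""
    return [k for k in (1, 2, 3) if k <= stones]

@lru_cache(maxsize=None)
def value(stones: int) -> int:
    """Negamax value for the player to move: +1 means a forced win, -1 a forced loss."""
    if stones == 0:
        # The opponent just took the last stone, so the player to move has lost.
        return -1
    # A position is worth the best outcome over all moves, where each child
    # position is scored from the opponent's point of view and then negated.
    return max(-value(stones - k) for k in legal_moves(stones))

def best_move(stones: int) -> int:
    """Pick the move that leaves the opponent in the worst position."""
    return max(legal_moves(stones), key=lambda k: -value(stones - k))

if __name__ == "__main__":
    for n in (4, 5, 7):
        print(n, value(n), best_move(n))
```

In this toy game the search rediscovers the known rule that multiples of 4 are lost for the player to move. A benchmark along the lines suggested above would presumably have to cap search effort, for example by limiting depth or node count, which this sketch leaves out.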

How you work without calculators is a proxy for real world competency.

10xDev•3m ago
Funny, you used probably the most useless form of benchmarking used on people as an example of "competency" in the real world.
simianwords•27m ago
Gemini tops all benchmarks, but when it comes to real-world usage it is genuinely unusable.
goniszewski•16m ago
It’s not that bad. I’ve been using 3 Pro for some time now and I’m quite happy with how it works. Best paired with Opus and Codex, like most models, but it’s solid as a full-stack buddy.
bennyfreshness•19m ago
Wow. I'm generally in the AI maximalist camp. But adding Werewolf feels dangerous to me. Anyone who's played knows lying, deceit, and manipulation are often key to winning. Do we really want models climbing this benchmark?
bilekas•6m ago
Good question, but who's going to stop them?

AI already has a very creative imagination for role play so this just adds extra to their arsenal.
