
Show HN: Watch LLMs play 21,000 hands of Poker

https://pokerbench.adfontes.io/run/Large_Models
29•jazarwil•19h ago
PokerBench is my attempt at a new LLM benchmark wherein frontier models play Texas Hold'em in an arena setting. It also features a simulator to view individual games and observe how the different models reason about poker strategy. Opus/Haiku, Gemini Pro/Flash, GPT-5.2/5 mini, and Grok 4.1 Fast Reasoning have all been included.

All code -> https://github.com/JoeAzar/pokerbench

Comments

tcpais•16h ago
Finally, a way to settle the model wars that actually matters: Texas Hold'em. That 3D replay view is sick! ♠♦ I spent way too long watching the replay on Game 2a58900d. It’s wild to see the chain of thought mapped against the betting rounds. It really exposes when a model is hallucinating a strong hand versus actually calculating pot odds. This 'PokerBench' might actually become the standard for measuring agentic risk-taking.
falloutx•14h ago
yeah the 3d view is amazing
VK-pro•14h ago
Very, very fun. I'm just glancing at this quickly at lunch, but are there any plans to incorporate tool use?
jazarwil•14h ago
Not at the moment, do you have something in mind?
thorawaytrav•14h ago
Do you have any idea why smaller models are better than large ones?
jazarwil•14h ago
I've seen some theories tossed around but I don't think I'm qualified to offer an authoritative answer. Gemini 3 Pro specifically seems to be consistently "tighter" and more passive than Flash.
falloutx•14h ago
Fun. Any idea what the cost per game would be? I am worried 160 isn't a big enough sample size.
jazarwil•14h ago
It greatly depends on the models. The 6-handed setup with Opus and Pro cost about $30/game. The 4-handed setup with just small models was $6/game. I'd love to run more but I already spent quite a bit as it is.
falloutx•13h ago
Yeah, that's costly. 160 games still gives 1000+ total decisions, and you can see some trends in how they think about the game state.
jazarwil•13h ago
Oh to be clear, there are ~21k hands here, and far more decisions than that.
Onavo•14h ago
What about the open source models? I remember from the trading benchmarks that DeepSeek performed pretty well.
jazarwil•14h ago
I didn't incorporate any open weights/source models just to limit the number of API providers I had to juggle, but it is just a config change if somebody wants to try a run with them.
alalani1•12h ago
Do you have any idea why the win rate for GPT-5.2 is higher than Gemini 3 Flash yet the former loses money while the latter earns money? Is it just bet sizing (betting more when it has a good hand) or something else?
jazarwil•12h ago
There are a few reasons that come to mind, such as winning larger pots on average, and also playing more hands by virtue of not getting knocked out as frequently.
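To make the pot-size point concrete, here is a toy sketch of how a higher win rate can coexist with a net loss. The chip amounts are made up for illustration and are not taken from the PokerBench results; `summarize` is a hypothetical helper, not part of the benchmark code.

```python
def summarize(results):
    """Return (win rate, net chips) for a list of per-hand chip deltas."""
    wins = sum(1 for r in results if r > 0)
    return wins / len(results), sum(results)

# Made-up data: player A wins many small pots but loses a few big ones.
player_a = [10] * 6 + [-50] * 4
# Made-up data: player B wins fewer hands, but the pots it wins are large.
player_b = [60] * 3 + [-10] * 7

rate_a, net_a = summarize(player_a)
rate_b, net_b = summarize(player_b)

print(f"A: win rate {rate_a:.0%}, net {net_a:+d} chips")
print(f"B: win rate {rate_b:.0%}, net {net_b:+d} chips")
```

In this toy data, A wins twice as often as B yet finishes down, because win rate ignores how much is won or lost per hand.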
tanvach•12h ago
People are reading into this a little too much; it looks to me like a random walk. You should try rerunning the trial (or running several in parallel) and see if the ranking is robust.
jazarwil•11h ago
Wdym exactly? I ran 163 games, are you suggesting more games or something else?
whattheheckheck•4h ago
You need to simulate 50k to 200k hands to get a true winrate
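A rough way to sanity-check hand counts like these is the standard error of the mean win rate. The sketch below assumes hands are independent and that the per-player standard deviation is about 100 big blinds per 100 hands; that figure is an assumed ballpark for no-limit hold'em, not something measured from PokerBench, and both function names are hypothetical.

```python
import math

def ci_half_width(hands, sd_per_100=100.0, z=1.96):
    """Approximate 95% CI half-width (in bb/100) after `hands` hands.

    sd_per_100 is an ASSUMED standard deviation in bb per 100 hands.
    """
    blocks = hands / 100  # number of 100-hand blocks
    return z * sd_per_100 / math.sqrt(blocks)

def hands_needed(half_width, sd_per_100=100.0, z=1.96):
    """Hands required for a 95% CI of +/- half_width bb/100."""
    blocks = (z * sd_per_100 / half_width) ** 2
    return math.ceil(blocks * 100)

if __name__ == "__main__":
    print(f"+/- {ci_half_width(21_000):.1f} bb/100 after 21k hands")
    print(f"{hands_needed(5):,} hands for +/- 5 bb/100")
    print(f"{hands_needed(10):,} hands for +/- 10 bb/100")
```

Under these assumptions, 21k hands gives an interval of about ±13.5 bb/100, and narrowing it to ±5 bb/100 takes on the order of 150k hands. Each model only plays a share of the total hands, so the per-model interval would be wider still.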
alfonsodev•2m ago
Really cool. I'm curious how it would compare against a deterministic bot that uses probability tables.
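For reference, a minimal sketch of what such a table-driven baseline could look like: it calls whenever its looked-up equity is at least the pot odds it is being offered. Everything here is hypothetical and not part of PokerBench. The table holds only three hands, and the equities are approximate heads-up values against a random hand, included purely for illustration.

```python
# ASSUMED, approximate equities vs. one random hand; illustrative only.
EQUITY_TABLE = {"AA": 0.85, "AKs": 0.67, "72o": 0.35}

def pot_odds(pot, to_call):
    """Fraction of the final pot the caller must contribute."""
    return to_call / (pot + to_call)

def decide(hand, pot, to_call):
    """Deterministic rule: call when equity >= pot odds, else fold."""
    if to_call == 0:
        return "check"  # nothing to pay, so never fold
    equity = EQUITY_TABLE.get(hand, 0.0)  # unknown hands are folded
    return "call" if equity >= pot_odds(pot, to_call) else "fold"

if __name__ == "__main__":
    print(decide("AA", pot=100, to_call=100))   # needs 50% equity
    print(decide("72o", pot=100, to_call=100))  # needs 50% equity
    print(decide("72o", pot=300, to_call=100))  # needs 25% equity
```

A real baseline would need a full equity table (or a simulator) per street and per number of opponents, plus a raising rule; this only shows the call/fold comparison.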