Exploring LLM Evaluation by Using Games

https://lmgame.org
3•Yuxuan_Zhang13•4h ago

Comments

Yuxuan_Zhang13•4h ago
Pokémon Red is becoming a go-to benchmark for testing advanced AIs such as Gemini. But is Pokémon Red really a good eval? We study this problem and identify three issues: (1) Navigation tasks are too hard. (2) Combat control is too simple. (3) Raising a strong Pokémon team is slow and expensive as an eval.

We find that most of these problems are not fundamental to games themselves but stem from how they have been used. We believe game-as-an-eval remains a compelling and underutilized evaluation strategy.

We introduce Lmgame Bench to standardize game-as-an-eval. More details and findings in our blogpost: https://lmgame.org/#/blog/pokemon_red

Microsoft Confirms Xbox Handheld Console – Official Release Set for 2025

https://www.ibtimes.co.uk/microsoft-confirms-xbox-handheld-console-official-release-set-2025-1735375
1•waltercool•1m ago•0 comments

Chai-2: zero-shot antibody design in a 24-well plate

https://www.chaidiscovery.com/news/introducing-chai-2
1•Murfalo•9m ago•0 comments

Rust CLIs with Clap

https://tucson-josh.com/posts/rust-clap-cli/
1•rajman187•14m ago•0 comments

FBI arrests one man, searches laptops: North Korean tech-worker scheme

https://www.cnn.com/2025/06/30/politics/fbi-laptop-north-korea
1•everybodyknows•20m ago•0 comments

The A.I. Frenzy Is Escalating. Again.

https://www.nytimes.com/2025/06/27/technology/ai-spending-openai-amazon-meta.html
1•bookofjoe•20m ago•1 comment

Inference-Time Scaling and Collective Intelligence for Frontier AI

https://sakana.ai/ab-mcts/
2•hardmaru•24m ago•0 comments

Israel was facing destruction at the hands of Iran, and how it saved itself

https://www.timesofisrael.com/israel-was-facing-destruction-at-the-hands-of-iran-this-is-how-close-it-came-and-how-it-saved-itself/
3•nsoonhui•27m ago•0 comments

Show HN: Dev platform for generating MCP Tools

1•GentoroAI•30m ago•0 comments

The Biggest Recent Union Wins Were in Art and Bacon

https://jacobin.com/2025/06/union-elections-sva-hormel-nlrb/
1•PaulHoule•39m ago•0 comments

Trump team axes contracts with publishing giant Springer Nature

https://www.nature.com/articles/d41586-025-02080-1
4•gnabgib•39m ago•0 comments

Prediction Consensus: What the Experts See Coming in 2025

https://www.visualcapitalist.com/prediction-consensus-what-the-experts-see-coming-in-2025/
1•gmays•40m ago•0 comments

Pluto is a unique dialect of Lua with a focus on general-purpose programming

https://github.com/PlutoLang/Pluto
1•90s_dev•41m ago•1 comment

Chdig – Dig into ClickHouse with TUI Interface

https://github.com/azat/chdig
1•zX41ZdbW•41m ago•0 comments

Ask HN: Best Tech Podcasts for Professionals?

2•giantg2•51m ago•0 comments

Ars reflects on Apollo 13 turning 30

https://arstechnica.com/science/2025/06/ars-reflects-on-apollo-13-turning-30/
1•Hooke•54m ago•0 comments

The Email Startup Graveyard: Why 80%+ of Email Companies Fail

https://forwardemail.net/en/blog/docs/email-startup-graveyard-why-80-percent-email-companies-fail
5•skeptrune•58m ago•0 comments

The Dollar Has Its Worst Start to a Year Since 1973

https://www.nytimes.com/2025/06/30/business/dollar-decline-trump.html
6•speckx•59m ago•2 comments

A mammoth tusk boomerang from Poland is 40k years old

https://arstechnica.com/science/2025/06/a-mammoth-tusk-boomerang-from-poland-is-40000-years-old/
1•ksec•1h ago•0 comments

Prompt injections for better peer reviews in papers on arXiv.org

https://asia.nikkei.com/Business/Technology/Artificial-intelligence/Positive-review-only-Researchers-hide-AI-prompts-in-papers
2•tkgally•1h ago•1 comment

John Carmack (Keen Technologies): Research Directions Upper Bound 2025 [video]

https://www.youtube.com/watch?v=3pdlTMdo7pY
1•amichail•1h ago•0 comments

400 million Windows PCs vanished in 3 years. Where did they all go?

https://www.zdnet.com/article/400-million-windows-pcs-vanished-in-3-years-where-did-they-all-go/
5•breve•1h ago•5 comments

Volvo delivers 5,000th electric semi with little fanfare

https://electrek.co/2025/06/29/volvo-delivers-5000th-electric-semi-with-little-fanfare-sending-a-big-message/
3•breve•1h ago•3 comments

Assessing and Modelling Temperature Forecasts with R and Stan

https://blog.foletta.net/post/2024-08-15-bom/
1•gjf•1h ago•0 comments

Show HN: Praxos – Context Management for AI Agents

8•mogusian•1h ago•0 comments

The Path to Medical Superintelligence

https://microsoft.ai/new/the-path-to-medical-superintelligence/
4•bulla•1h ago•1 comment

Probo vs. Vanta?

1•aleksdahlberg•1h ago•0 comments

A support group for Grief rooted in children's picture books

https://childrensbookforall.org/support-group/2
1•chbkall•1h ago•0 comments

Claude Code now supports Hooks

https://docs.anthropic.com/en/docs/claude-code/hooks
76•ramoz•1h ago•31 comments

GitHub billionth repo new owner

https://github.com/Red-Killer/shit
1•alexpadula•1h ago•0 comments

Importance of context management in AI NPCs

https://walterfreedom.com/post.html?id=ai-context-management
4•walterfreedom•1h ago•2 comments