frontpage.

Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation

https://arxiv.org/abs/2510.11977
1•adidoit•3h ago

Comments

adidoit•3h ago
https://x.com/sayashk/status/1978565190057869344

AI agents have been developed for complex real-world tasks from coding to customer service. But AI agent evaluations suffer from many challenges that undermine our understanding of how well agents really work. We introduce the Holistic Agent Leaderboard (HAL) to address these challenges. We make three main contributions. First, we provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs, reducing evaluation time from weeks to hours while eliminating common implementation bugs. Second, we conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. We validate the harness by conducting 21,730 agent rollouts across 9 models and 9 benchmarks in coding, web navigation, science, and customer service with a total cost of about $40,000. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs. Third, we use LLM-aided log inspection to uncover previously unreported behaviors, such as searching for the benchmark on HuggingFace instead of solving a task, or misusing credit cards in flight booking tasks. We share all agent logs, comprising 2.5B tokens of language model calls, to incentivize further research into agent behavior. By standardizing how the field evaluates agents and addressing common pitfalls in agent evaluation, we hope to shift the focus from agents that ace benchmarks to agents that work reliably in the real world.

Show HN: Shorter – search for shorter versions of your domain

https://shorter.dev
1•aanesn•4m ago•0 comments

Secure Boot bypass risk threatens nearly 200K Linux Framework laptops

https://www.bleepingcomputer.com/news/security/secure-boot-bypass-risk-on-nearly-200-000-linux-fr...
2•gnabgib•10m ago•0 comments

Show HN: AI body double for ADHD (voice agent that checks in while you work)

https://mindkite.app
2•mmuk2002•10m ago•1 comment

Practicing Difficulty

https://kirkhamilton.substack.com/p/practicing-difficulty
1•chr15m•13m ago•0 comments

How Charlie Chaplin used his uncanny resemblance to Hitler to fight fascism

https://www.npr.org/2025/10/15/nx-s1-5554555/the-great-dictator-charlie-chaplin-anniversary-hitler
2•1659447091•14m ago•0 comments

New Alzheimer's Treatment Clears Plaques from Brains of Mice Within Hours

https://www.sciencealert.com/new-alzheimers-treatment-clears-plaques-from-brains-of-mice-within-h...
6•amichail•21m ago•0 comments

China 'stole vast amounts' of classified UK documents

https://www.cityam.com/china-stole-vast-amounts-of-classified-uk-documents/
4•jnord•21m ago•0 comments

UTF-8, Explained Simply [video]

https://www.youtube.com/watch?v=vpSkBV5vydg
2•ibobev•24m ago•1 comment

China Issues Export Restriction Documents in Domestic File Format

https://www.scmp.com/economy/china-economy/article/3328782/sending-message-beijing-issues-documen...
2•Snoozus•29m ago•0 comments

How A Liver Goes from a Brain Dead Donor to a Living Recipient

https://www.asimov.press/p/liver
2•maxall4•31m ago•0 comments

Show HN: Smusic.ai – Free AI-Powered Music Generator in the Browser

https://www.smusic.ai
1•jerseywu•31m ago•0 comments

Hull Failure and Implosion of Submersible Titan

https://www.ntsb.gov:443/investigations/Pages/DCA23FM036.aspx
3•jonchang•33m ago•0 comments

Ask HN: What are your go-to websites for honest consumer electronics reviews?

4•ronbenton•36m ago•1 comment

Coral NPU: A full-stack platform for Edge AI

https://research.google/blog/coral-npu-a-full-stack-platform-for-edge-ai/
1•LER0ever•38m ago•0 comments

Show HN: Simulation of the Tech Industry in 2027

https://www.marbleos.com/?os=m
2•breadsniffer•40m ago•2 comments

All the Money, None of the Satisfaction

https://ofdollarsanddata.com/all-the-money-none-of-the-satisfaction/
2•throw0101d•54m ago•0 comments

Author Interview: Antenna Engineering and Radiowave Propagation with Matlab

https://blog.artechhouse.com/2025/10/09/exclusive-interview-from-our-author-osama-w-ata/
3•teleforce•55m ago•0 comments

Catching the Winds of Luck (2025) [video]

https://www.youtube.com/watch?v=P9rqjxLaQOY
2•suriya-ganesh•58m ago•1 comment

Towards Logic: The Language of AI

https://arxiv.org/abs/2510.12269
3•cmogni1•1h ago•0 comments

Tether CEO Paolo Ardoino: 'Bitcoin and Gold Will Outlast Any Other Currency'

https://www.coindesk.com/markets/2025/10/12/tether-ceo-paolo-ardoino-bitcoin-and-gold-will-outlas...
1•PaulHoule•1h ago•0 comments

I'm recommending my customers switch to Linux rather than upgrade to Windows 11

https://www.scottrlarson.com/publications/publication-windows-move-towards-surveillance/
78•trinsic2•1h ago•49 comments

EU gets what it asked for, there is no charger in the MacBook Pro box

https://appleinsider.com/articles/25/10/15/eu-gets-what-it-asked-for-there-is-no-charger-in-the-m...
4•josephcsible•1h ago•0 comments

How to "Teach" AI to Teenagers

https://christinaasquith.substack.com/p/how-to-teach-ai-to-teenagers
1•claynicholson•1h ago•0 comments

Craft, not fame, makes your story worth telling

https://herbertlui.net/craft-not-fame-makes-your-story-worth-telling/
3•herbertl•1h ago•0 comments

The Pentagon Press Corps Is Gone

https://www.cjr.org/news/the-pentagon-press-corps-is-gone.php
9•throw0101d•1h ago•0 comments

PostgREST: REST API for any Postgres database

https://docs.postgrest.org/en/v13/index.html
1•pykello•1h ago•0 comments

Who's Submitting AI-Tainted Filings in Court?

https://cyberlaw.stanford.edu/whos-submitting-ai-tainted-filings-in-court/
3•cratermoon•1h ago•0 comments

AI Agent Poetry

https://agentpoetry.com/
3•gnanagurusrgs•1h ago•0 comments

The 1960s show that was the original Black Mirror

https://www.bbc.com/culture/article/20251014-the-1960s-show-that-was-the-original-black-mirror
5•billybuckwheat•1h ago•1 comment

Show HN: StudySpaces – shared Pomodoro rooms to study together

https://studyspaces.org
1•n00bi3s2•1h ago•0 comments