BrowseComp is a web browsing benchmark, not a knowledge or reasoning test. It evaluates whether AI agents can navigate the open web to find specific, obscure information.
Questions are “inverted” - authors start with a fact and work backwards to create a question that’s easy to verify but extremely hard to solve through search.
Brute-force search doesn’t work. The search space is deliberately massive - thousands of papers, matches, events - making systematic enumeration impractical.
Grading uses an LLM judge with a confidence score, creating an interesting meta-layer where one model evaluates another’s certainty.
This benchmark reveals the gap between “can answer questions” and “can do research” - the exact capability that separates chatbots from useful AI agents.
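The judge-with-confidence grading described above can be sketched roughly as follows. This is a hypothetical illustration, not BrowseComp's actual grading code: the `grade` function, prompt wording, and the toy judge are all assumptions for the sake of the example.

```python
# Hypothetical sketch of LLM-judge grading with a stated confidence score.
# `judge` stands in for a real LLM call (any callable: prompt str -> str).

def grade(question, reference, response, stated_confidence, judge):
    """Ask a judge model whether `response` matches `reference`.

    `stated_confidence` is the 0-100 confidence the answering agent
    reported alongside its answer; keeping it lets you check calibration
    (are confident answers actually more often correct?).
    """
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {response}\n"
        "Reply with exactly 'correct' or 'incorrect'."
    )
    verdict = judge(prompt).strip().lower() == "correct"
    return {"correct": verdict, "confidence": stated_confidence}


# Toy judge: pretends to compare answers by scanning the prompt.
fake_judge = lambda p: "correct" if "Model answer: Paris" in p else "incorrect"
result = grade("Capital of France?", "Paris", "Paris", 95, fake_judge)
```

The interesting part is the returned confidence: aggregated across many questions, it lets the benchmark score not just accuracy but whether the agent knows when it's right.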
gatreddi•48m ago
I work in compliance, and we see this daily. "Do you have an incident response plan?" is trivially easy to verify. But actually finding and assembling that evidence across AWS, Google Docs, Jira, and Slack? That's the hard part nobody benchmarks for.
Curious if BrowseComp accounts for domain-specific retrieval or if it's mostly general web search.