Pure-vision browser agent scores 94% on WebVoyager (SOTA)

https://github.com/magnitudedev/webvoyager
5•anerli•7h ago

Comments

anerli•7h ago
Hey HN, Anders and Tom from Magnitude (YC S25) here. On our last Show HN post about our open-source browser agent, someone commented: "there are multiple similar projects like this posted here daily, and this one likely isn't the best". So we asked ourselves: are they right? To find out, we ran Magnitude on WebVoyager, a well-known benchmark for browser agents. We scored 94%, beating all other browser agents and making Magnitude state-of-the-art.

You can view the entire run here: https://magnitude-webvoyager.vercel.app/

The original WebVoyager benchmark was meant to demonstrate a new technique for interacting with the browser by annotating the DOM. Since then, vision models have come a long way in terms of accuracy and visual understanding. Our pure-vision approach with our framework and today's models surpasses the hybrid DOM strategies used by the original WebVoyager paper and other agents like browser-use.

So why does pure-vision beat hybrid DOM approaches?

- Generalizes far better - it handles canvas elements, iframes, drag-and-drop, precise text selection, and many other scenarios elegantly, where a hybrid DOM approach would struggle and need case-specific hacks

- Easier for the LLM - we think LLM performance is roughly proportional to prompt clarity. Compare a prompt containing a crowded screenshot covered in colored boxes plus a long list of element labels, where the model is asked to pick one, with a clean screenshot and the question "where do you want to click?" - the latter seems far easier
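
To make the prompt-clarity point concrete, here is a minimal sketch of the two prompt shapes. It is not Magnitude's actual code: `Element`, `hybrid_dom_prompt`, `pure_vision_prompt` and `parse_click` are hypothetical names made up for illustration, and the screenshot itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """One annotated DOM element in the hybrid approach (hypothetical shape)."""
    label: int
    tag: str
    text: str


def hybrid_dom_prompt(task: str, elements: list[Element]) -> str:
    # The text prompt grows by one line per interactive element on the page.
    lines = [f"Task: {task}", "The screenshot has numbered boxes. Pick one label:"]
    lines += [f"[{e.label}] <{e.tag}> {e.text}" for e in elements]
    return "\n".join(lines)


def pure_vision_prompt(task: str) -> str:
    # The text prompt stays the same size however busy the page is.
    return f"Task: {task}\nReply with the x,y pixel coordinates to click."


def parse_click(reply: str, width: int, height: int) -> tuple[int, int]:
    """Parse an 'x,y' reply and check that it falls inside the viewport."""
    x_str, y_str = reply.strip().split(",")
    x, y = int(x_str), int(y_str)
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"click ({x}, {y}) is outside the {width}x{height} viewport")
    return x, y
```

With a few hundred elements the hybrid prompt becomes a few hundred lines of labels on top of an annotated screenshot, while the pure-vision prompt is still two lines plus a clean screenshot.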

We believe another reason for our success is that we can still hook into the browser as needed. We can use browser-native actions like tab switching, watch network traffic to know when a page is ready, or use the DOM for other purposes like data extraction. Computer-use agents like Operator or Claude Computer Use, on the other hand, are limited to generic mouse and keyboard controls.
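
As a rough illustration of using network traffic as a readiness signal, here is a small sketch of a network-idle check. It assumes the browser layer reports request start and finish events with timestamps. The class name, the method names and the 0.5 second quiet window are all assumptions for the example, not how Magnitude implements it.

```python
class NetworkIdleTracker:
    """Treats a page as ready once no requests are in flight for a quiet period."""

    def __init__(self, quiet_seconds: float = 0.5):
        self.quiet_seconds = quiet_seconds
        self.in_flight: set[str] = set()
        self.last_activity = 0.0

    def on_request(self, request_id: str, now: float) -> None:
        # A request started: the page is busy again.
        self.in_flight.add(request_id)
        self.last_activity = now

    def on_response(self, request_id: str, now: float) -> None:
        # A request finished or failed: restart the quiet-period clock.
        self.in_flight.discard(request_id)
        self.last_activity = now

    def is_ready(self, now: float) -> bool:
        # Ready only if nothing is pending and the network has been quiet long enough.
        return not self.in_flight and (now - self.last_activity) >= self.quiet_seconds
```

An agent limited to generic mouse and keyboard controls has no equivalent signal and has to guess from the screenshot whether the page has finished loading.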

It's worth mentioning that WebVoyager is a strange and flawed benchmark. It contains many tasks that depend on the current date (and need their dates updated), tasks that depend on the time of day, and some tasks that are impossible or too ambiguous to properly evaluate. In the repo we document exactly which patches we made to the original WebVoyager benchmark so that each task is at least theoretically possible.
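
To show what "updating dates" can mean in practice, here is a hypothetical sketch that shifts every date in a task forward by the same number of whole years, so the earliest one lands in the future and date ranges keep their order. It is not the patch set from the repo. It assumes dates are written as valid ISO `YYYY-MM-DD` strings, which real WebVoyager tasks may not do, and `refresh_dates` and `shift_years` are made-up names.

```python
import re
from datetime import date

# Matches ISO dates such as 2024-07-01 (assumed format, see note above).
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def shift_years(d: date, years: int) -> date:
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Feb 29 moved into a non-leap year: fall back to Feb 28.
        return d.replace(year=d.year + years, day=28)


def refresh_dates(task: str, today: date) -> str:
    found = [date(int(y), int(m), int(d)) for y, m, d in ISO_DATE.findall(task)]
    if not found:
        return task
    # Pick the smallest shift that puts the earliest date after today.
    earliest = min(found)
    years = 0
    while shift_years(earliest, years) <= today:
        years += 1
    return ISO_DATE.sub(
        lambda m: shift_years(date(int(m[1]), int(m[2]), int(m[3])), years).isoformat(),
        task,
    )
```

Using one shift for the whole task matters: bumping each date independently could push a check-in date past its check-out date when today falls between the two.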

Why does this all matter? People are trying to adopt agents for real use cases, but they often fail to make it to production. We want to enable developers to build with production-ready browser agents - which is why it's important to get the fundamental interaction paradigm right. We think this benchmark is a step in the right direction, showing that pure-vision has best-in-class performance in the browser domain. Curious to hear what others think about this, would love to get your feedback!
