Show HN: TinyFish Web Agent (82% on hard tasks vs. Operator's 43%)

https://www.tinyfish.ai/blog/mind2web
14•gargi_tinyfish•1h ago
Enterprises need ~90% accuracy to deploy web agents. Until now, no agent has come close on real-world tasks. TinyFish is the first production-ready web agent. Here's the evidence.

Hard-task scores on Online-Mind2Web (300 tasks, 136 live websites, human-correlated judge):

- TinyFish: 81.9%
- OpenAI Operator: 43.2%
- Claude Computer Use: 32.4%
- Browser Use: 8.1%

Why not WebVoyager like everyone else?

Because it's broken. Easy tasks, Google Search shortcuts, and a judge that agrees with humans only 62% of the time. Browser Use self-reported 89% on WebVoyager — then scored 8.1% on hard tasks here.

We evaluated TinyFish against Online-Mind2Web instead — 300 real tasks, 136 live websites, three difficulty levels, and a judge that agrees with humans 85% of the time. No shortcuts. No easy mode.

The cookbook repo is open source: https://github.com/tinyfish-io/tinyfish-cookbook

You can see all the failed task runs here: https://tinyurl.com/tinyfish-mind2web

Happy to answer questions about the architecture, the benchmark methodology, or why we think WebVoyager scores are misleading.

Comments

shubham_saboo•1h ago
Agreed! WebVoyager is not a real benchmark, and it doesn't matter if someone saturates it.
zkitty•1h ago
Look at Browser Use. They self-reported 89% on WebVoyager. On hard tasks with a real benchmark, they score 8.1%. That's not a performance drop... that's a different product than what's being advertised.
agenticagent•1h ago
To be fair, this isn't just a Browser Use problem. Look at the drop-off for every agent as tasks get harder:

Operator goes from 83% easy → 43% hard. That's a 40-point cliff.

Claude Computer Use: 90% easy → 32% hard. 58-point drop.

Browser Use: 55% easy → 8% hard. Just falls off a cliff entirely.

TinyFish: 97.5% easy → 81.9% hard. 15-point drop.

The gap between easy and hard is where you see if a system actually works or if it's just good at simple tasks. Every other agent loses half its ability or more when tasks get complex. We lose 15 points.

That's the difference between "cool demo" and "I can actually ship this."
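The drop-off arithmetic above can be checked with a few lines of Python. This is only a sketch: the figures are the ones quoted in this comment and the post, and `drop_off` is a hypothetical helper name, not part of any benchmark tooling. Alongside the absolute point drop, it reports the share of the easy-task score that survives on hard tasks, which makes agents with different starting points easier to compare. With these inputs the TinyFish drop comes out at 15.6 points.

```python
# Easy- and hard-task success rates (percent) as quoted in this thread.
scores = {
    "TinyFish": (97.5, 81.9),
    "OpenAI Operator": (83.0, 43.0),
    "Claude Computer Use": (90.0, 32.0),
    "Browser Use": (55.0, 8.0),
}


def drop_off(easy, hard):
    """Return (absolute drop in points, fraction of the easy-task score retained)."""
    return round(easy - hard, 1), round(hard / easy, 3)


for agent, (easy, hard) in scores.items():
    points, retained = drop_off(easy, hard)
    print(f"{agent}: -{points} points, retains {retained:.0%} of its easy-task score")
```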

toliveistobuild•1h ago
Browser-Use: 8.1% on hard tasks
Skyzzd•1h ago
I didn't expect to be able to verify every result in the spreadsheet myself. Love this! I'll review the data and let you know if a particular run's success seems to be due to luck or if the judge might have made a mistake.
ivywho•1h ago
Interesting that every agent basically falls off a cliff on hard tasks except this one. Operator going from 83% to 43% is wild - that means it's literally coin-flipping on anything non-trivial.

The failure traces being public is a nice touch. Looked through a few and they're actual failures, not cherry-picked easy ones. Most companies in this space wouldn't do that.

Curious about latency though, what does a typical hard task execution look like in terms of wall clock time?

salmacodes•1h ago
Been trying to get Operator to handle a multi-step workflow for a client (login → navigate nested menus → fill form → confirm) and it just... breaks in the middle every time.

Seeing the hard-task numbers here makes that make a lot more sense.

Honestly the more interesting thing to me is the benchmark critique. WebVoyager being the default eval while only agreeing with humans 62% of the time is kind of damning for the whole space. Has anyone else tried running their agent against Online-Mind2Web?

codebyron•1h ago
The 15-point drop from easy to hard is the number that stands out to me.

That suggests the architecture handles state accumulation across steps without compounding errors — which is the thing that kills most agent pipelines. Every other agent here shows exponential degradation as task length increases, which is what you'd expect from a naive screenshot-action loop with no error recovery.

Looking at the cookbook repo — are you doing any kind of structured DOM extraction before passing to the model, or is this pure vision? Curious whether the hard-task performance comes from better perception, better planning, or better recovery when an action doesn't produce the expected state change.
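The compounding-error argument can be made concrete with a toy model. The sketch below assumes every step succeeds independently with the same probability and that one failed step fails the whole task (no recovery). That is a simplification for illustration, not something the thread or the benchmark measures; the function names and step counts are hypothetical.

```python
def task_success(per_step: float, steps: int) -> float:
    """Chance that all `steps` succeed when each one succeeds independently
    with probability `per_step` and there is no error recovery."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to reach `target` task success
    under the same independence assumption."""
    return target ** (1.0 / steps)


# Hypothetical task lengths, chosen only to show the shape of the curve.
for steps in (5, 10, 20):
    print(f"{steps} steps at 95% per step: {task_success(0.95, steps):.1%} task success")

print(f"Per-step reliability for 80% success over 20 steps: {required_per_step(0.8, 20):.2%}")
```

Under this model, task success decays geometrically with task length, so a small easy-to-hard drop would imply either very high per-step reliability or some form of recovery that breaks the independence assumption.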

houmercodes•1h ago
Genuine question about the eval methodology — how do you handle website non-determinism?

A lot of these sites serve different layouts, A/B tests, cookie consent modals, etc. across sessions. Did you control for that across agents, or is each agent hitting the live site independently at different times?

Because if so, some of the variance between agents could just be "Operator happened to get the GDPR popup and didn't know how to dismiss it." Would be useful to know if all agents were evaluated on the same snapshots or same time window.

kathyyyyyyyliu•13m ago
Promising numbers, especially if Online-Mind2Web better reflects real multi-step workflows than WebVoyager. Would love to see a quick breakdown of failure modes and variance by difficulty -- 80%+ on truly stateful web tasks is a strong claim. Either way, more realistic evals are a big win for the space.
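On the variance question, one way to gauge sampling noise for a reported success rate is a Wilson score interval. The sketch below is a generic statistical helper, not part of the benchmark's tooling. The thread does not state how many of the 300 tasks are rated hard, so the trial count in the usage lines is a purely hypothetical placeholder.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical counts: 82 successes out of 100 hard tasks (the real count is not given).
low, high = wilson_interval(82, 100)
print(f"Approximate 95% interval: {low:.1%} to {high:.1%}")
```

The interval covers only sampling noise. It does not account for judge errors or for site non-determinism between runs, which would need repeated runs per task to estimate.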

Show HN: Pgclaw – A "Clawdbot" in every row with 400 lines of Postgres SQL

https://github.com/calebwin/pgclaw
8•calebhwin•1h ago•5 comments

Show HN: Geo Racers – Race from London to Tokyo on a single bus pass

https://geo-racers.com/
46•pattle•8h ago•46 comments

Show HN: 20+ Claude Code agents coordinating on real work (open source)

https://github.com/mutable-state-inc/lean-collab
26•austinbaggio•2h ago•23 comments

Show HN: ListofDisks – hard drive price index across 7 retailers not just Amazon

3•listofdisks•1h ago•0 comments

Show HN: Inamate – Open-source 2D animation tool (alternative to Adobe Animate)

11•hactually•2d ago•11 comments

Show HN: Insider Trading Alerts – Open-Market Buys&Sells from SEC Form 4 Filings

https://stockalert.pro/alerts/insider-transactions
3•Adanos•2h ago•0 comments

Show HN: TidesDB – A persistent key-value store optimized for modern hardware

https://github.com/tidesdb/tidesdb
8•alexpadula•2h ago•3 comments

Show HN: PardusDB – SQLite-like vector database in Rust

https://github.com/JasonHonKL/PardusDB
2•JasonHEIN•2h ago•0 comments

Show HN: Agent Tools – 136 deterministic data tools for AI agents (MCP/A2A/REST)

https://github.com/AtmaticAI/agent-tools
2•sathish-mg•2h ago•1 comment

Show HN: ClawDeploy – OpenClaw deployment for non-technical users

https://clawdeploy.com
2•gregzeng95•3h ago•0 comments

Show HN: CodeRLM – Tree-sitter-backed code indexing for LLM agents

https://github.com/JaredStewart/coderlm/blob/main/server/REPL_to_API.md
75•jared_stewart•1d ago•32 comments

Show HN: AI agents play SimCity through a REST API

https://hallucinatingsplines.com
202•aed•3d ago•69 comments

Show HN: Agent Alcove – Claude, GPT, and Gemini debate across forums

https://agentalcove.ai
60•nickvec•22h ago•25 comments

Show HN: BetterDB – Valkey/Redis monitoring that persists what servers forget

3•kaliades•4h ago•0 comments

Show HN: Got VACE working in real-time – 30fps on a 5090

https://daydream.live/real-time-video-generation-control
10•cmuir•4h ago•0 comments

Show HN: A FIRE calculator that verifies or determines your retirement number

https://retirenumber.com/try
5•marcus-verus•4h ago•0 comments

Show HN: It's 2026 and setting up a Mac for development is still mass googling

https://github.com/openbootdotdev/openboot
3•openbootdotenv•4h ago•1 comment

Show HN: HN stories cited most in comments

https://hacker-backlinks.browserbox.io
2•keepamovin•5h ago•0 comments

Show HN: Triclock – A Triangular Clock

https://triclock.franzai.com/
54•franze•1d ago•14 comments

Show HN: A segmentation model client-side via WASM – free background removal

https://qtoolkit.dev/tools/background-remover/
2•shivaodin•5h ago•0 comments

Show HN: Rowboat – AI coworker that turns your work into a knowledge graph (OSS)

https://github.com/rowboatlabs/rowboat
198•segmenta•2d ago•56 comments

Show HN: Double blind entropy using Drand for verifiably fair randomness

https://blockrand.net/live.html
21•rishi_blockrand•17h ago•15 comments

Show HN: I built a macOS tool for network engineers – it's called NetViews

https://www.netviews.app
239•n1sni•2d ago•60 comments

Show HN: Camera Follow Focus Ring Generator

https://www.followyourfocus.xyz/
2•dRuivo•6h ago•0 comments

Show HN: Distr 2.0 – A year of learning how to ship to customer environments

https://github.com/distr-sh/distr
96•louis_w_gk•2d ago•29 comments

Show HN: JavaScript-first, open-source WYSIWYG DOCX editor

https://github.com/eigenpal/docx-js-editor
124•thisisjedr•3d ago•44 comments

Show HN: Renovate – The Kubernetes-Native Way

https://github.com/mogenius/renovate-operator
41•JanLepsky•1d ago•15 comments

Show HN: BlockHost OS – Autonomous VM provisioning through smart contracts

https://github.com/mwaddip/blockhost
3•mwaddip•7h ago•0 comments

Show HN: SCPN Fusion Core – Tokamak plasma SIM and neuromorphic SNN control

https://github.com/anulum/scpn-fusion-core
2•anulum•8h ago•0 comments