Computer Use Is 45x More Expensive Than Structured APIs

https://reflex.dev/blog/computer-use-is-45x-more-expensive-than-structured-apis/

77•palashawas•1h ago

Comments

taormina•55m ago

The interface designed for humans is poor for AI needs? And the interface designed for programmatic use is easier for the AI to use? In other news, the sky is blue and water is wet.

palashawas•50m ago

Yep, everyone knows computer use is more expensive. This is about quantifying the gap

sudb•53m ago

I'm pretty unsurprised that the vision agent did worse. I'd be interested in a comparison between the different tools that now exist to let LLMs drive browsers (e.g. vercel's agent-browser, the relatively new dev-browser[1], etc.)

There are use cases where the vision agent is the more obvious, or the only, choice though, e.g. proprietary/locked-down desktop apps that lack an automation layer.

1. https://github.com/SawyerHood/dev-browser

palashawas•45m ago

Interesting! I'll play around with agent-browser and update this article if anything comes up

cjbarber•53m ago

I think of computer use as like last mile delivery. APIs and bash and such are the efficient logistics networks. Both have different benefits. Obviously, use the efficient methods when you can.

svnt•49m ago

> This is not a model problem. The vision agent was reasoning about a rendered page and had no signal that the page wasn't showing everything.

> To make the comparison apples-to-apples, we rewrote the vision prompt as an explicit UI walkthrough, naming the sidebar items, tabs, and form fields the agent should interact with at each step. Fourteen numbered instructions covering the navigation the agent had failed to figure out on its own.

This is a model problem, though. Because the model failed to understand it could scroll, you forced it to consume multiples of the tokens. Could you come up with an alternative here?

Do you know what the vision model was trained on? Because often people see “vision model” and think “human-level GUI navigator” when afaik the latter has yet to be built.

palashawas•35m ago

This is a fair point.

The models frequently failed for many reasons on earlier runs, and the browser-use prompt ended up being pretty granular. I'll add a couple of runs that include a scroll instruction to the repo today and see how that compares

Pretty hard to guess what Anthropic trained sonnet on, but general multimodals are what people are using to drive similar tools today, whether GUI-trained or not, so the comparison still holds, for now

aurareturn•47m ago

In an agentic world, the OS needs to be completely rethought. For example, every single app functionality should be exposable via an API while remaining human friendly.

I think OpenAI designing their own phone is the next logical step. I hope they succeed which should bring major competition to Apple and Android.

mtoner23•46m ago

OpenAI should not design a phone... They should try making money first

sophacles•40m ago

Nonsense. Don't you know how bubbles work? Everyone does massive rushes for all the low hanging and medium hanging fruit. Then the bubble pops and the randomized carnage of companies big and small being destroyed is sifted through by the next wave of companies actually intended to make money.

The good ideas and the bad ideas don't signal success in a bubble, nor does making money or not. It's random, and any notion of "this was a good business model and that was bad" is post-hoc rationalization. The number of people who make fun of pets.com but order from chewy.com is a prime example of this.

joshstrange•43m ago

> I think OpenAI designing their own phone is the next logical step. I hope they succeed which should bring major competition to Apple and Android.

This is not going to happen, or if it does it will just be Android (like Samsung reskins/modifies it) and it will certainly use Google Play Services.

reorder9695•40m ago

Presumably on Linux at least apps could just expose a DBus API? The machinery for this is already in place as far as I can tell.

lazide•38m ago

This is like insisting - after the problem turns out to be harder than thought - that the world's roads need to be completely redone to make them self-driving friendly, so self-driving can work.

Isn’t the whole ‘promise’ of AI that it doesn’t need any of those things?

tikhonj•34m ago

Everything exposed programmatically would have been great even without agents—the NixOSes and Emacses of the world show just how amazing a fully flexible and programmable world would be—but I'm glad that the advent of AI is getting people invested in this vision :P

QuercusMax•30m ago

Lots of apps actually do have all their functionality exposable via an API - but it's an internal API that's hidden from the user.

planb•30m ago

This will not happen. None of the existing apps people use daily on their phones have any incentive to support this. Social media wants the people to doomscroll, shopping apps and booking sites want to use their own dark patterns to make people believe they get a special discount if they buy _now_ and everything else just wants users to see the ads. Why on earth would they offer convenient hooks for AI chatbots?

input_sh•10m ago

It's even more fascinatingly dumb to have this discussion like 2 or so years after every major platform decided to kill any notion of 3rd party clients they used to support.

Yes, in an ideal world, that'd be great for both humans and LLMs, but we are about as far from that ideal world as we could be. You can't even do some of the "advanced actions" as a human with human-level reflexes without encountering a captcha, but sure, all of a sudden, everyone will just decide to make their bread and butter that is data easier to explore via an LLM.

jackphilson•7m ago

because the social media sites that do will outcompete once people get personal AI coaches that tell them to use technology that is better for them.

donaldjbiden•4m ago

How is an AI posting on your social media better for you?

throwaway27448•29m ago

We have a much better chance of an ai-addressable Harmony OS version than of OpenAI making a serious competitor.

dummydummy1234•27m ago

Why not use the same accessibility features?

CodingJeebus•24m ago

One of the most seductive (and destructive) forces in software is the desire to rewrite from scratch because rewrites never, ever, ever go as planned. With AI, we're now thinking it's a good idea to rewrite the entire platform from the ground-up. Wild.

convolvatron•14m ago

except every single piece of progress that we have is the result of trying to do things a different way. so unless you really think we've reached the pinnacle of operating system design, there has to be some room for this?

dist-epoch•23m ago

The future is "dark OSs" - OSes with no human users.

pmontra•22m ago

I still have to understand what my AI agents could do that I don't want to do myself. Buy stuff? No thanks, I want to see what I buy. I think that they are 99% a solution in search of a problem.

sbrother•9m ago

Same. Well the biggest thing I don't want to do that they could help with is work. But in the cases where it can do that for me, there's no world where that benefit goes to me rather than my employer.

switchbak•17m ago

"In an agentic world, the OS needs to be completely rethought" - if AI is progressing as fast as we think it is, I don't think we'll be interested in waiting for the world to rebuild all the legacy tooling from the OS up. For new stuff, that'd be great.

I imagine the AIs will get a lot better at intercepting things at an intermediate level - API calls under the hood, etc. Probably much better (and cheaper) vision abilities, and perhaps even deeper integration into the machine code itself. It's really hard to anticipate what an advanced model will be capable of 5 years from now.

shiandow•6m ago

Ah yes. The trains everywhere approach to self driving cars.

donaldjbiden•5m ago

We used to have this. It was called OLE Automation.

moralestapia•45m ago

This is obvious. The problem is that not everything has an API, while everything has a human-oriented UI.

palashawas•30m ago

Right - we did this benchmark because we launched a plugin that makes APIs programmatically from an app's human-oriented UI (from the event handlers, to be specific). So any app that has a human-oriented UI now has an API.

The benchmark is a more generally interesting part of the launch materials, so I figured it had its own separate home here.

Havoc•38m ago

Isn't it possible to somehow wire this into the window manager? Wayland or whatever. Have it speak the native window lang rather than crunch the pixels? At least for the majority.

I can see the appeal of the pixel route given its universality, but wow, that seems ugly on efficiency

QuercusMax•18m ago

imagine, if you will, that we had a windowing system that's built on Postscript... lots of folks thought it was a super awesome idea, and built NeXTSTEP around it. https://en.wikipedia.org/wiki/Display_PostScript

or even one based on PDF like OSX: https://en.wikipedia.org/wiki/Quartz_2D

donaldjbiden•3m ago

Wayland only has pixels. It was designed to get rid of all the X11 cruft.

_boffin_•34m ago

What I don't understand about "computer use" is why they're not just grabbing the window handles and storing them to determine what should be clicked after the first few iterations of using a specific application. If a new case / path / whatever is found, drop back to screen grabbing and bounding boxes, then figure out the handles that are there and store them afterwards.

idk.. not really thought out too much, but has to be better
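
A rough sketch of that idea in Python, under stated assumptions: `HandleCache`, `locate_by_vision`, and `handle_is_valid` are hypothetical names made up for illustration, not any real computer-use API. The vision step and the handle check are passed in as plain callables, so the only thing shown here is the cache-first, vision-fallback control flow.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

Key = Tuple[str, str]  # (application name, action label)


@dataclass
class HandleCache:
    """Remembers which UI handle served each (app, action) pair."""

    # Slow path (hypothetical): screenshot + bounding boxes, returns a handle id.
    locate_by_vision: Callable[[str, str], str]
    # Cheap check (hypothetical): does a stored handle still exist in the app?
    handle_is_valid: Callable[[str], bool]
    _handles: Dict[Key, str] = field(default_factory=dict)
    vision_calls: int = 0

    def resolve(self, app: str, action: str) -> str:
        key = (app, action)
        cached = self._handles.get(key)
        # Fast path: reuse the stored handle while it is still valid.
        if cached is not None and self.handle_is_valid(cached):
            return cached
        # New or stale path: fall back to vision, then store the result.
        self.vision_calls += 1
        handle = self.locate_by_vision(app, action)
        self._handles[key] = handle
        return handle
```

With this shape, repeated actions in a known app only pay the vision cost the first time, and again whenever a stored handle stops being valid (for example after the UI changes).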

faangguyindia•32m ago

I saw Codex was screenshotting, then clicking around. I just stopped it and never used that again.

Using CLI tools is much faster and token-efficient. I developed ten apps in the last two months. One reached 10,000+ monthly active users.

I ask Codex to generate SVG line by line and backtrack edit, ask it to use Inkscape to generate icons, etc...

I developed all this on $20 codex sub.

ceejayoz•31m ago

Claude does this too, with the Chrome extension.

It breaks like 80% of the time for me, and it's incredibly slow. Having it use Playwright (bonus: can test in FF/Saf too) was a big improvement.

embedding-shape•29m ago

I think it's the third or fourth time I've seen you bragging on HN about how many apps you're able to develop with AI now. Care to link any of them, especially where we can see the actual code that you've produced here? Without being able to see actual results, I'm not sure what you want people to take away from your repeated comments.

faangguyindia•19m ago

I only write here because people are spreading doomerism about AI here and I am excited about the future.

Well, I am competing with GeoIP providers like MaxMind.

I developed a custom traceroute and ping service to geolocate IPs with very high accuracy, beating products like Digital Element, MaxMind, and ipinfo.

These companies have huge teams. But my 3-person company already beat them.

Code doesn't matter much, it's not an open-source project.

My free app is http://macrocodex.app which I've developed along with a fitness coach.

I am currently beating companies with 20-30 developers and closing more deals while having 1/10th of the staff.

I am simply very excited about all this.

Nobody cares how you solve the problem, or if your code is ugly. As long as it's reliable and without downtime, and you aren't breaking things and causing your customers headaches, you are winning.

Even before AI, bad code existed. Not every company had a 10x developer writing beautiful idiomatic Rust code.

AI is just a tool; people who are trying to generate a whole codebase with it are doing something very wrong. You can write code faster with AI provided you understand its strengths and weaknesses

dist-epoch•24m ago

It doesn't matter.

Electron uses 10x more RAM than regular apps. But it's so convenient.

Python is 100x slower than C. It's in the top 3 of languages now.

Worse but more convenient always wins.

gowld•22m ago

Confusing title? "Computer Use" is actually "Browser vision"?

antves•21m ago

I think one main point is that not all "computer use" is the same: the harness and agentic experience matter a lot. A poorly designed API experience can actually be _less_ efficient than a well designed browser or computer use experience

In particular, the vision-based approach used in the evaluation has clear limitations with regard to efficiency due to its nature (small observation window, heterogeneous modality)

At Smooth we use a hybrid DOM/vision approach and we index very strongly on small models. An interesting fact is that UIs are generally designed to minimize ambiguity and supply all and only the necessary context as token-efficiently as possible, and the UX is cabled up to abstract the APIs in well-understood interface patterns, e.g. dropdowns or autocompletes. This makes navigation easier and that's why small models can do it, which is another dimension that must be considered

We typically recommend using APIs/MCP where available and well designed, but it's genuinely surprising how token-efficient agentic browser navigation can actually be

overgard•15m ago

I've been thinking of things I'd want an agent for recently. The problem is, everything I think of is something that requires using quite a few different websites, saving a lot of data securely, and working with a lot of sensitive accounts (my email, etc.)

The problem is, all the tasks are essentially: a) things agents probably just can't do, and b) things that absolutely cannot afford to be hallucinated or otherwise fucked up. So far the tasks I've thought of:

- Taxes. So it needs a lot of sensitive information to get W2's. Since I have to look up a lot of this stuff in the physical world anyway, it's not like I can just let it run wild.

- Background check for a new job. It took me 3 hrs to fill out one of them (mostly because the website was THAT bad). Being myself, I already was making mistakes just forgetting things like move in dates from 10 years ago, and having to do a lot of searching in my email for random documents. No way I'm trusting an agent with this.

- Setting up an LLC. Nope nope nope. There's a lot of annoying work involved with this, but I'm not trusting an LLM to do this.

Anyway, I guess my point is that even if an LLM was good at using my computer (so far, it seems like it wouldn't be), the kind of things I'd want an agent for are things that an LLM can't be trusted with.

rootcage•15m ago

The best use cases I've seen for computer/browser use are for legacy SaaS/software. For example, hotels use archaic Property Management Systems (PMS) and they're required by corporate to use them and pay for them. These companies can barely keep the product alive; they definitely aren't incentivized to maintain an API. In such a case a browser use agent seems to be the best (only) way.

noprocrasted•11m ago

Wouldn't using a coding agent to build a screenscraper be better?

merlindru•7m ago

I'm building something that fixes this exact problem[1].

The landing page doesn't advertise it yet, but essentially, I give agents a small set of tools to explore apps' surfaces, and then an API over common macOS functions, especially those related to accessibility.

The agent explores the app, then writes a repeatable workflow for it. Then it can run that workflow through CLI: `invoke chrome pinTab`

Why accessibility? Well, turns out that it's just a good DOM in general. It's structure for apps. Not all apps implement it perfectly, but enough do to make it wildly useful.

[1] https://getinvoke.com - note that the landing page is targeted towards creatives right now and doesn't talk about this use case yet

janalsncm•6m ago

Wall clock time tells me everything I need to know. The vision model took almost 20 minutes to do the thing that Sonnet did in 20 seconds.

The only reason you wouldn’t choose an API is if it wasn’t viable.