There are usecases where the vision agent is the more obvious, or only choice though, e.g. prorprietary/locked-down desktop apps that lack an automation layer.
> To make the comparison apples-to-apples, we rewrote the vision prompt as an explicit UI walkthrough, naming the sidebar items, tabs, and form fields the agent should interact with at each step. Fourteen numbered instructions covering the navigation the agent had failed to figure out on its own.
This is a model problem, though. Because the model failed to understand it could scroll, you forced it to consume multiples of the tokens. Could you come up with an alternative here?
Do you know what the vision model was trained on? Because often people see “vision model” and think “human-level GUI navigator” when afaik the latter has yet to be built.
The models frequently failed for many reasons on earlier runs, and the browser-use prompt ended up being pretty granular. I'll add a couple of runs that include a scroll instruction to the repo today and see how that compares
Pretty hard to guess what Anthropic trained sonnet on, but general multimodals are what people are using to drive similar tools today, whether GUI-trained or not, so the comparison still holds, for now
I think OpenAI designing their own phone is the next logical step. I hope they succeed which should bring major competition to Apple and Android.
The good ideas and the bad ideas don't signal success in a bubble, nor does making money or not. Its random and any notion of "this was a good business model and that was bad" is post-hoc rationalization. The number of people who make fun of pets.com but order from chewy.com is a prime example of this.
This is not going to happen, or if it does it will just be Android (like Samsung reskins/modifies it) and it will certainly use Google Play Services.
Isn’t the whole ‘promise’ of AI that it doesn’t need any of those things?
Yes, in an ideal world, that'd be great for both humans and LLMs, but we are about as far from that ideal world as we could be. You can't even do some of the "advanced actions" as a human with human-level reflexes without encountering a captcha, but sure, all of a sudden, everyone will just decide to make their bread and butter that is data easier to explore via an LLM.
The modern javascript ecosystem is a perfect example of what happens when everyone tries to rebuild from scratch and it's a nightmare.
I imagine the AIs will get a lot better at intercepting things at an intermediate level - API calls under the hood, etc. Probably much better (and cheaper) vision abilities, and perhaps even deeper integration into the machine code itself. It's really hard to anticipate what an advanced model will be capable of 5 years from now.
I could imagine an AI future where agentic shopping companies who promise me the best deal are pitted against Walmart and Amazon, trying to algorithmically squeeze me for $2 more- just two bots playing a cat and mouse game to save me a few bucks.
For some reason a lot of tech ends up in these antagonistic monopolies- Apple wants to sell privacy aware devices as a product feature, Google wants give you mail and maps, but sell your data. Despite any appearances neither give a shit about you, even if you benefit from the dynamic.
It’s why we did this benchmark :) - reflex team member
The benchmark is a more generally interesting part of the launch materials, so I figured it had its own separate home here.
I can see the appeal in pixel route given universality but wow that seems ugly on efficiency
or even one based on PDF like OSX: https://en.wikipedia.org/wiki/Quartz_2D
idk.. not really thought out too much, but has to be better
Using CLI tools is much faster and token-efficient. I developed ten apps in the last two months. One reached 10,000+ monthly active users.
I ask Codex to generate SVG line by line and backtrack edit, ask it to use Inkscape to generate icons, etc...
I developed all this on $20 codex sub.
It breaks like 80% of the time for me, and it's incredibly slow. Having it use Playwright (bonus: can test in FF/Saf too) was a big improvement.
Well I am competing with geoip provider like maxmind.
I developed custom traceroute and ping service to geolocate IPs with very high accuracy beating products like digital element, maxmind, ipinfo
These companies have huge teams. But my 3 people company already beat them.
Code doesn't matter much, it's not an opensource project.
My free app is http://macrocodex.app which I've developed along with a fitness coach.
I am currently beating companies with 20-30 developers and closing more deals while having 1/10th of the staff.
I am simply very excited about all this.
Nobody cares show you solve the problem, or if your code is ugly. As long as it's reliable and without downtime, you aren't breaking things and causing your customer headache, you are winning.
Even before AI, bad code existed. Not every company had 10x developer writing beautiful idiomatic rust code.
AI is just a tool, people who are trying to generate whole codebase with it are doing something very wrong. You can write code faster with AI provided you understand its strength and weakness
Heh, you're in for a rude awakening, sometime in the future :) But I won't spoil the surprise, you clearly have made up your mind about what to focus on.
> My free app is http://macrocodex.app which I've developed along with a fitness coach.
Crazy, this app you've run for ~1-2 months has 10K active users already, even though there is zero info about who runs it, zero reviews, and says "Download on the App Store" on the landing page even though you then ask people to use the web app, impressive.
I don't think anyone said using AI can't produce a ton of code really quickly, and no one is finding that difficult to manage either. But most of us software engineers are trying to build long-lasting codebases with AI too, then "less === better" typically, so it's not about being able to spit out features as fast as possible, but avoid the evergrowing codebase from collapsing on top of itself, and each prompt not getting slower and slower, but as fast as on a greenfield project.
Sounds like you've found the holy grail of being able to avoid that, kudos if so. Judging by you giving zero care to how the design and architecture actually is, I kind of find that hard to believe. But, if it works for you, it works for you, not up to me or others to dictate how you build stuff, hope you enjoy it, however you build stuff :)
Electron uses 10x more RAM than regular apps. But it's so convenient.
Python is 100x slower than C. It's in the top 3 of languages now.
Worse but more convenient always wins.
In particular, the vision-based approach used in the evaluation has clear limitations with regard to efficiency due to its nature (small observation window, heterogeneous modality)
At Smooth we use an hybrid DOM/vision approach and we index very strongly on small models. An interesting fact is that UIs are generally designed to minimize ambiguity and supply all and only the necessary context as token-efficient as possible, and the UX is cabled up to abstract the APIs in well-understood interface patterns e.g. dropdowns or autocompletes. This makes navigation easier and that's why small models can do it, which is another dimension that must be considered
We typically recommend using APIs/MCP where available and well designed, but it's genuinely surprising how token-efficient agentic browser navigation can actually be
The problem is, all the tasks are essentially: a) things agents probably just can't do, and b) things that absolutely cannot afford to be hallucinated or otherwise fucked up. So far the tasks I've thought of:
- Taxes. So it needs a lot of sensitive information to get W2's. Since I have to look up a lot of this stuff in the physical world anyway, it's not like I can just let it run wild.
- Background check for a new job. It took me 3 hrs to fill out one of them (mostly because the website was THAT bad). Being myself, I already was making mistakes just forgetting things like move in dates from 10 years ago, and having to do a lot of searching in my email for random documents. No way I'm trusting an agent with this.
- Setting up an LLC. Nope nope nope. There's a lot of annoying work involved with this, but I'm not trusting an LLM to do this.
Anyway, I guess my point is that even if an LLM was good at using my computer (so far, it seems like it wouldn't be), the kind of things I'd want an agent for are things that an LLM can't be trusted with.
1. things you wouldn’t otherwise bother doing
2. things where it otherwise would get stuck iterating on hacky workarounds doomed to fail
“Reverse engineer this app/site so we can do $common_task in one click”, “by the way, I’m logged in to $developer_portal, so try @Browser Use if you’re stuck”, etc.
I just had Codex pull user flows out of a site I’m working on and organize them on a single page. It found 116. I went in and annotated where I wanted changes, and now it’s crunching away fixing them all. Then it’ll give me an updated contact sheet and I can do a second pass.
I’d never do this sort of quality pass manually and instead would’ve just fixed issues as they came up, but this just runs in the background and requires 15 minutes of my time for a lot of polish.
The landing page doesn't advertise it yet, but essentially, I give agents a small set of tools to explore apps' surfaces, and then an API over common macOS functions, especially those related to accessibility.
The agent explores the app, then writes a repeatable workflow for it. Then it can run that workflow through CLI: `invoke chrome pinTab`
Why accessibility? Well, turns out that it's just a good DOM in general. It's structure for apps. Not all apps implement it perfectly, but enough do to make it wildly useful.
[1] https://getinvoke.com - note that the landing page is targeted towards creatives right now and doesn't talk about this use case yet
i so far haven't found any application that doesn't.
all you're able to get out, as far as i can tell, is the length of the entered password.
i tend to think of invoke as "an API over macOS apps" tho... doesn't `invoke finder shareAndCopyLink` read very nicely? :P
The only reason you wouldn’t choose an API is if it wasn’t viable.
When you think of everything it takes for an AI to use what the article calls a "vision agent" then it seems as if using a purpose-made API ought to be MANY orders of magnitude faster.
I don't think any new app should ever be specifically designed for AI to interact with them through computer use
taormina•1h ago
palashawas•1h ago