Cuz right now it's way too slow... perform an action, then read the results, then wait for the next tool call, etc.
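Roughly, each step is a full round trip: screenshot, model call, action, wait for the page, repeat. A minimal sketch of that serial loop (take_screenshot, ask_model, and execute_action are hypothetical stand-ins for whatever capture, inference, and browser-control layer you use):

    import time

    def run_agent_loop(goal, max_steps=20):
        # Each iteration is one full model round trip, which is why it feels slow.
        history = []
        for step in range(max_steps):
            start = time.time()
            screenshot = take_screenshot()                  # hypothetical: capture current UI state
            action = ask_model(goal, screenshot, history)   # hypothetical: one model call per step
            if action.name == "done":
                break
            execute_action(action)                          # hypothetical: click/type/scroll, then wait for the page
            history.append(action)
            print(f"step {step}: {action.name} in {time.time() - start:.1f}s")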
Watching Google's own Computer Use model successfully solve Google's own CAPTCHA (this was definitely Gemini doing the work, not a Browserbase feature) was pretty wild.
Here's a screenshot on my blog: https://simonwillison.net/2025/Oct/7/gemini-25-computer-use-...
more like continued employment.
On a serious note, what the fuck is happening in the world?
[Looks around and sees people not making APIs for everything]
Well, that didn't work.
Want to make something that can book every airline? Better be able to navigate a website.
It'll never happen, so companies need to deal with the reality we have.
Adoption of new technology is slow due to risk aversion; it's very rare for people to just tear up what they already have and re-implement it from the ground up with the new technology. We always have to shoehorn new technology into old systems to prove it first.
There are just so many factors that get solved by working with what already exists.
Obviously much harder with a UI than with agent events similar to those below.
Do you think callbacks are how this gets done?
But my bet is that we will not deploy a single agent into any real environment without deterministic guarantees. Hooks are a means...
Browserbase with hooks would be really powerful: governance beyond RBAC (but of course enabling relevant guardrailing as well - "does the agent have permission to access this SharePoint right now, within this context, to conduct action X?").
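A sketch of what such a pre-action hook could look like (ActionRequest, policy.allows, and audit_log are hypothetical names; the point is that the allow/deny decision is deterministic code that runs before every browser action, not another model call):

    from dataclasses import dataclass

    @dataclass
    class ActionRequest:
        agent_id: str
        action: str    # e.g. "open_url", "click", "upload_file"
        target: str    # e.g. a SharePoint URL
        context: dict  # task, tenant, time window, ...

    def pre_action_hook(req: ActionRequest, policy) -> bool:
        # Deterministic gate evaluated before the agent is allowed to act.
        allowed = policy.allows(req.agent_id, req.action, req.target, req.context)  # hypothetical policy engine
        audit_log(req, decision="allow" if allowed else "deny")                     # hypothetical audit sink
        return allowed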
I would love to meet with you, actually; my shop cares intimately about agent verification and governance. We're soon to release the tool I originally designed for Claude Code hooks.
I think screenshots are a really good and robust idea. It bothers the more structured-minded people, but apps are often not built that well. They are built to the point where things look fine and people are able to use them, and no further. I'm pretty sure people who rely on accessibility systems have lots of complaints about this.
Reminds me of an anecdote where Amazon invested however many person-lifetimes in building AI for Alexa, only to discover that alarms, music, and weather make up the large majority of things people actually use smart speakers for. They're making these things worse at their main jobs so they can sell the sizzle of AI to investors.
If such AI tools make it possible to automate this soul-crushing drudgery, it will be great. I know that you can technically script things with Selenium, AutoHotkey, and the like. But you can imagine that it's a nonstarter in a regular office. This kind of tool could make work like that much more efficient. And it's not like it will obviate the jobs entirely (at least not right away). These offices often have immense backlogs and are understaffed as it is.
Knowing it's technically possible is one thing, but giving it a short command and seeing it go log in to a site, scroll around, reply to posts, etc. is eerie.
Also, it tied me at Wordle today, making the same mistake I did on the second-to-last guess. Too bad you can't talk to it while it's working.
> I am back in the Google Sheet. I previously typed "Zip Code" in F1, but it looks like I selected cell A1 and typed "A". I need to correct that first. I'll re-type "Zip Code" in F1 and clear A1. It seems I clicked A1 (y=219, x=72) then F1 (y=219, x=469) and typed "Zip Code", but then maybe clicked A1 again.
Unfortunately it really struggled in the demos for me. It took nearly 18 attempts to click the comment link in the HN demo, each one a few pixels off.
This was long before computer vision was mature enough to do anything like that, and I found out that there are instead magnetic systems that can detect cars passing over - trivial hardware and software - and I concluded that my approach was just far too complicated and expensive.
Similarly, when I look at computers, I typically want the ML/AI system to operate on structured data that is codified for computer use. But I guess the world is complicated enough, and computers got fast enough, that having an AI look at a computer screen and move/click a mouse makes sense.
See this section: https://googledevai.devsite.corp.google.com/gemini-api/docs/...
And the repo has a sample setup for using the default computer use tool: https://github.com/google/computer-use-preview
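For reference, configuring the tool looks roughly like this with the google-genai Python SDK (a sketch; the exact type names types.ComputerUse and Environment.ENVIRONMENT_BROWSER, plus the preview model id, are assumptions on my part - check the linked docs and repo for the current ones):

    from google import genai
    from google.genai import types

    client = genai.Client()  # reads the API key from the environment

    config = types.GenerateContentConfig(
        tools=[types.Tool(
            computer_use=types.ComputerUse(                    # assumed type name from the docs
                environment=types.Environment.ENVIRONMENT_BROWSER
            )
        )]
    )

    response = client.models.generate_content(
        model="gemini-2.5-computer-use-preview-10-2025",       # assumed preview model id
        contents="Open news.ycombinator.com and open the top story's comments.",
        config=config,
    )
    # The response proposes a UI action (click/type/etc.) that your own browser
    # harness executes before sending back a fresh screenshot for the next step.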