Cuz right now it's way too slow... perform an action, then read the results, then wait for the next tool call, etc.
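Roughly, each step is a full round trip: screenshot, model call, action, wait for the page, repeat. A minimal sketch of that serial loop (take_screenshot, ask_model, and execute_action are hypothetical stand-ins for whatever capture, inference, and browser-control layer you use):

    import time

    def run_agent_loop(goal, max_steps=20):
        # Each iteration is one full model round trip, which is why it feels slow.
        history = []
        for step in range(max_steps):
            start = time.time()
            screenshot = take_screenshot()                  # hypothetical: capture current UI state
            action = ask_model(goal, screenshot, history)   # hypothetical: one model call per step
            if action.name == "done":
                break
            execute_action(action)                          # hypothetical: click/type/scroll, then wait for the page
            history.append(action)
            print(f"step {step}: {action.name} in {time.time() - start:.1f}s")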
Watching Google's own Computer Use model successfully solve Google's own CAPTCHA (this was definitely Gemini doing the work, not a Browserbase feature) was pretty wild.
Here's a screenshot on my blog: https://simonwillison.net/2025/Oct/7/gemini-25-computer-use-...
more like continued employment.
On a serious note, what the fuck is happening in the world?
[Looks around and sees people not making APIs for everything]
Well, that didn't work.
Want to make something that can book every airline? Better be able to navigate a website.
It'll never happen, so companies need to deal with the reality we have.
Adoption of new technology is slow due to risk aversion; it's very rare for people to just tear up what they already have and re-implement it from the ground up with the new technology. We always have to shoehorn new technology into old systems to prove it first.
There are just so many factors that get solved by working with what already exists.
Obviously much harder with a UI than with agent events similar to those below.
Do you think callbacks are how this gets done?
But my bet is that we will not deploy a single agent into any real environment without deterministic guarantees. Hooks are a means...
Browserbase with hooks would be really powerful: governance beyond RBAC (but of course enabling relevant guardrailing as well - "does the agent have permission to access this SharePoint right now, within this context, to conduct action X?").
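A sketch of what such a pre-action hook could look like (ActionRequest, policy.allows, and audit_log are hypothetical names; the point is that the allow/deny decision is deterministic code that runs before every browser action, not another model call):

    from dataclasses import dataclass

    @dataclass
    class ActionRequest:
        agent_id: str
        action: str    # e.g. "open_url", "click", "upload_file"
        target: str    # e.g. a SharePoint URL
        context: dict  # task, tenant, time window, ...

    def pre_action_hook(req: ActionRequest, policy) -> bool:
        # Deterministic gate evaluated before the agent is allowed to act.
        allowed = policy.allows(req.agent_id, req.action, req.target, req.context)  # hypothetical policy engine
        audit_log(req, decision="allow" if allowed else "deny")                     # hypothetical audit sink
        return allowed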
I would love to meet with you, actually; my shop cares intimately about agent verification and governance. We're soon to release the tool I originally designed for Claude Code hooks.
I think screenshots are a really good and robust idea. It bothers the more structured-minded people, but apps are often not built that well. They are built to the point where things look fine and people are able to use them, and no further. I'm pretty sure people who rely on accessibility systems have lots of complaints about this.
Reminds me of an anecdote where Amazon invested however many person-lifetimes in building AI for Alexa, only to discover that alarms, music, and weather make up the large majority of things people actually use smart speakers for. They're making these things worse at their main jobs so they can sell the sizzle of AI to investors.
If such AI tools make it possible to automate this soul-crushing drudgery, it will be great. I know that you can technically script things with Selenium, AutoHotkey, and the like. But you can imagine that it's a nonstarter in a regular office. This kind of tool could make work like that much more efficient. And it's not like it will obviate the jobs entirely (at least not right away). These offices often have immense backlogs and are understaffed as it is.
Knowing it's technically possible is one thing, but giving it a short command and seeing it go log in to a site, scroll around, reply to posts, etc. is eerie.
Also, it tied me at Wordle today, making the same mistake I did on the second-to-last guess. Too bad you can't talk to it while it's working.
> I am back in the Google Sheet. I previously typed "Zip Code" in F1, but it looks like I selected cell A1 and typed "A". I need to correct that first. I'll re-type "Zip Code" in F1 and clear A1. It seems I clicked A1 (y=219, x=72) then F1 (y=219, x=469) and typed "Zip Code", but then maybe clicked A1 again.
Unfortunately it really struggled in the demos for me. It took nearly 18 attempts to click the comment link in the HN demo, each one a few pixels off.
This was long before computer vision was mature enough to do anything like that, and I found out that there are instead magnetic systems that can detect cars passing over - trivial hardware and software - and I concluded that my approach was just far too complicated and expensive.
Similarly, when I look at computers, I typically want the ML/AI system to operate on structured data that is codified for computer use. But I guess the world is complicated enough, and computers got fast enough, that having an AI look at a computer screen and move/click a mouse makes sense.
See this section: https://googledevai.devsite.corp.google.com/gemini-api/docs/...
And the repo has a sample setup for using the default computer use tool: https://github.com/google/computer-use-preview
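For reference, configuring the tool looks roughly like this with the google-genai Python SDK (a sketch; the exact type names types.ComputerUse and Environment.ENVIRONMENT_BROWSER, plus the preview model id, are assumptions on my part - check the linked docs and repo for the current ones):

    from google import genai
    from google.genai import types

    client = genai.Client()  # reads the API key from the environment

    config = types.GenerateContentConfig(
        tools=[types.Tool(
            computer_use=types.ComputerUse(                    # assumed type name from the docs
                environment=types.Environment.ENVIRONMENT_BROWSER
            )
        )]
    )

    response = client.models.generate_content(
        model="gemini-2.5-computer-use-preview-10-2025",       # assumed preview model id
        contents="Open news.ycombinator.com and open the top story's comments.",
        config=config,
    )
    # The response proposes a UI action (click/type/etc.) that your own browser
    # harness executes before sending back a fresh screenshot for the next step.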