There was a ton of this stuff before Chrome or WebKit even existed! Back in my day, we used Selenium and hated it. (I was lucky enough to start after Mercury...)
having to add ids to elements is one of those classic tradeoffs -- the alternative was to use css or xpath selectors, which can be even worse, maintenance-wise. i'm secretly hoping the apps pumped out by ai code-gen tools like Lovable or Claude Code will automagically generate element test-ids and the tests for you, and we never have to worry about it again.
that might not matter if the agent is re-finding the element between sessions anyway, but then you're paying a lookup cost (time + tokens) each time, compared to just calling document.getElementById() on an explicit id.
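to make the tradeoff concrete, here's a rough sketch (the element names are invented, not from any real app):

```typescript
// invented example: an explicit id is a single, stable lookup the agent can
// reuse across sessions without re-discovering the element each time
const submit = document.getElementById("checkout-submit");

// a test-id (or any css/xpath selector) still works, but the agent has to
// re-resolve it, and re-finding it every session is where the time + token
// cost comes from
const submitByTestId = document.querySelector<HTMLButtonElement>(
  '[data-testid="checkout-submit"]'
);
```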
we can't modify the dom to add IDs because we'd get detected by bot blockers very quickly. we're gradually trying to get rid of all DOM tampering entirely for that reason.
We had a complex user registration workflow on an international bank's website that supported multiple nationalities and languages.
I set up Selenium tests to detect breakages, because it was all but humanly impossible to retest every workflow after each sprint.
It brought back sanity to the team and QA folks.
Tools that came after certainly benefited from Selenium's lessons.
I still deeply appreciate these tools, even though I also find them a bit frustrating.
fun-fact: i've never used mercury. when i came up with "selenium" -- it was because a colleague saw an early demo and said it had the potential to "kill mercury". (spoiler alert!)
but in that moment, i hadn't heard of mercury before, so i had to google it. i then also spent a few extra cycles googling around for a "cure for mercury poisoning" just so i could continue the conversation with that colleague with a proto-dad-joke... and landed on a page about selenium supplements. things obviously got out of hand.
i didn't want to call the project "selenium". i preferred the name "check engine", but people started calling it "selenium" anyway. i only wish nice things for the mercury team -- the only thing i know about them is that hp acquired mercury for $4.5B. so i hope they blissfully don't care about me or my bad dad-jokes.
but again... i didn't realize there was an entire testing tools industry at that moment. all i knew was that i had a testing problem for my complicated web app -- and the consensus professional advice at the time was "yeah, no. don't use javascript in the browser -- it's too hard to test". (another spoiler.) also, (if i'm remembering correctly) mercury was ie/windows only... and i needed something that supported apple and mozilla/firefox. it felt like zero vendors at the time cared about anything that wasn't internet explorer or wasn't windows. so i had to chart my own course pretty quickly.
long story long: "you either die a hero, or you live long enough to see yourself become the villain" - harvey dent
Ha! Yeah, it's no worries at all, I think it's fine to not like things. Everybody is different. And for these sorts of things, it's kind of a "there are two kinds of tools, the ones people complain about, and the ones they don't use" sort of situation: if I didn't think it was valuable, I just wouldn't use it. But it's valuable enough to use despite the griping at times.
Thank you for the story!
> In the past 3 weeks I ported Playwright to run completely inside a Chrome extension without Chrome DevTools Protocol (CDP) using purely DOM APIs and Chrome extension APIs, I ported a TypeScript port of Browser Use to run in a Chrome extension side panel using my port of Playwright, in 2 days I ported Selenium ChromeDriver to run inside a Chrome Extension using chrome.debugger APIs which I call ChromeExtensionDriver, and today I'm porting Stagehand to also run in a Chrome extension using the Playwright port. This is following using VSCode's core libraries in a Chrome extension and having them drive a Chrome extension instead of an electron app.
The most difficult part is managing the lifecycle of Windows, Pages, and Frames and handling race conditions when automating a user's browser, for example when the user switches to another tab or closes it.
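For a sense of what that lifecycle handling involves, here's a minimal sketch (not the actual port's code) of an extension service worker tracking tabs so in-flight automation steps can be cancelled:

```typescript
// minimal sketch, not the actual port: cancel in-flight automation steps
// when the user closes a tab or a navigation invalidates our element handles
const pendingSteps = new Map<number, AbortController>();

chrome.tabs.onRemoved.addListener((tabId) => {
  pendingSteps.get(tabId)?.abort(); // tab is gone: abort anything targeting it
  pendingSteps.delete(tabId);
});

chrome.tabs.onUpdated.addListener((tabId, changeInfo) => {
  if (changeInfo.status === "loading") {
    pendingSteps.get(tabId)?.abort(); // navigation started: handles are stale
  }
});
```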
Personally, I have a browser extension running in my user/personal browser instance that my agents use (with rate limits) in order to avoid all the captchas and blocks, basically. Everything else I've tried ultimately ends up getting blocked. But then I'm also doing some heavy caching, so most agent "browse" calls don't even reach out to the internet, since they find and use stuff already stored locally.
1. There are 3,500,000,000 instances of Chrome desktop being used. [0]
2. A Chrome Extension can be installed with a click from the Chrome Web Store.
3. It is closer to the metal, so it runs extremely fast.
4. It can run completely contained on the user's machine.
5. It's just one user automating their web-based workflows, which makes it harder for bot protections to stop, and with a human in the loop any hang-ups and snags can be resolved by that human.
6. Chrome extensions now have a side panel that stays put in the window during navigation and tab switching. It is exactly like using the Cursor or VSCode side panel copilots.
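For reference, wiring up that side panel is only a few lines in the service worker (a sketch, assuming the `sidePanel` permission and a `side_panel.default_path` entry in the manifest):

```typescript
// sketch: open the copilot side panel when the toolbar icon is clicked;
// the panel then stays put across navigations and tab switches
chrome.runtime.onInstalled.addListener(() => {
  chrome.sidePanel.setPanelBehavior({ openPanelOnActionClick: true });
});
```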
Some limitations:
1. Can't automate the ChatGPT console, because it checks whether events were generated by the user agent by testing if the `isTrusted` property on event objects is true. (The bypass is using `chrome.debugger` and the ChromeExtensionDriver I created.)
2. Can't take full-page screen captures; however, it is possible to very quickly take visible captures of the viewport. Currently I scroll and stitch the images together if a full-page capture is required (see the sketch after this list). There are other Chrome extension APIs that can capture video and audio, but they require the user to click a button, so they aren't useful for computer-vision automation. (The bypass is once again `chrome.debugger` and the ChromeExtensionDriver I created.)
3. The Chrome DevTools Protocol allows intercepting and rewriting scripts and web pages before they are evaluated. This was possible with Manifest V2, but the ability was removed in Manifest V3, which we still hear about today with the ad-block extensions.
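The scroll-and-stitch approach from point 2 looks roughly like this (a sketch only, assuming the `scripting` permission and host access; stitching the slices into one image is left out):

```typescript
// sketch: capture the full page as viewport-sized slices
async function capturePageInSlices(tabId: number, windowId: number) {
  const [metrics] = await chrome.scripting.executeScript({
    target: { tabId },
    func: () => ({
      total: document.documentElement.scrollHeight,
      viewport: window.innerHeight,
    }),
  });
  const { total, viewport } = metrics.result!;

  const slices: string[] = [];
  for (let y = 0; y < total; y += viewport) {
    await chrome.scripting.executeScript({
      target: { tabId },
      func: (top: number) => window.scrollTo(0, top),
      args: [y],
    });
    // captureVisibleTab only sees the current viewport, hence the scroll loop
    slices.push(await chrome.tabs.captureVisibleTab(windowId, { format: "png" }));
  }
  return slices; // stitch these offscreen, e.g. with an OffscreenCanvas
}
```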
I feel like, given those limitations, having a popup dialog that directs the user to do an action will work, as long as it automates 98% of the user's workflows. Moreover, a lot of this automation should require explicit user acknowledgment before proceeding.
We need the agent to be able to drive 1Password, Privacy.com, etc. to request per-task credentials, change adblock settings, get 2FA codes, and more.
The holy grail really is CDP + control over browser launch flags + an extension bridge to get to the more ergonomic `chrome.*` APIs. We're also working on a custom Chromium fork.
1. Get all the things you want.
2. Can create as many 'browser context' personas as you want
3. Use the Electron app renderer for the UI to manage profiles, proxies for each profile, automate making Gmail accounts for each profile, etc.
4. I forgot: it is very nice using the `--load-extension=/path/to/extension` flag to ship Chrome extension files inside the Electron app bundle so that the launched browser will have a cool copilot side panel (see the launch sketch after point 5).
> Extensions are ok but they have limitations too, for example you cannot use extensions to automate other extensions.
5. If you know the extension IDs, it is easy to set up communication between the two. I already drive a Chrome extension using VSCode's core libraries, and it would be a week or two of work to implement a light port of the VSCode host extension API for a Chrome extension. Nonetheless, I'd rather have an Electron app manage extensions the same way VSCode does.
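As a sketch of point 4, launching a per-persona browser with the bundled extension might look like this from the Electron main process (the binary path and directory names are placeholders, not a real setup):

```typescript
// hypothetical sketch: launch a profile-scoped Chrome with the bundled
// copilot extension and an optional CDP port
import { spawn } from "node:child_process";
import path from "node:path";
import { app } from "electron";

const extensionDir = path.join(process.resourcesPath, "copilot-extension");
const profileDir = path.join(app.getPath("userData"), "profiles", "persona-1");

spawn("/usr/bin/google-chrome", [
  `--user-data-dir=${profileDir}`,    // one persona per profile directory
  `--load-extension=${extensionDir}`, // ships the side-panel copilot
  "--remote-debugging-port=9222",     // optional: CDP access alongside it
]);
```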
Shipping a whole Electron app is not a priority at the moment though; our revenue comes from cloud API users, and there we only need our custom Chrome fork. There's no point messing with Electron and extension bridges when we can add custom CDP commands to talk to `chrome.*` APIs directly.
Karma-like approaches are where I'm at (execute in the browser).
Why? I would think any cross-process communication through the CDP websocket would have imperceptible overhead compared to what already takes long in the browser: a ton of HTTP I/O
What is Karma? What are you executing in the browser?
CDP does add a good chunk of latency. Depends on what your threshold is.
An image grab is around 60 ms, and a snapshot can range from 40 ms to 500 ms.
The latency is pure data movement. It's like the difference between using RAM vs. an SSD vs. data pulled from the internet.
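For anyone who wants to reproduce numbers like these, one rough (unofficial) way is to time raw CDP calls through Puppeteer's session; here "snapshot" is assumed to mean an accessibility-tree dump:

```typescript
// rough measurement sketch, not the commenter's actual setup
import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://example.com");

const client = await page.target().createCDPSession();
await client.send("Accessibility.enable");

const t0 = performance.now();
await client.send("Page.captureScreenshot", { format: "jpeg", quality: 70 });
console.log(`image grab: ${(performance.now() - t0).toFixed(1)} ms`);

const t1 = performance.now();
await client.send("Accessibility.getFullAXTree");
console.log(`snapshot: ${(performance.now() - t1).toFixed(1)} ms`);

await browser.close();
```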
Good luck guys!
That isn't how you launch a product.
By the very nature of how Playwright is built, we can't contribute to it: it runs inside a JS subprocess and does not expose a bunch of CDP APIs that we NEED (for example, to make cross-origin iframes work).
Personally, I have not found their team to be the easiest to work with on GitHub. I would've loved to use Puppeteer instead; their team is quite reasonable, but they abandoned their Python bindings and we want to stay in Python.
related side-note: have you had to interact with the core chrome / cdp devs?
I have not spoken to people that work directly on CDP yet, but I believe we have a call with them soon!
i'm just stubborn enough to find out, though. and i still have a few contacts at the googleplex...
I built something like an automation system in pure CDP to shave ms off. But mine is a real-time user-interaction system plus automation, not pure AI automation.
Doesn't make much sense to shave ms when an LLM call is hundreds of ms and that's the only "user".
This post is like talking about Grafana and not mentioning Nagios.
Selenium offered headless mode and integrated with third-party providers like BrowserStack, which ran acceptance tests in parallel in the cloud. It seems like what browser-use.com is doing is a modern-day version with many more features and adaptability.
i like that there are new startups in the space, though. things were getting pretty stale and uninspired.
Thank you for building Selenium.
There are no extra checks needed; it's by a significant margin the most reliable method to see current state.
I run snapshots at 10-20 fps though, plus the same for parallel image capture.
I've been wondering if I should release just this part of my system as open source; it seems like I'm not alone in finding how complex this all is.
I could launch yet another automation framework!
it's a bummer, but also a market reality... the best way to get more devs to care about non-chrome browsers is to get more people to use non-chrome browsers. easier said than done, though.
Now that I got my snarky remark out of the way:
Puppeteer uses CDP under the hood. Just use Puppeteer.
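For reference, the happy path really is this small (Puppeteer drives the browser over CDP for you):

```typescript
import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://example.com");
console.log(await page.title()); // "Example Domain"
await browser.close();
```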
I don't know how well it would work for that use-case, but I've used it before, for example, to write a web-crawler that could handle client-side rendering.
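Presumably something along these lines, where the crawl waits for client-side rendering to settle before scraping (the URL and selector are placeholders):

```typescript
import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();

// wait until the network goes quiet so client-side rendering has finished
await page.goto("https://example.com/spa", { waitUntil: "networkidle0" });

// now the DOM reflects what the page's JS actually rendered
const links = await page.$$eval("a[href]", (anchors) =>
  anchors.map((a) => (a as HTMLAnchorElement).href)
);
console.log(links);

await browser.close();
```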
We're currently using Cypress for some automated testing on a recent project and it's extremely brittle. We're considering moving to Playwright or Puppeteer, but not sure if that will fix the brittleness.