https://learn.microsoft.com/en-us/dotnet/api/microsoft.visua...
Used it to write programs that would run in the background and spook my friends by "typing" movie quotes at random times on their computers.
It’s how I accidentally learned the Win32 API
Q: How do you identify the AOL window? A: Look for an app with titlebar = "America[space][space]Online"
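For anyone curious, that lookup is one Win32 call. Here's a minimal sketch using Python's ctypes on Windows (the double space in the title is the detail that matters):

```python
import ctypes

user32 = ctypes.windll.user32  # Windows-only

# AOL's title bar famously contained two spaces: "America  Online"
hwnd = user32.FindWindowW(None, "America  Online")
if hwnd:
    print(f"Found AOL window, handle: {hwnd:#x}")
else:
    print("AOL window not found")
```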
CV and direct mouse/keyboard interactions are the "base" interface, so if you solve this problem, you unlock just about every automation use case.
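As a concrete sketch of that "base" interface, here's roughly what it looks like with pyautogui (my choice of library, not from the comment above; note the `confidence` parameter requires opencv-python to be installed):

```python
import pyautogui  # pip install pyautogui; needs a desktop session

# Locate a UI element by template-matching a screenshot of it,
# then interact through the same channels a human would: mouse and keyboard.
point = pyautogui.locateCenterOnScreen("submit_button.png", confidence=0.9)
if point is not None:
    pyautogui.click(point)
    pyautogui.write("hello from the base interface", interval=0.05)
```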
(I agree that if you can get good, unambiguous, actionable context from accessibility/automation trees, that’s going to be superior)
Thankfully, MDN spells out the spec for minimal functionality clearly, and our company values meeting accessibility requirements, so we will revisit and flesh out what we're missing.
Also, I wanna give props (ha) to the Storybook team for bringing accessibility testing into their ecosystem; it really does help to have something checking our implementations.
It was a somewhat naive attempt, and they didn't look like they'd perform well without much additional work. I wonder if there are models that do much better, maybe whatever OpenAI uses internally for Operator, but I'm not clear how bulletproof that one is either.
These models weren't trained specifically for UI object detection and grounding, so it's plausible that if they were trained on just UI long enough, they would actually be quite good. Curious if others have insight into this.
I guess I can answer, "yes I think so."
yodon•4mo ago
Preferably one that is similarly able to understand and interact with web page elements, in addition to app elements and system elements.
CharlesW•4mo ago
For web page elements, you could drive the browser via AppleScript's `do JavaScript` or use a dedicated browser MCP (Chrome DevTools MCP, Playwright MCP).
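If you'd rather skip MCP, the same idea works with Playwright directly. A minimal Python sketch, targeting elements by accessible role instead of pixels:

```python
from playwright.sync_api import sync_playwright  # pip install playwright

# First run: `playwright install chromium` to fetch the browser binary
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    # Query by accessible role/name rather than screen coordinates
    page.get_by_role("link", name="More information").click()
    print(page.title())
    browser.close()
```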
nikisweeting•4mo ago
https://github.com/browser-use/browser-use
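Roughly what using it looks like, per their README at the time (a sketch; the task and model choice are my own examples):

```python
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI

async def main():
    # The agent drives a real browser, deciding actions from page state
    agent = Agent(
        task="Find the current top story on Hacker News",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    await agent.run()

asyncio.run(main())
```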