Using the UIA tree as the currency for LLMs to reason over always made more sense to me than computer-vision, screenshot-based approaches. It's true that not all software exposes itself correctly via UIA, but almost all the important stuff does. VS Code is one notable exception (though you can turn on accessibility support in the settings).
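For anyone curious what the tree actually looks like, here's a minimal, untested sketch that dumps it using pywinauto's UIA backend (pywinauto is my choice for illustration, not something from the article):

```python
# Sketch: recursively print the UIA tree for every top-level window.
# Requires `pip install pywinauto`; the depth limit is arbitrary.
from pywinauto import Desktop

def dump_uia_tree(element, depth=0, max_depth=4):
    """Print control type, name, and automation id at each level."""
    info = element.element_info
    print("  " * depth + f"{info.control_type} | {info.name!r} | {info.automation_id!r}")
    if depth < max_depth:
        for child in element.children():
            dump_uia_tree(child, depth + 1, max_depth)

if __name__ == "__main__":
    # In practice you'd filter to one app rather than walking everything.
    for window in Desktop(backend="uia").windows():
        dump_uia_tree(window)
```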
freedomben•42m ago
Agreed. I've noticed that when ChatGPT parses screenshots it writes out some Python code to do it, and at least in the tests I've done (things like "what is the RGB value of the bullet points in the list?") it ends up writing and rewriting the script five or so times and then gives up. I haven't tried other models, so I don't know whether their approach is unique, but it definitely feels fragile and slow to me.
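For context, the generated scripts tend to look roughly like this (a made-up example; the path and coordinates are placeholders, which is exactly why the approach is so brittle):

```python
# Hypothetical pixel-probing script of the kind ChatGPT writes:
# sample the color at a *guessed* location of a bullet glyph.
from PIL import Image

img = Image.open("screenshot.png").convert("RGB")
x, y = 42, 180  # guessed coordinates of the first bullet point
print(img.getpixel((x, y)))  # e.g. (51, 51, 51)
```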
electroly•57m ago
Looks awesome. I've attempted my own implementation, but I never got it to work particularly well; "Open Notepad and type Hello World" was a triumph for me. I landed on the UIA tree + annotated screenshot combination too, but mine was too primitive, and I used GPT, which isn't as good at image tasks as the Gemini used here. Great job!
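In case it helps anyone else trying this: the "annotated screenshot" half of my version boiled down to drawing numbered boxes over element rectangles so the model could refer to controls by id. A rough sketch (the rectangles here are placeholders; in practice they come from each UIA element's bounding rectangle):

```python
# Draw numbered red boxes over element rectangles on a screenshot.
from PIL import Image, ImageDraw

def annotate(screenshot_path, rects):
    img = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, (left, top, right, bottom) in enumerate(rects):
        draw.rectangle((left, top, right, bottom), outline=(255, 0, 0), width=2)
        draw.text((left + 3, top + 3), str(i), fill=(255, 0, 0))
    return img

# Placeholder rectangles standing in for real UIA bounding boxes.
annotate("screenshot.png", [(10, 10, 120, 40), (10, 60, 200, 90)]).save("annotated.png")
```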
tiahura•16m ago
LLMs do a pretty good job of using pywin32 for programs that expose COM automation, like Office.
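That route skips the UI entirely. An untested sketch of what the generated code usually looks like for Word (the dispatch name and properties are standard Office COM, but treat the details as illustrative):

```python
# Drive Word via COM with pywin32; no screenshots or UIA needed.
import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.Visible = True
doc = word.Documents.Add()
doc.Content.Text = "Hello from COM automation"
doc.SaveAs(r"C:\temp\hello.docx")  # placeholder path
doc.Close()
word.Quit()
```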
yodon•2h ago
Preferably one that is similarly able to understand and interact with web page elements, in addition to app elements and system elements.
CharlesW•1h ago
For web page elements, you could drive the browser via AppleScript's `do JavaScript` or use a dedicated browser MCP (Chrome DevTools MCP, Playwright MCP).
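If you'd rather skip MCP entirely, plain Playwright for Python gets you the same `evaluate JavaScript in the page` capability. A sketch (the URL is a placeholder; requires `pip install playwright` and `playwright install chromium`):

```python
# Drive a page directly and run JavaScript in it, roughly what
# AppleScript's `do JavaScript` does.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://example.com")
    title = page.evaluate("document.title")
    print(title)
    browser.close()
```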