We just shipped a Chat AI Agent that lives inside every device session. It sees the live screen and answers questions like "what's the XPath for that button?" or "give me the UIAutomator2 selector for row 2" — with code output in Java, Python, Swift, Kotlin, or WebDriverIO.
The implementation: we grab a frame from the WebRTC stream at query time and pass it to a vision LLM with a structured prompt. The model returns locators in all applicable formats (XPath, CSS, UIAutomator2, XCUITest, Accessibility ID). We parse and render with syntax highlighting.
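The pipeline can be sketched roughly like this: encode the captured frame and build a vision-LLM request that pins the reply to a JSON locator map. This is a minimal sketch, not the shipped code — the OpenAI-style message shape, `build_locator_prompt`, and `LOCATOR_FORMATS` are assumptions; adapt to your provider's API.

```python
import base64
import json

# Formats mentioned in the post; the model is told to omit inapplicable ones.
LOCATOR_FORMATS = ["xpath", "css", "uiautomator2", "xcuitest", "accessibility_id"]

def build_locator_prompt(frame_png: bytes, question: str, platform: str) -> list:
    """Build a vision-LLM message list (hypothetical OpenAI-style content
    parts). The system text constrains the reply to a JSON object so the
    response is machine-parseable rather than prose."""
    system = (
        "You are a UI locator assistant. Given a screenshot of a "
        f"{platform} app and a question, reply with ONLY a JSON object: "
        '{"locators": {<format>: <selector string>}} using formats from '
        + json.dumps(LOCATOR_FORMATS)
        + ". Omit formats that do not apply to this platform."
    )
    # Inline the frame as a base64 data URL, the common shape for vision APIs.
    image_url = "data:image/png;base64," + base64.b64encode(frame_png).decode()
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]},
    ]
```

Grabbing the frame at query time (instead of streaming every frame to the model) keeps token cost to one image per question.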
The main challenge was prompt engineering: getting the model to emit clean, parseable output rather than prose with code embedded in it.
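Even with a JSON-only instruction, models sometimes wrap the payload in a markdown fence or lead-in prose, so the parse step needs a fallback. A hedged sketch of that defensive parsing (the `parse_locators` helper and the `{"locators": ...}` shape are assumptions, not the shipped code):

```python
import json
import re

def parse_locators(raw: str) -> dict:
    """Extract a {"locators": {...}} JSON object from a model reply,
    tolerating markdown fences and surrounding prose."""
    text = raw.strip()
    # Strip a ```json ... ``` fence if the whole reply is fenced.
    fence = re.match(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    try:
        return json.loads(text)["locators"]
    except (json.JSONDecodeError, KeyError):
        # Fall back: grab the widest {...} span and try that.
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match:
            return json.loads(match.group(0)).get("locators", {})
        raise ValueError("no locator JSON found in model output")
```

In practice, tightening the prompt reduces how often the fallback path fires, but it never quite reaches zero, so the parser has to stay forgiving.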
Happy to answer questions about the implementation.