What it does:
- During a session, ask "What's the XPath for this button?" → get a ready-to-use locator from the current screen
- Ask "Write an Appium test for this flow" → get test code generated from the live accessibility tree
- Type "tap the login button" in natural language → it executes on the real device
- Ask "Why is my test failing on this element?" → get context from both vision and the accessibility snapshot
The agent uses a combination of screenshot vision and the device's live accessibility tree. The key insight is that most mobile test failures are locator issues or UI state issues — and an agent with full context of what's on screen right now can solve those immediately, without the engineer leaving the session to use a separate inspector tool.
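To make the locator-generation idea concrete, here is a minimal sketch (the node shape and preference order are assumptions, not the project's actual logic): given an accessibility-tree node, build an Android XPath locator, preferring resource-id, then content-desc, then visible text.

```python
# Build an XPath for an a11y-tree node, most-stable attribute first.
def xpath_for(node: dict) -> str:
    cls = node.get("class", "*")
    if node.get("resource-id"):
        return f'//{cls}[@resource-id="{node["resource-id"]}"]'
    if node.get("content-desc"):
        return f'//{cls}[@content-desc="{node["content-desc"]}"]'
    if node.get("text"):
        return f'//{cls}[@text="{node["text"]}"]'
    return f"//{cls}"  # class alone; likely not unique

node = {"class": "android.widget.Button", "text": "Log in",
        "resource-id": "com.example:id/login"}
login_xpath = xpath_for(node)
```

The agent's advantage over a heuristic like this is the vision side: when two nodes have identical attributes, the screenshot disambiguates which one the engineer means.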
Technical bits:
- Accessibility tree is captured per frame during the session
- Agent has both visual context (screenshot) and structured context (a11y tree) simultaneously
- Supports Android (UIAutomator2/XPath/UiSelector) and iOS (XCUITest/Appium)
- Session context is also exposed via an API for CI/CD post-failure reports
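The per-frame record and the CI/CD report could look roughly like this (a sketch under assumed names; the real schema may differ): each frame pairs a screenshot with the a11y tree captured at the same moment, and a failure report serializes the last few frames alongside the error.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Frame:
    ts_ms: int           # capture time, ms since session start
    screenshot_b64: str  # PNG screenshot, base64-encoded
    a11y_xml: str        # a11y-tree dump (e.g. Appium page_source)

def failure_report(frames: list[Frame], error: str) -> str:
    """Serialize the final frames plus the error for a CI/CD report."""
    return json.dumps({
        "error": error,
        "frames": [asdict(f) for f in frames[-3:]],  # keep the last 3
    })

frames = [
    Frame(0, "iVBOR...", "<hierarchy/>"),
    Frame(16, "iVBOR...", "<hierarchy><node text='Log in'/></hierarchy>"),
]
report = failure_report(frames, "NoSuchElementError: //Button[@text='Log in']")
```

Keeping both the screenshot and the XML per frame is what lets a post-failure report answer "what was actually on screen" without re-running the test.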
Happy to discuss the architecture, especially the tradeoffs between using vision alone vs. vision + a11y tree for locator generation.