I built this tool to solve the "flakiness" problem in UI testing. Existing AI agents often struggle with precise interactions, while traditional frameworks (Selenium/Playwright) break whenever the DOM changes.
The Approach: Instead of relying on hard-coded selectors or pure computer vision, I’m using a multi-agent system powered by multimodal LLMs. We pass both the screenshot (pixels) and the browser context (network requests, console logs, etc.) to the model, which lets the agent "see" the UI the way a user does and map semantic intent ("Click the Signup button") to precise coordinates, even when the layout shifts.
The goal is to mimic natural user behavior rather than follow a predefined script: the agent can do exploratory testing and catch visual bugs that code-based assertions miss.
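In the same spirit, a visual assertion can be phrased as a question to the model rather than a check on a specific selector. Again just a sketch under the same assumptions, reusing the hypothetical `ask_model` helper from the snippet above:

```python
# Sketch of an LLM-side "visual assertion": ask the model whether anything on
# the page looks broken, instead of asserting on a particular element's state.
# `ask_model` is the hypothetical helper defined in the previous snippet.


def visual_check(page, ask_model) -> list[str]:
    """Return the visual defects the model reports, or an empty list if none."""
    prompt = (
        "List any visual defects a user would notice on this page "
        "(overlapping text, clipped or off-screen buttons, unreadable "
        "contrast, broken layout). One issue per line, or the word 'none'."
    )
    report = ask_model(prompt, page.screenshot())
    return [] if report.strip().lower() == "none" else report.splitlines()


issues = visual_check(page, ask_model)
assert not issues, f"Visual defects found: {issues}"
```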
I’d love feedback on the implementation or to discuss the challenges of using LLMs for deterministic testing.