I’ve been experimenting with the latest "computer use" models and tools (like Gemini 3 Flash, Qwen3-VL-Plus, and Browser Use), and while they are impressive, I hit a wall with reliability in production use cases.
The main issue I found is context. When we give agents simple natural language prompts (e.g., "download the invoice"), they often lack the nuance to handle edge cases or specific UI quirks. They try to be "creative" when they should be deterministic.
I built AI Mime to solve this by shifting from "prompting" to "demonstrating." It’s an open-source macOS tool that lets you record a workflow, parameterize it, and replay it using computer-use agents.
How it works:
Record: It captures native macOS events (mouse, keyboard, window states) to create a ground-truth recording of the task (see the first sketch below).
Refine (the interesting part): It uses an LLM to parse that raw recording into parameterized instructions. Instead of a static macro, you get manageable subtasks where you can define inputs/variables (see the second sketch below). This constrains the agent to a specific "happy path" while still allowing it to handle dynamic elements.
Replay: The agent executes the subtasks through the computer-use interface, with significantly higher success rates because it has "seen" the exact steps required (see the replay sketch below).
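
For anyone curious what native capture involves: here's a minimal sketch of a global event tap using pyobjc's Quartz bindings. To be clear, this is my illustration of the general technique, not AI Mime's actual recorder, and the log fields are made up.

```python
# Minimal sketch of native macOS event capture with a Quartz event tap,
# via the pyobjc Quartz bindings. Illustrative only -- not AI Mime's
# recorder; the log fields below are invented. Note that event taps
# require the Accessibility permission in System Settings.
import time
import Quartz

recording = []  # ground-truth event log

def on_event(proxy, event_type, event, refcon):
    point = Quartz.CGEventGetLocation(event)
    entry = {"t": time.time(), "type": event_type, "x": point.x, "y": point.y}
    if event_type == Quartz.kCGEventKeyDown:
        entry["keycode"] = Quartz.CGEventGetIntegerValueField(
            event, Quartz.kCGKeyboardEventKeycode)
    recording.append(entry)
    return event  # listen-only: pass the event through untouched

mask = (Quartz.CGEventMaskBit(Quartz.kCGEventLeftMouseDown)
        | Quartz.CGEventMaskBit(Quartz.kCGEventKeyDown))
tap = Quartz.CGEventTapCreate(
    Quartz.kCGSessionEventTap,           # tap the whole login session
    Quartz.kCGHeadInsertEventTap,
    Quartz.kCGEventTapOptionListenOnly,  # observe, never modify
    mask, on_event, None)
source = Quartz.CFMachPortCreateRunLoopSource(None, tap, 0)
Quartz.CFRunLoopAddSource(Quartz.CFRunLoopGetCurrent(), source,
                          Quartz.kCFRunLoopCommonModes)
Quartz.CGEventTapEnable(tap, True)
Quartz.CFRunLoopRun()  # blocks; stop with CFRunLoopStop when done
```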
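The refine step's output is essentially a structured, parameterized plan. Something like the hypothetical shape below, where the schema ("inputs", "subtasks", "{placeholder}" slots) is illustrative rather than AI Mime's real format:

```python
# Hypothetical shape of a refined workflow distilled from the raw log.
workflow = {
    "name": "download_invoice",
    "inputs": ["invoice_id"],  # variables the user fills in per run
    "subtasks": [
        {"step": 1, "action": "focus_window", "target": "Safari"},
        {"step": 2, "action": "click", "target": "invoice search field"},
        {"step": 3, "action": "type", "text": "{invoice_id}"},
        {"step": 4, "action": "click", "target": "Download PDF button"},
    ],
}

def render(subtasks, params):
    """Substitute {placeholder} variables into text fields for one run."""
    rendered = []
    for task in subtasks:
        task = dict(task)
        if "text" in task:
            task["text"] = task["text"].format(**params)
        rendered.append(task)
    return rendered

steps = render(workflow["subtasks"], {"invoice_id": "INV-2024-001"})
```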
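Replay then becomes a loop that hands the agent one narrow instruction per subtask and verifies it before moving on. The agent interface here (an act() call returning success) is a placeholder, not any real SDK:

```python
# Sketch of constrained replay over the rendered steps above. The
# `agent` object is a stand-in for whatever computer-use API you drive.
class StepFailed(Exception):
    pass

def replay(agent, steps, max_retries=2):
    """Execute subtasks one at a time, retrying each before giving up."""
    for step in steps:
        # One narrow instruction per subtask keeps the model on the
        # demonstrated happy path instead of letting it improvise.
        prompt = ("Perform exactly this step, then stop: "
                  f"{step['action']} {step.get('target') or step.get('text', '')}")
        for _ in range(max_retries + 1):
            if agent.act(prompt):  # assumed: True only on verified success
                break
        else:
            raise StepFailed(f"step {step['step']} did not complete")
```

The point of this structure is that the model never re-plans the whole task; the worst it can do is fail a single, observable step.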
The goal is to make these agents observable and repeatable enough for actual RPA (robotic process automation) work.
The repo is here: https://github.com/prakhar1114/ai_mime
I’d love to hear your thoughts on the approach or how you are currently handling state/reliability with computer-use models.