We ran the agents on a Reflex port of a React demo (a small business's admin panel). The task was to find the "Smith" with the most orders, accept their pending reviews, and mark their most recent order as delivered.
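To make the task concrete, here is a minimal sketch of its three steps over in-memory records. The `Order` and `Review` shapes, the status strings, and `run_task` are hypothetical illustrations; they are not the demo's actual schema or the endpoints the API agent called.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Order:
    id: int
    customer_id: int
    placed_on: date
    status: str


@dataclass
class Review:
    id: int
    customer_id: int
    status: str


def run_task(customers, orders, reviews):
    """Apply the task in place. `customers` maps customer id -> last name (assumed shape)."""
    smith_ids = [cid for cid, last in customers.items() if last == "Smith"]

    # Step 1: the Smith with the most orders.
    target = max(smith_ids, key=lambda cid: sum(o.customer_id == cid for o in orders))

    # Step 2: accept that customer's pending reviews only.
    for review in reviews:
        if review.customer_id == target and review.status == "pending":
            review.status = "accepted"

    # Step 3: mark their most recent order as delivered.
    latest = max((o for o in orders if o.customer_id == target), key=lambda o: o.placed_on)
    latest.status = "delivered"
    return target, latest.id
```

Each step is a filter or an aggregate over structured records, which is the kind of work a handful of API calls can cover and a screenshot-driven agent has to reconstruct by clicking through pages.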
Results (medians, n=5 API / n=3 vision):
- Vision agent: 47 steps, 495k tokens, ~14 min
- API agent: 8 calls, 12k tokens, 19.7s
The vision agent failed on the abstract task and completed it only after being given a 14-step UI walkthrough. Even with the walkthrough, it made 47 round-trips, each carrying a full-page screenshot.
Vision-run variance was wide enough (853-1296s, 407k-751k tokens) that a single run isn't representative, while API runs were tightly clustered. That gap is the cost of being lazy about making an agent-friendly interface.
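For a sense of scale, the reported figures can be turned into ratios. This is only arithmetic on the numbers quoted in this post; "~14 min" is approximated as 840 s, which is an assumption, so the time ratio is rough.

```python
# Medians as reported in this post. "~14 min" is approximated as 840 s (assumption).
vision = {"steps": 47, "tokens": 495_000, "seconds": 14 * 60}
api = {"steps": 8, "tokens": 12_000, "seconds": 19.7}

# How many times more the vision agent used of each resource.
ratios = {key: vision[key] / api[key] for key in vision}

# Average token cost of a single round-trip for each agent.
tokens_per_step = {
    "vision": vision["tokens"] / vision["steps"],
    "api": api["tokens"] / api["steps"],
}

# Max/min spread across the vision runs, from the reported ranges.
vision_spread = {"seconds": 1296 / 853, "tokens": 751_000 / 407_000}

for key, value in ratios.items():
    print(f"{key}: vision used {value:.1f}x the API agent")
for agent, value in tokens_per_step.items():
    print(f"{agent}: {value:,.0f} tokens per round-trip")
for key, value in vision_spread.items():
    print(f"vision {key} spread (max/min): {value:.2f}x")
```

On these numbers a single vision round-trip costs roughly 10.5k tokens, close to the API agent's entire 12k-token run.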
The endpoints in Path B (the API agent) were auto-generated by a plugin shipped in Reflex 0.9 this week. The full methodology is here: https://reflex.dev/blog/vision-agents-vs-api-calls/