The results were surprising:
- Many top models miss by tens to hundreds of pixels on trivial tasks (e.g., the center of a purple circle or a red square).
- High run-to-run variance in some models (different answers on the same image/prompt).
- Performance flips dramatically with resolution or aspect-ratio changes.
- Claude Sonnet and Claude Haiku are consistently near-perfect (0–1px error), while others show clear gaps.

We wrote a detailed blog post about the findings: https://autodevice.io/blog/wheres-the-pixel-part-1
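For context, the pixel-error numbers above are just Euclidean distance between the model's predicted point and the ground-truth point. A minimal sketch (the function name is illustrative, not the repo's actual API):

```python
import math

def pixel_error(predicted: tuple[float, float], target: tuple[float, float]) -> float:
    """Euclidean distance in pixels between a predicted point and ground truth."""
    dx = predicted[0] - target[0]
    dy = predicted[1] - target[1]
    return math.hypot(dx, dy)

# A model asked for a shape centered at (240, 180) that answers (300, 150)
# is off by about 67 pixels:
print(round(pixel_error((300, 150), (240, 180)), 2))
```

A "near-perfect" model in the results above keeps this distance at 0–1px across runs and resolutions.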
Repo (easy to run, add tests, try new models): https://autodevice.github.io/PixelPointingBenchmark/
Curious to see how the latest vision LLMs do on this. If you run it, share your results or feedback.
Happy to discuss improvements or extensions!
#VisionLLM #LLM #Benchmark #SpatialReasoning #GUI #ComputerUse #AI