We built a small open-source benchmark to test how well vision-enabled LLMs handle pixel-level pointing on screens. Instead of complex UI screenshots, we use synthetic images with basic shapes and clean backgrounds to isolate spatial reasoning and coordinate accuracy.
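To make "coordinate accuracy" concrete, here is a minimal sketch of how a pointing answer could be scored against a synthetic shape. It assumes the target is the geometric center of the shape and the error is the Euclidean distance in pixels; the function names (`rect_center`, `pixel_error`) are hypothetical and not taken from the repo.

```python
import math

def rect_center(left, top, width, height):
    # Center of an axis-aligned rectangle given its top-left corner and size.
    return (left + width / 2, top + height / 2)

def pixel_error(predicted, target):
    # Euclidean distance in pixels between the model's answer and the target.
    return math.hypot(predicted[0] - target[0], predicted[1] - target[1])

# Hypothetical example: a 200x100 red rectangle whose top-left corner is at (100, 50).
target = rect_center(left=100, top=50, width=200, height=100)
print(target)                          # (200.0, 100.0)
print(pixel_error((203, 96), target))  # 5.0
```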

The results were surprising:

- Many top models miss by tens to hundreds of pixels on trivial tasks (e.g., the center of a purple circle or a red square).
- Some models show high run-to-run variance (different answers on the same image and prompt).
- Performance flips dramatically when the resolution or aspect ratio changes.
- Claude Sonnet and Claude Haiku are consistently near-perfect (0–1px error), while others show clear gaps.

We wrote a detailed blog post about the findings: https://autodevice.io/blog/wheres-the-pixel-part-1
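For anyone who wants to quantify the run-to-run variance and resolution sensitivity mentioned above, one possible approach is sketched below. It is an assumption about how these could be measured, not how the benchmark does it: spread is taken as the mean distance of repeated answers from their own centroid, and errors are divided by the image diagonal so that different resolutions can be compared. Both function names are hypothetical.

```python
import math
import statistics

def run_spread(points):
    # Mean distance of repeated predictions from their own centroid.
    # 0.0 means the model gave the same answer on every run.
    cx = statistics.fmean(p[0] for p in points)
    cy = statistics.fmean(p[1] for p in points)
    return statistics.fmean(math.hypot(x - cx, y - cy) for x, y in points)

def normalized_error(predicted, target, width, height):
    # Pixel error as a fraction of the image diagonal, so that results
    # from different resolutions and aspect ratios are comparable.
    error = math.hypot(predicted[0] - target[0], predicted[1] - target[1])
    return error / math.hypot(width, height)

# Two hypothetical runs on the same image and prompt.
print(run_spread([(0, 0), (6, 8)]))                  # 5.0
print(normalized_error((3, 4), (0, 0), 300, 400))    # 0.01
```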

Repo (easy to run, add tests, try new models): https://autodevice.github.io/PixelPointingBenchmark/

Curious to see how the latest vision LLMs do on this. If you run it, share your results or feedback.

Happy to discuss improvements or extensions!

#VisionLLM #LLM #Benchmark #SpatialReasoning #GUI #ComputerUse #AI
