Computer vision is solved if you let the model use tools

https://www.spatial-reasoning.com/share/45dfaeaa-e5a1-4a8c-a8c1-44f9ff5371a4

1•qasimWani•6mo ago

Comments

qasimWani•6mo ago

i previously co-founded a synthetic data company, focused on fine-tuning diffusion models for robotics and manufacturing. the standard approach: generate better data, train smaller models, deploy. recently, reasoning models like o3, grok, and gemini began showing signs of strong spatial awareness. so i tested them on bounding box detection in complex scenes. they failed. badly.

but the reasoning trace showed impressive semantic understanding. the failure wasn’t conceptual. it came from tokenization and decoding limits. the models knew what they were seeing but couldn’t translate it into precise coordinates. (gemini 2.5 performs better because it uses an MoE with task-specific heads).

as such, i built a simple system that gives these models tools:

1. overlay a reference grid (inspired by Set-of-Mark prompting, Microsoft 2023) to ground them visually

2. crop and zoom into regions of interest

3. call external detectors like Grounding DINO when helpful
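A minimal sketch of the coordinate bookkeeping behind the first two tools, not the project's actual code. It assumes grid cells are labeled by row letter and column number (`A1`, `B3`, ...) and that a crop is upscaled by a uniform `zoom` factor before the model sees it. The names `grid_cells` and `crop_to_full` are hypothetical.

```python
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def grid_cells(width: int, height: int, rows: int, cols: int) -> Dict[str, Box]:
    """Label each grid cell (A1, A2, ..., B1, ...) with its pixel box.

    Assumes rows <= 26 so that a single letter can name each row.
    """
    cells = {}
    for r in range(rows):
        for c in range(cols):
            label = f"{chr(ord('A') + r)}{c + 1}"
            cells[label] = (
                c * width // cols,
                r * height // rows,
                (c + 1) * width // cols,
                (r + 1) * height // rows,
            )
    return cells


def crop_to_full(box_in_crop: Box, crop: Box, zoom: float) -> Box:
    """Map a box predicted on a zoomed crop back to full-image pixels."""
    crop_left, crop_top, _, _ = crop
    left, top, right, bottom = box_in_crop
    return (
        crop_left + round(left / zoom),
        crop_top + round(top / zoom),
        crop_left + round(right / zoom),
        crop_top + round(bottom / zoom),
    )


cells = grid_cells(width=1200, height=800, rows=4, cols=6)
region = cells["B3"]  # the cell the model pointed at
full_box = crop_to_full((100, 40, 300, 160), region, zoom=2.0)
print(region, full_box)
```

The point of the mapping is that the model only ever has to name a coarse cell or draw a box on a zoomed crop; the exact pixel coordinates are recovered by arithmetic outside the model.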

with only prompting, this setup enables zero-shot object detection on tasks that traditional vision models fail on. for example, detecting the barely visible YC logo on this person's jacket in a linkedin feed screenshot is only possible once you zoom in on the right regions [https://www.spatial-reasoning.com/share/45dfaeaa-e5a1-4a8c-a...]

demo here: [spatial-reasoning.com]

open-source code: [https://github.com/QasimWani/spatial-reasoning]

curious to hear thoughts. still exploring edge cases and failure modes. might write a more detailed blog if there’s interest.

qasimWani•6mo ago

another, harder example: detecting a street sign on market st in sf that only becomes findable after multiple zoom-ins [https://www.spatial-reasoning.com/share/d7bab348-3389-41c7-9...]

one interesting pattern: forcing the model to keep its reasoning chain internal (i.e., no verbose "think step-by-step") actually improves accuracy. it seems to reduce hallucinations and overcorrections. still working on a clearer theory, but shorter chains seem to preserve spatial focus better.

curious how others think tool use like this could generalize.

also open to any references on visual grounding in LMMs. feels like a strangely underexplored space.

sota_pop•6mo ago

I’ve always felt CNNs are much more natural for visual analysis. It’s funny/unfortunate that transformers work SO well that their performance CAN rival CNNs, but it takes so much more work/processing power/model size. CNNs just feel like a more ergonomic fit to the problem (to me), but my experience is rooted in studying DL from when GANs were all the rage and “Attention Is All You Need” was a brand new paper, and admittedly, I need to brush up on my ViT theory.

qasimWani•6mo ago

yeah, having that convolutional prior is definitely useful when you're dealing with a limited amount of data, because you're encoding problem structure into the model. that's why CNNs get away with being trained on fewer samples, but with a trade-off around generalization.

but i think this moment is quite different: instead of baking everything into the latent space of these models, you're letting them reason the way a human would. if i were asked to detect the street sign, i'd start by zooming into different regions and iteratively figure out what's relevant. YOLO and similar models don't do this well enough because they lack the language component, which is a must-have for complex reasoning like this example: https://www.spatial-reasoning.com/share/2d4a8827-b227-4f23-a....

Like 4o can't do this even though it most likely has the same vision encoder as o4. this is the power of reasoning.
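A rough sketch of the zoom-first search described above, under loud assumptions: the `score` callback stands in for the reasoning model's judgement of which region looks relevant, and the quadrant split and `min_size` stopping rule are illustrative choices rather than the project's method. Here the scorer is faked with overlap against a known target box so the loop can run on its own.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def quadrants(box: Box) -> List[Box]:
    """Split a box into four non-overlapping quadrants."""
    left, top, right, bottom = box
    mid_x, mid_y = (left + right) // 2, (top + bottom) // 2
    return [
        (left, top, mid_x, mid_y),
        (mid_x, top, right, mid_y),
        (left, mid_y, mid_x, bottom),
        (mid_x, mid_y, right, bottom),
    ]


def zoom_search(image_box: Box, score: Callable[[Box], float], min_size: int = 64) -> Box:
    """Repeatedly zoom into the highest-scoring quadrant until it is small."""
    box = image_box
    while box[2] - box[0] > min_size and box[3] - box[1] > min_size:
        box = max(quadrants(box), key=score)
    return box


def overlap_with(target: Box) -> Callable[[Box], float]:
    """Stand-in scorer (assumption): overlap area with a known target box."""
    def score(box: Box) -> float:
        width = min(box[2], target[2]) - max(box[0], target[0])
        height = min(box[3], target[3]) - max(box[1], target[1])
        return float(max(width, 0) * max(height, 0))
    return score


target = (710, 520, 730, 540)  # hypothetical small object in a 1024x1024 image
found = zoom_search((0, 0, 1024, 1024), overlap_with(target))
print(found)
```

Non-overlapping quadrants can cut an object that straddles a boundary, so a real system would likely use overlapping crops or let the model pick arbitrary regions.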

sota_pop•6mo ago

Isn’t this (subdividing into regions and analyzing each region within the context of the overall image) - essentially - the methodology of the YOLO algorithm?
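For context on the question above, a small sketch of how the original YOLO (v1) formulation uses its grid: the image is divided into an S x S grid in a single forward pass, and the cell containing an object's center is responsible for predicting it. The grid is fixed, and no region is re-inspected at higher resolution. The function name and the 448x448 example are illustrative.

```python
from typing import Tuple


def responsible_cell(cx: float, cy: float, width: int, height: int, s: int = 7) -> Tuple[int, int]:
    """Return (row, col) of the S x S grid cell that contains the object center."""
    col = min(int(cx / width * s), s - 1)   # clamp so cx == width stays in range
    row = min(int(cy / height * s), s - 1)  # clamp so cy == height stays in range
    return row, col


cell = responsible_cell(cx=300, cy=100, width=448, height=448)
print(cell)
```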
