Computer vision is solved if you let the model use tools

https://www.spatial-reasoning.com/share/45dfaeaa-e5a1-4a8c-a8c1-44f9ff5371a4
1•qasimWani•14h ago

Comments

qasimWani•14h ago
i previously co-founded a synthetic data company, focused on fine-tuning diffusion models for robotics and manufacturing. the standard approach: generate better data, train smaller models, deploy. recently, reasoning models like o3, grok, and gemini began showing signs of strong spatial awareness. so i tested them on bounding box detection in complex scenes. they failed. badly.

but the reasoning trace showed impressive semantic understanding. the failure wasn’t conceptual. it came from tokenization and decoding limits. the models knew what they were seeing but couldn’t translate it into precise coordinates. (gemini 2.5 performs better because it uses an MoE with task-specific heads).

so i built a simple system that gives these models tools:

1. overlay a reference grid (inspired by Set-of-Mark prompting, Microsoft 2023) to ground them visually

2. crop and zoom into regions of interest

3. call external detectors like Grounding DINO when helpful
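
a minimal sketch of how tools 1 and 2 could fit together. this is an illustration, not the project's actual code: `Box`, `grid_cells`, and `union` are hypothetical names, and the cell numbers in the example stand in for a model's answer. the idea is that the model replies with grid labels rather than raw pixel coordinates, and the labeled cells are turned back into a crop region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates: left, top, right, bottom."""
    x0: int
    y0: int
    x1: int
    y1: int


def grid_cells(width: int, height: int, rows: int, cols: int) -> dict[int, Box]:
    """Split a width x height image into rows x cols labeled cells.

    Labels run 1..rows*cols in row-major order, so a model can answer
    with a cell number instead of raw pixel coordinates.
    """
    cells = {}
    for r in range(rows):
        for c in range(cols):
            label = r * cols + c + 1
            cells[label] = Box(
                x0=c * width // cols,
                y0=r * height // rows,
                x1=(c + 1) * width // cols,
                y1=(r + 1) * height // rows,
            )
    return cells


def union(boxes: list[Box]) -> Box:
    """Smallest box covering all given cells: the region to crop and zoom into."""
    return Box(
        min(b.x0 for b in boxes),
        min(b.y0 for b in boxes),
        max(b.x1 for b in boxes),
        max(b.y1 for b in boxes),
    )


cells = grid_cells(width=1200, height=800, rows=4, cols=6)
# Hypothetical model answer: "the object spans cells 9 and 10".
region = union([cells[9], cells[10]])
print(region)  # Box(x0=400, y0=200, x1=800, y1=400)
```

the grid labels would still need to be drawn onto the image before it is sent to the model; that rendering step is left out here.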

with only prompting, this setup enables zero-shot object detection on tasks that traditional vision models fail at. for example, detecting the barely visible YC logo on this person's jacket in a linkedin feed screenshot is only possible once you zoom into the right regions [https://www.spatial-reasoning.com/share/45dfaeaa-e5a1-4a8c-a...]

demo here: [spatial-reasoning.com] open-source code: [https://github.com/QasimWani/spatial-reasoning]

curious to hear thoughts. still exploring edge cases and failure modes. might write a more detailed blog if there’s interest.

qasimWani•14h ago
another harder example: detecting a street sign on market st in sf that only becomes findable after multiple zoom-ins [https://www.spatial-reasoning.com/share/d7bab348-3389-41c7-9...]
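
a rough sketch of what "multiple zoom-ins" involves on the bookkeeping side, under stated assumptions: `iterative_zoom`, `to_full_image`, and `stub_chooser` are hypothetical names, and the stub replaces the model call with a fixed target point purely so the example runs. each zoom narrows the region, and a box found in the final upscaled crop has to be mapped back to full-image pixels.

```python
from typing import Callable, Optional, Tuple

Region = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in full-image pixels
Pick = Optional[Tuple[int, int]]            # (row, col) to zoom into, or None to stop


def iterative_zoom(width, height, choose_cell: Callable[[Region], Pick],
                   rows=2, cols=2, max_steps=4) -> Region:
    """Repeatedly narrow the region to the grid cell picked by choose_cell."""
    x0, y0, x1, y1 = 0.0, 0.0, float(width), float(height)
    for _ in range(max_steps):
        pick = choose_cell((x0, y0, x1, y1))
        if pick is None:  # chooser says the object is visible enough
            break
        row, col = pick
        cell_w, cell_h = (x1 - x0) / cols, (y1 - y0) / rows
        x0, y0 = x0 + col * cell_w, y0 + row * cell_h
        x1, y1 = x0 + cell_w, y0 + cell_h
    return (x0, y0, x1, y1)


def to_full_image(box_in_crop: Region, region: Region, crop_size) -> Region:
    """Map a box given in resized-crop pixels back to full-image pixels."""
    sx = (region[2] - region[0]) / crop_size[0]
    sy = (region[3] - region[1]) / crop_size[1]
    bx0, by0, bx1, by1 = box_in_crop
    return (region[0] + bx0 * sx, region[1] + by0 * sy,
            region[0] + bx1 * sx, region[1] + by1 * sy)


def stub_chooser(region: Region) -> Pick:
    """Stand-in for the model: zoom toward a known point until the crop is small."""
    x0, y0, x1, y1 = region
    if x1 - x0 <= 128:
        return None
    target_x, target_y = 900, 100  # made-up target location
    col = int((target_x - x0) / ((x1 - x0) / 2))
    row = int((target_y - y0) / ((y1 - y0) / 2))
    return row, col


region = iterative_zoom(1024, 1024, stub_chooser)
print(region)  # (896.0, 0.0, 1024.0, 128.0)

# Box reported inside the final crop after it was upscaled to 512x512.
box = to_full_image((128, 256, 256, 384), region, crop_size=(512, 512))
print(box)  # (928.0, 64.0, 960.0, 96.0)
```

in a real setup the chooser would be a model call that looks at the current crop; the coordinate arithmetic stays the same either way.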

one interesting pattern: forcing the model to keep its reasoning chain internal (i.e., no verbose "think step-by-step") actually improves accuracy. it seems to reduce hallucinations and overcorrections. still working on a clearer theory, but shorter chains seem to preserve spatial focus better.

curious how others think tool use like this could generalize.

also open to any references on visual grounding in LMMs. feels like a strangely underexplored space.

sota_pop•14h ago
I’ve always felt CNNs are much more natural for visual analysis. It’s funny/unfortunate that transformers work SO well that their performance CAN rival CNNs, but it takes so much more work/processing power/model size. CNNs just feel like a more ergonomic fit to the problem (to me), but my experience is rooted in studying DL from when GANs were all the rage and “Attention Is All You Need” was a brand new paper, and admittedly, I need to brush up on my ViT theory.

qasimWani•14h ago
yeah, having that convolutional prior is definitely useful when you're dealing with a limited amount of data, because you're encoding problem structure into the model. that's why CNNs get away with being trained on fewer samples, but with a trade-off around generalization.

but i think this moment is quite different: instead of baking everything into the latent space, you're letting these models reason the way a human would. if i were asked to detect the street sign, i'd start by zooming into different regions and iteratively figure out what's relevant. YOLO and other models don't do this well enough because they lack the language component, which is a must-have for complex reasoning like this, for example: https://www.spatial-reasoning.com/share/2d4a8827-b227-4f23-a....

4o, for instance, can't do this even though it most likely has the same vision encoder as o4. this is the power of reasoning.

sota_pop•14h ago
Isn’t this (subdividing into regions and analyzing each region within the context of the overall image) - essentially - the methodology of the YOLO algorithm?
