frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Open in hackernews

Show HN: Semantic geometry visual grounding for AI web agents (Amazon demo)

2•tonyww•1mo ago
Hi HN,

I’m a solo founder working on SentienceAPI, a perception & execution layer that helps LLM agents act reliably on real websites.

LLMs are good at planning steps, but they fail a lot when actually interacting with the web. Vision-only agents are expensive and unstable, and DOM-based automation breaks easily on modern pages with overlays, dynamic layouts, and lots of noise.

My approach is semantic geometry-based visual grounding.

Instead of giving the model raw HTML (huge context) or a screenshot (imprecise) and asking it to guess, the API first reduces a webpage into a small, grounded action space made only of elements that are actually visible and interactable. Each element includes geometry plus lightweight visual cues, so the model can decide what to do without guessing.

I built a reference app called MotionDocs on top of this. The demo below shows the system navigating Amazon Best Sellers, opening a product, and clicking “Add to cart” using grounded coordinates (no scripted clicks).

Demo video (Add to Cart): [https://youtu.be/1DlIeHvhOg4](https://youtu.be/1DlIeHvhOg4)

How the agent sees the page (map mode wireframe): [https://sentience-screenshots.sfo3.cdn.digitaloceanspaces.co...](https://sentience-screenshots.sfo3.cdn.digitaloceanspaces.co...)

This wireframe shows the reduced action space surfaced to the LLM. Each box corresponds to a visible, interactable element.

Code excerpt (simplified):

``` from sentienceapi_sdk import SentienceApiClient from motiondocs import generate_video

video = generate_video( url="https://www.amazon.com/gp/bestsellers/", instructions="Open a product and add it to cart", sentience_client=SentienceApiClient(api_key="your-api-key-here") )

video.save("demo.mp4") ```

How it works (high level):

The execution layer treats the browser as a black box and exposes three modes:

* Map: identify interactable elements with geometry and visual cues * Visual: align geometry with screenshots for grounding * Read: extract clean, LLM-ready text

The key insight is visual cues, especially a simple is_primary signal. Humans don’t read every pixel — we scan for visual hierarchy. Encoding that directly lets the agent prioritize the right actions without processing raw pixels or noisy DOM.

Why this matters:

* smaller action space → fewer hallucinations * deterministic geometry → reproducible execution * cheaper than vision-only approaches

TL;DR: I’m building a semantic geometry grounding layer that turns web pages into a compact, visually grounded action space for LLM agents. It gives the model a cheat sheet instead of asking it to solve a vision puzzle.

This is early work, not launched yet. I’d love feedback or skepticism, especially from people building agents, RPA, QA automation, or dev tools.

— Tony W

Comments

tonyww•1mo ago
Example JSON Response (Simplified):

```

[ { "id": 42, "role": "button", "text": "Add to Cart", "bbox": { "x": 935, "y": 529, "w": 200, "h": 50 }, "visual_cues": { "cursor": "pointer", "is_primary": true, "color_name": "yellow" } }, { "id": 43, "role": "link", "text": "Privacy Policy", "bbox": { "x": 100, "y": 1200, "w": 80, "h": 20 }, "visual_cues": { "cursor": "pointer", "is_primary": false } } ]

```

This prototype builds on several open-source libraries:

MoviePy – video composition and rendering Pillow (PIL) – image processing and overlays

The demo app (MotionDocs) uses the public SentienceAPI SDK, generated from OpenAPI, which is the same interface used by the system internally.

Effects of Zepbound on Stool Quality

https://twitter.com/ScottHickle/status/2020150085296775300
1•aloukissas•3m ago•0 comments

Show HN: Seedance 2.0 – The Most Powerful AI Video Generator

https://seedance.ai/
1•bigbromaker•6m ago•0 comments

Ask HN: Do we need "metadata in source code" syntax that LLMs will never delete?

1•andrewstuart•12m ago•1 comments

Pentagon cutting ties w/ "woke" Harvard, ending military training & fellowships

https://www.cbsnews.com/news/pentagon-says-its-cutting-ties-with-woke-harvard-discontinuing-milit...
2•alephnerd•15m ago•1 comments

Can Quantum-Mechanical Description of Physical Reality Be Considered Complete? [pdf]

https://cds.cern.ch/record/405662/files/PhysRev.47.777.pdf
1•northlondoner•15m ago•1 comments

Kessler Syndrome Has Started [video]

https://www.tiktok.com/@cjtrowbridge/video/7602634355160206623
1•pbradv•18m ago•0 comments

Complex Heterodynes Explained

https://tomverbeure.github.io/2026/02/07/Complex-Heterodyne.html
3•hasheddan•18m ago•0 comments

EVs Are a Failed Experiment

https://spectator.org/evs-are-a-failed-experiment/
2•ArtemZ•30m ago•4 comments

MemAlign: Building Better LLM Judges from Human Feedback with Scalable Memory

https://www.databricks.com/blog/memalign-building-better-llm-judges-human-feedback-scalable-memory
1•superchink•30m ago•0 comments

CCC (Claude's C Compiler) on Compiler Explorer

https://godbolt.org/z/asjc13sa6
2•LiamPowell•32m ago•0 comments

Homeland Security Spying on Reddit Users

https://www.kenklippenstein.com/p/homeland-security-spies-on-reddit
3•duxup•35m ago•0 comments

Actors with Tokio (2021)

https://ryhl.io/blog/actors-with-tokio/
1•vinhnx•36m ago•0 comments

Can graph neural networks for biology realistically run on edge devices?

https://doi.org/10.21203/rs.3.rs-8645211/v1
1•swapinvidya•48m ago•1 comments

Deeper into the shareing of one air conditioner for 2 rooms

1•ozzysnaps•50m ago•0 comments

Weatherman introduces fruit-based authentication system to combat deep fakes

https://www.youtube.com/watch?v=5HVbZwJ9gPE
3•savrajsingh•51m ago•0 comments

Why Embedded Models Must Hallucinate: A Boundary Theory (RCC)

http://www.effacermonexistence.com/rcc-hn-1-1
1•formerOpenAI•53m ago•2 comments

A Curated List of ML System Design Case Studies

https://github.com/Engineer1999/A-Curated-List-of-ML-System-Design-Case-Studies
3•tejonutella•57m ago•0 comments

Pony Alpha: New free 200K context model for coding, reasoning and roleplay

https://ponyalpha.pro
1•qzcanoe•1h ago•1 comments

Show HN: Tunbot – Discord bot for temporary Cloudflare tunnels behind CGNAT

https://github.com/Goofygiraffe06/tunbot
2•g1raffe•1h ago•0 comments

Open Problems in Mechanistic Interpretability

https://arxiv.org/abs/2501.16496
2•vinhnx•1h ago•0 comments

Bye Bye Humanity: The Potential AMOC Collapse

https://thatjoescott.com/2026/02/03/bye-bye-humanity-the-potential-amoc-collapse/
3•rolph•1h ago•0 comments

Dexter: Claude-Code-Style Agent for Financial Statements and Valuation

https://github.com/virattt/dexter
1•Lwrless•1h ago•0 comments

Digital Iris [video]

https://www.youtube.com/watch?v=Kg_2MAgS_pE
1•vermilingua•1h ago•0 comments

Essential CDN: The CDN that lets you do more than JavaScript

https://essentialcdn.fluidity.workers.dev/
1•telui•1h ago•1 comments

They Hijacked Our Tech [video]

https://www.youtube.com/watch?v=-nJM5HvnT5k
2•cedel2k1•1h ago•0 comments

Vouch

https://twitter.com/mitchellh/status/2020252149117313349
40•chwtutha•1h ago•6 comments

HRL Labs in Malibu laying off 1/3 of their workforce

https://www.dailynews.com/2026/02/06/hrl-labs-cuts-376-jobs-in-malibu-after-losing-government-work/
4•osnium123•1h ago•1 comments

Show HN: High-performance bidirectional list for React, React Native, and Vue

https://suhaotian.github.io/broad-infinite-list/
2•jeremy_su•1h ago•0 comments

Show HN: I built a Mac screen recorder Recap.Studio

https://recap.studio/
1•fx31xo•1h ago•1 comments

Ask HN: Codex 5.3 broke toolcalls? Opus 4.6 ignores instructions?

1•kachapopopow•1h ago•0 comments