Show HN: Semantic geometry visual grounding for AI web agents (Amazon demo)

1•tonyww•2h ago
Hi HN,

I’m a solo founder working on SentienceAPI, a perception & execution layer that helps LLM agents act reliably on real websites.

LLMs are good at planning steps, but they fail a lot when actually interacting with the web. Vision-only agents are expensive and unstable, and DOM-based automation breaks easily on modern pages with overlays, dynamic layouts, and lots of noise.

My approach is semantic geometry-based visual grounding.

Instead of giving the model raw HTML (huge context) or a screenshot (imprecise) and asking it to guess, the API first reduces a webpage into a small, grounded action space made only of elements that are actually visible and interactable. Each element includes geometry plus lightweight visual cues, so the model can decide what to do without guessing.

I built a reference app called MotionDocs on top of this. The demo below shows the system navigating Amazon Best Sellers, opening a product, and clicking “Add to cart” using grounded coordinates (no scripted clicks).

Demo video (Add to Cart): [https://youtu.be/1DlIeHvhOg4](https://youtu.be/1DlIeHvhOg4)

How the agent sees the page (map mode wireframe): [https://sentience-screenshots.sfo3.cdn.digitaloceanspaces.co...](https://sentience-screenshots.sfo3.cdn.digitaloceanspaces.co...)

This wireframe shows the reduced action space surfaced to the LLM. Each box corresponds to a visible, interactable element.

Code excerpt (simplified):

```
from sentienceapi_sdk import SentienceApiClient
from motiondocs import generate_video

video = generate_video(
    url="https://www.amazon.com/gp/bestsellers/",
    instructions="Open a product and add it to cart",
    sentience_client=SentienceApiClient(api_key="your-api-key-here"),
)

video.save("demo.mp4")
```

How it works (high level):

The execution layer treats the browser as a black box and exposes three modes:

* Map: identify interactable elements with geometry and visual cues
* Visual: align geometry with screenshots for grounding
* Read: extract clean, LLM-ready text

The key insight is visual cues, especially a simple is_primary signal. Humans don’t read every pixel — we scan for visual hierarchy. Encoding that directly lets the agent prioritize the right actions without processing raw pixels or noisy DOM.
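A minimal sketch of how an is_primary signal could be used to prioritize actions before they reach the model. The field names (bbox, visual_cues, is_primary) follow the simplified JSON response posted in the comments; the sample data and the helpers rank_elements and to_prompt_lines are hypothetical illustrations, not part of the SentienceAPI SDK.

```python
from typing import Any

# Hypothetical sample data; field names mirror the simplified JSON response.
ELEMENTS: list[dict[str, Any]] = [
    {"id": 43, "role": "link", "text": "Privacy Policy",
     "bbox": {"x": 100, "y": 1200, "w": 80, "h": 20},
     "visual_cues": {"cursor": "pointer", "is_primary": False}},
    {"id": 42, "role": "button", "text": "Add to Cart",
     "bbox": {"x": 935, "y": 529, "w": 200, "h": 50},
     "visual_cues": {"cursor": "pointer", "is_primary": True,
                     "color_name": "yellow"}},
]


def is_primary(element: dict[str, Any]) -> bool:
    """Return the is_primary cue, treating a missing cue as False."""
    return bool(element.get("visual_cues", {}).get("is_primary", False))


def rank_elements(elements: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Primary elements first, then top-to-bottom, left-to-right."""
    return sorted(
        elements,
        key=lambda e: (not is_primary(e), e["bbox"]["y"], e["bbox"]["x"]),
    )


def to_prompt_lines(elements: list[dict[str, Any]]) -> list[str]:
    """Render the ranked action space as short lines for an LLM prompt."""
    lines = []
    for e in rank_elements(elements):
        tag = " (primary)" if is_primary(e) else ""
        lines.append(f'[{e["id"]}] {e["role"]} "{e["text"]}"{tag}')
    return lines


if __name__ == "__main__":
    print("\n".join(to_prompt_lines(ELEMENTS)))
```

Sorting on the cue rather than filtering by it keeps secondary elements available to the model while still surfacing the likely target first.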

Why this matters:

* smaller action space → fewer hallucinations
* deterministic geometry → reproducible execution
* cheaper than vision-only approaches

TL;DR: I’m building a semantic geometry grounding layer that turns web pages into a compact, visually grounded action space for LLM agents. It gives the model a cheat sheet instead of asking it to solve a vision puzzle.

This is early work, not launched yet. I’d love feedback or skepticism, especially from people building agents, RPA, QA automation, or dev tools.

— Tony W

Comments

tonyww•2h ago
Example JSON Response (Simplified):

```

[
  {
    "id": 42,
    "role": "button",
    "text": "Add to Cart",
    "bbox": { "x": 935, "y": 529, "w": 200, "h": 50 },
    "visual_cues": {
      "cursor": "pointer",
      "is_primary": true,
      "color_name": "yellow"
    }
  },
  {
    "id": 43,
    "role": "link",
    "text": "Privacy Policy",
    "bbox": { "x": 100, "y": 1200, "w": 80, "h": 20 },
    "visual_cues": {
      "cursor": "pointer",
      "is_primary": false
    }
  }
]

```
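As a rough illustration of how a response like this could be turned into a grounded click, the sketch below parses the JSON, finds an element by its text, and computes the center of its bounding box. It assumes bbox.x and bbox.y are the top-left corner in pixels, which the example does not state; find_by_text and click_point are hypothetical helpers, not SDK functions.

```python
import json
from typing import Any, Optional

# The simplified example response, embedded as a string for illustration.
RESPONSE = """
[
  {"id": 42, "role": "button", "text": "Add to Cart",
   "bbox": {"x": 935, "y": 529, "w": 200, "h": 50},
   "visual_cues": {"cursor": "pointer", "is_primary": true,
                   "color_name": "yellow"}},
  {"id": 43, "role": "link", "text": "Privacy Policy",
   "bbox": {"x": 100, "y": 1200, "w": 80, "h": 20},
   "visual_cues": {"cursor": "pointer", "is_primary": false}}
]
"""


def find_by_text(elements: list[dict[str, Any]],
                 text: str) -> Optional[dict[str, Any]]:
    """Return the first element whose text matches, ignoring case."""
    wanted = text.strip().lower()
    for element in elements:
        if element.get("text", "").strip().lower() == wanted:
            return element
    return None


def click_point(bbox: dict[str, int]) -> tuple[int, int]:
    """Center of the box, assuming (x, y) is its top-left corner."""
    return (bbox["x"] + bbox["w"] // 2, bbox["y"] + bbox["h"] // 2)


elements = json.loads(RESPONSE)
target = find_by_text(elements, "add to cart")

if __name__ == "__main__":
    if target is not None:
        print(target["id"], click_point(target["bbox"]))
```

Clicking the box center rather than its corner leaves some margin if the layout shifts by a few pixels between the map call and the click.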

This prototype builds on several open-source libraries:

* MoviePy – video composition and rendering
* Pillow (PIL) – image processing and overlays

The demo app (MotionDocs) uses the public SentienceAPI SDK, generated from OpenAPI, which is the same interface used by the system internally.
