The Problem: Building reliable web agents is painful. You essentially have two bad choices:
Raw DOM: Dumping document.body.innerHTML is cheap/fast but overwhelms the context window (100k+ tokens) and lacks spatial context (agents try to click hidden or off-screen elements).
Vision Models (GPT-4o): Sending screenshots is robust but slow (3-10s latency) and expensive (~$0.01/step). Worse, they often hallucinate coordinates, missing buttons by 10 pixels.
The Solution (Semantic Geometry): Sentience is a "Visual Cortex" for agents. It sits between the browser and your LLM, turning noisy websites into clean, ranked, coordinate-aware JSON.
How it works (The Stack):
Client (WASM): A Chrome Extension injects a Rust/WASM module that prunes 95% of the DOM (scripts, tracking pixels, invisible wrappers) directly in the browser process. It handles Shadow DOM, nested iframes ("Frame Stitching"), and computed styles (visibility/z-index) in <50ms.
Gateway (Rust/Axum): The pruned tree is sent to a Rust gateway that applies heuristic importance scoring with simple visual cues (e.g., is_primary).
Brain (ONNX): A server-side ML layer (running ms-marco-MiniLM via ort) semantically re-ranks the elements based on the user’s goal (e.g., "Search for shoes").
Result: Your agent gets the Top 50 most relevant interactable elements, each with exact (x, y) coordinates, an importance score, and visual cues, so the LLM can decide what to do next (see the sketch below).
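To make that concrete, here is a minimal sketch of the kind of per-element record the agent receives. The field names are illustrative only, not the actual Sentience schema:

    # One element out of the ~50 returned for a page; field names are illustrative.
    element = {
        "id": 17,
        "role": "button",
        "text": "Add to cart",
        "bbox": {"x": 912, "y": 344, "width": 140, "height": 40},  # page coordinates
        "importance": 0.93,  # heuristic score, refined by the ONNX re-ranker
        "cues": {"is_primary": True, "is_visible": True, "in_viewport": True},
        "frame": "main",  # which stitched frame the element came from
    }
    # The agent only has to say "click element 17"; the runtime resolves that
    # to the center of the bounding box, e.g. (982, 364).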
Performance:
Cost: ~$0.001 per step (vs. $0.01+ for Vision)
Latency: ~400ms (vs. 5s+ for Vision)
Payload: ~1400 tokens (vs. 100k for Raw HTML)
Developer Experience (The "Cool" Stuff): I hated debugging text logs, so I built Sentience Studio, a "Time-Travel Debugger." It records every step (DOM snapshot + screenshot) into a .jsonl trace. You can scrub through the timeline like a video editor to see exactly what the agent saw vs. what it hallucinated.
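For illustration, one step in such a trace might look roughly like this (the keys are made up for the example, not the real trace format):

    import json

    # One JSONL record per agent step: what the agent tried, plus pointers to
    # the DOM snapshot and screenshot captured at that moment.
    step = {
        "step": 12,
        "goal": "Search for shoes",
        "action": {"type": "click", "element_id": 17},
        "dom_snapshot": "snapshots/step_012.json",  # pruned, ranked element list
        "screenshot": "screenshots/step_012.png",
        "latency_ms": 412,
    }
    with open("trace.jsonl", "a") as f:
        f.write(json.dumps(step) + "\n")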
Links:
Docs & SDK: https://www.sentienceapi.com/docs
GitHub (Python SDK): https://github.com/SentienceAPI/sentience-python
GitHub (TypeScript SDK): https://github.com/SentienceAPI/sentience-ts
Studio Demo: https://www.sentienceapi.com/docs/studio
Build Web Agent: https://www.sentienceapi.com/docs/sdk/agent-quick-start
Screenshots with importance labels (gold stars): https://sentience-screenshots.sfo3.cdn.digitaloceanspaces.co... 2026-01-06 at 7.19.41 AM.png
I’m handling the backend in Rust and the SDKs in Python/TypeScript. The project is now in beta; I’d love feedback on the architecture or the ranking logic!
tonyww•1d ago
In practice it performed worse than expected. Once you overlay dense bounding boxes and numeric IDs, the model has to solve a brittle symbol-grounding problem (“which number corresponds to intent?”). On real pages (Amazon, Stripe docs, etc.) this led to more retries and mis-clicks, not fewer.
What worked better for me was moving that grounding step out of the model entirely and giving it a bounded set of executable actions (role + visibility + geometry), then letting the LLM choose which action, not where to click.
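A rough sketch of what that looks like, with made-up names rather than any real library's API:

    # Enumerate a bounded set of executable actions from role + visibility +
    # geometry, then ask the LLM to pick an action id instead of coordinates.
    actions = [
        {"id": "a1", "verb": "click", "role": "button",  "label": "Add to cart"},
        {"id": "a2", "verb": "fill",  "role": "textbox", "label": "Search"},
        {"id": "a3", "verb": "click", "role": "link",    "label": "Checkout"},
    ]
    # Prompt: "Goal: buy running shoes. Pick exactly one action id from the list."
    # The model answers "a2"; the runtime owns the geometry and performs the click.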
Curious if others have seen similar behavior with vision-based agents, especially beyond toy demos.