A verification layer for browser agents: Amazon case study

https://sentienceapi.com/blog/verification-layer-amazon-case-study

56•tonyww•1w ago

A common approach to automating Amazon shopping or similar complex websites is to reach for large cloud models (often vision-capable). I wanted to test a contradiction: can a ~3B parameter local LLM model complete the flow using only structural page data (DOM) plus deterministic assertions?

This post summarizes four runs of the same task (search → first product → add to cart → checkout on Amazon). The key comparison is Demo 0 (cloud baseline) vs Demo 3 (local autonomy); Demos 1–2 are intermediate controls.

More technical detail (architecture, code excerpts, additional log snippets):

https://www.sentienceapi.com/blog/verification-layer-amazon-...

Demo 0 vs Demo 3:

Demo 0 (cloud, GLM‑4.6 + structured snapshots) success: 1/1 run tokens: 19,956 (~43% reduction vs ~35k estimate) time: ~60,000ms cost: cloud API (varies) vision: not required

Demo 3 (local, DeepSeek R1 planner + Qwen ~3B executor) success: 7/7 steps (re-run) tokens: 11,114 time: 405,740ms cost: $0.00 incremental (local inference) vision: not required

Latency note: the local stack is slower end-to-end here largely because inference runs on local hardware (Mac Studio with M4); the cloud baseline benefits from hosted inference, but has per-token API cost.

Architecture

This worked because we changed the control plane and added a verification loop.

1) Constrain what the model sees (DOM pruning). We don’t feed the entire DOM or screenshots. We collect raw elements, then run a WASM pass to produce a compact “semantic snapshot” (roles/text/geometry) and prune the rest (often on the order of ~95% of nodes).

2) Split reasoning from acting (planner vs executor).

Planner (reasoning): DeepSeek R1 (local) generates step intent + what must be true afterward. Executor (action): Qwen ~3B (local) selects concrete DOM actions like CLICK(id) / TYPE(text). 3) Gate every step with Jest‑style verification. After each action, we assert state changes (URL changed, element exists/doesn’t exist, modal/drawer appeared). If a required assertion fails, the step fails with artifacts and bounded retries.

Minimal shape:

ok = await runtime.check( exists("role=textbox"), label="search_box_visible", required=True, ).eventually(timeout_s=10.0, poll_s=0.25, max_snapshot_attempts=3)

What changed between “agents that look smart” and agents that work Two examples from the logs:

Deterministic override to enforce “first result” intent: “Executor decision … [override] first_product_link -> CLICK(1022)”

Drawer handling that verifies and forces the correct branch: “result: PASS | add_to_cart_verified_after_drawer”

The important point is that these are not post‑hoc analytics. They are inline gates: the system either proves it made progress or it stops and recovers.

Takeaway If you’re trying to make browser agents reliable, the highest‑leverage move isn’t a bigger model. It’s constraining the state space and making success/failure explicit with per-step assertions.

Reliability in agents comes from verification (assertions on structured snapshots), not just scaling model size.

Comments

tonyww•1w ago

A quick clarification on intent, since “browser automation” means different things to different people:

This isn’t about making scripts smarter or replacing Playwright/Selenium. The problem I’m exploring is reliability: how to make agent-driven browser execution fail deterministically and explainably instead of half-working when layouts change.

Concretely, the agent doesn’t just “click and hope”. Each step is gated by explicit post-conditions, similar to how tests assert outcomes:

---- ## Python Code Example:

ready = runtime.assert_( all_of(url_contains("checkout"), exists("role=button")), "checkout_ready", required=True )

----

If the condition isn’t met, the run stops with artifacts instead of drifting forward. Vision models are optional fallbacks, not the primary control signal.

Happy to answer questions about the design tradeoffs or where this approach falls short

joeframbach•1w ago

Does the browser expose its accessibility tree instead of the raw dom element tree? The accessibility tree should be enough, I mean, it's all that's needed for vision impaired customers, and technically the ai agent _is_ a vision impaired customer. For a fair usage, try the accessibility tree.

tonyww•1w ago

The accessibility tree is definitely useful, and we do look at it. The issue we ran into is that it’s optimized for assistive consumption, not for action verification or layout reasoning on dynamic SPAs.

In practice we’ve seen cases where AX is incomplete, lags hydration, or doesn’t reflect overlays / grouping accurately. It does not support ordinality queries well. That’s why we anchor on post-rendered DOM + geometry and then verify outcomes explicitly, rather than relying on any single representation.

vilecoyote•1w ago

I took a look at the quickstart with aim of running this locally and found that an API key is needed for the importance ranking.

What exactly is importance ranking? Does the verification layer still exists without this ranking?

tonyww•1w ago

Importance ranking is just a heuristic pass that scores/prioritizes elements (size, visibility, role, state) so the snapshot stays small and focused. It’s deterministic, not ML.

The verification layer absolutely still exists without it — assertions, predicates, retries, and artifacts all work locally. The API-backed ranking just improves pruning quality on very dense pages, but it’s not required for correctness.

You can set use_api = False in the SnapshotOptions to avoid using the api

Akranazon•1w ago

It is interesting subject matter, I am working on something similar. But the descriptions are quite terse. Maybe I just failed to gleam:

* When you "run a WASM pass", how is that generated? Do you use an agent to do the pruning step, or is it deterministic?

* Where do the "deterministic overrides" come from? I assume they are generated by the verifier agent?

tonyww•1w ago

The WASM pass is fully deterministic: it’s just code running in the page to extract and prune post-rendered elements (roles, geometry, visibility, layout, etc), no agent involved in the chrome extension .

The “deterministic overrides” aren’t generated by a verifier agent either; they’re runtime rules that kick in when assertions or ordinality constraints are explicit (e.g. “first result”). The verifier just checks outcomes — it doesn’t invent actions. Because the nature of ai agents is non-deterministic, which we don’t want to introduce to the verification layer (predicate only).

Akranazon•1w ago

> they’re runtime rules that kick in when assertions or ordinality constraints are explicit

So there a pre-defined list of rules - is it choosing which checks to care about from the set, or is there also a predefined binding between the task and the test?

If it's the former, then you have to ensure that the checks are sufficiently generic that there's a useful test for the given situation. Is an AI doing the choosing, over which of the checks to run?

If it's the ladder, I would assume that writing the tests would be the bottleneck, writing a test can be as flaky/time-consuming as implementing the actions by hand.

tonyww•1w ago

It’s mostly the former: there’s a small set of generic checks/primitives, and we choose which ones to apply per step.

The binding between “task/step” and “what to verify” can come from either:

the user (explicit assertions), or the planner/executor proposing a post-condition (e.g. “after clicking checkout, URL contains /checkout and a checkout button exists”).

But the verifier itself is not an AI, by design it’s predicate-only

wewtyflakes•1w ago

I have found that a hybrid viewport screenshot + textual 'semantic snapshot' approach leads to the best outcomes, though sometimes text-only can be fine if the underlying page is not made of a complete mess of frameworks that would otherwise confuse normal click handlers, etc.

I think using a logical diff to do pass/fail checking is clever, though I wonder if there are failure modes there that may confuse things, such as verifying highly dynamic webpages that change their content even without active user interactions.

tonyww•1w ago

Totally agree - hybrid approaches can work well, especially on messy pages. We’ve seen the same tradeoff.

On the verification side though, dynamic pages are exactly the reason why we scope assertions narrowly (specific predicates, bounded retries using eventually() function) instead of diffing the whole page. If the expected condition can’t be proven within that window, we fail fast rather than guessing.

augusteo•1w ago

The shift from "click and hope" to explicit post-conditions is the right framing.

We've been building agent-based automation and the reliability problem is brutal. An agent can be 95% accurate on each step, but chain ten steps together and you're at 60% success rate. That's not usable.

Curious about the failure modes though. What happens when the verification itself is wrong? Like, the cart shows updated on screen but the verification layer checks a stale element?

tonyww•1w ago

Absolutely agree on the compounding error point - that’s exactly what pushed us toward verification.

On “verification wrong”: we try hard to keep predicates grounded and re-evaluated, not “check a cached handle”. Assertions do re-snapshot / re-query during each retry, and we scope them to signals that should change (URL, existence/state of an element, text/value).

If the page is flaky/stale, the assertion just won’t prove the condition within the retry window and we fail with artifacts such as frames of clip (if ffmpeg available) rather than claiming success.

There are still edge cases (virtualized DOM, optimistic UI, async updates), but in those cases the goal is the same: make the failure explicit and debuggable with artifacts and time-travel traces, not silently drift.

Selkirk•1w ago

So ... Test Driven Development?

1. Planner (Write a failing test or tests) 2. Executor (Generate a solution) 3. Verifier (Until the tests no longer fail) 4. Repeat

tonyww•1w ago

Yeah, that’s a pretty good analogy.

The main difference is that the “tests” are predicates over live browser state and are often proposed alongside the plan on the fly, not written upfront by a developer. But conceptually it’s very close: make the expected outcome explicit, try an action, verify, and only move forward if the condition actually holds.

yencabulator•1w ago

What happens when the site changes and both your "DOM pruning" and "required assertions" are outdated?

Queueing Theory v2: DORA metrics, queue-of-queues, chi-alpha-beta-sigma notation

Show HN: Hibana – choreography-first protocol safety for Rust

Haniri: A live autonomous world where AI agents survive or collapse

GPT-5.3-Codex System Card [pdf]

Atlas: Manage your database schema as code

Geist Pixel

Show HN: MCP to get latest dependency package and tool versions

The better you get at something, the harder it becomes to do

Show HN: WP Float – Archive WordPress blogs to free static hosting

Show HN: I Hacked My Family's Meal Planning with an App

Sony BMG copy protection rootkit scandal

The Future of Systems

NASA now allowing astronauts to bring their smartphones on space missions

Claude Code Is the Inflection Point

Show HN: MicroClaw – Agentic AI Assistant for Telegram, Built in Rust

Show HN: Omni-BLAS – 4x faster matrix multiplication via Monte Carlo sampling

The AI-Ready Software Developer: Conclusion – Same Game, Different Dice

AI Agent Automates Google Stock Analysis from Financial Reports

Voxtral Realtime 4B Pure C Implementation

I Was Trapped in Chinese Mafia Crypto Slavery [video]

U.S. CBP Reported Employee Arrests (FY2020 – FYTD)

Show HN: I built a free UCP checker – see if AI agents can find your store

Show HN: SVGV – A Real-Time Vector Video Format for Budget Hardware

Study of 150 developers shows AI generated code no harder to maintain long term

Spotify now requires premium accounts for developer mode API access

When Albert Einstein Moved to Princeton

Agents.md as a Dark Signal

System time, clocks, and their syncing in macOS

McCLIM and 7GUIs – Part 1: The Counter

So whats the next word, then? Almost-no-math intro to transformer models

Queueing Theory v2: DORA metrics, queue-of-queues, chi-alpha-beta-sigma notation

Show HN: Hibana – choreography-first protocol safety for Rust

Haniri: A live autonomous world where AI agents survive or collapse

GPT-5.3-Codex System Card [pdf]

Atlas: Manage your database schema as code

Geist Pixel

Show HN: MCP to get latest dependency package and tool versions

The better you get at something, the harder it becomes to do

Show HN: WP Float – Archive WordPress blogs to free static hosting

Show HN: I Hacked My Family's Meal Planning with an App

Sony BMG copy protection rootkit scandal

The Future of Systems

NASA now allowing astronauts to bring their smartphones on space missions

Claude Code Is the Inflection Point

Show HN: MicroClaw – Agentic AI Assistant for Telegram, Built in Rust

Show HN: Omni-BLAS – 4x faster matrix multiplication via Monte Carlo sampling

The AI-Ready Software Developer: Conclusion – Same Game, Different Dice

AI Agent Automates Google Stock Analysis from Financial Reports

Voxtral Realtime 4B Pure C Implementation

I Was Trapped in Chinese Mafia Crypto Slavery [video]

U.S. CBP Reported Employee Arrests (FY2020 – FYTD)

Show HN: I built a free UCP checker – see if AI agents can find your store

Show HN: SVGV – A Real-Time Vector Video Format for Budget Hardware

Study of 150 developers shows AI generated code no harder to maintain long term

Spotify now requires premium accounts for developer mode API access

When Albert Einstein Moved to Princeton

Agents.md as a Dark Signal

System time, clocks, and their syncing in macOS

McCLIM and 7GUIs – Part 1: The Counter

So whats the next word, then? Almost-no-math intro to transformer models

A verification layer for browser agents: Amazon case study

Comments