A verification layer for browser agents: Amazon case study

https://sentienceapi.com/blog/verification-layer-amazon-case-study
27•tonyww•15h ago
A common approach to automating Amazon shopping, or similarly complex websites, is to reach for large cloud models (often vision-capable). I wanted to test the opposite assumption: can a ~3B-parameter local LLM complete the flow using only structural page data (DOM) plus deterministic assertions?

This post summarizes four runs of the same task (search → first product → add to cart → checkout on Amazon). The key comparison is Demo 0 (cloud baseline) vs Demo 3 (local autonomy); Demos 1–2 are intermediate controls.

More technical detail (architecture, code excerpts, additional log snippets):

https://www.sentienceapi.com/blog/verification-layer-amazon-...

Demo 0 vs Demo 3:

Demo 0 (cloud, GLM‑4.6 + structured snapshots)
  success: 1/1 run
  tokens: 19,956 (~43% reduction vs ~35k estimate)
  time: ~60,000ms
  cost: cloud API (varies)
  vision: not required

Demo 3 (local, DeepSeek R1 planner + Qwen ~3B executor)
  success: 7/7 steps (re-run)
  tokens: 11,114
  time: 405,740ms
  cost: $0.00 incremental (local inference)
  vision: not required

Latency note: the local stack is slower end-to-end here largely because inference runs on local hardware (Mac Studio with M4); the cloud baseline benefits from hosted inference, but has per-token API cost.

Architecture

This worked because we changed the control plane and added a verification loop.

1) Constrain what the model sees (DOM pruning). We don’t feed the entire DOM or screenshots. We collect raw elements, then run a WASM pass to produce a compact “semantic snapshot” (roles/text/geometry) and prune the rest (often on the order of ~95% of nodes).

2) Split reasoning from acting (planner vs executor).

Planner (reasoning): DeepSeek R1 (local) generates step intent + what must be true afterward.

Executor (action): Qwen ~3B (local) selects concrete DOM actions like CLICK(id) / TYPE(text).

3) Gate every step with Jest‑style verification. After each action, we assert state changes (URL changed, element exists/doesn’t exist, modal/drawer appeared). If a required assertion fails, the step fails with artifacts and bounded retries.
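To make steps 2 and 3 concrete, here is a rough sketch of a plan → act → verify loop with bounded retries. Everything in it (`Step`, `StepResult`, `run_step`, `FakeBrowser`) is a hypothetical name invented for illustration, not the Sentience SDK, and the "browser" is a toy stand-in so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical types for illustration only; not the Sentience SDK.
State = Dict[str, Any]               # e.g. {"url": "..."}
Predicate = Callable[[State], bool]


@dataclass
class Step:
    intent: str                      # what the planner wants to happen
    postcondition: Predicate         # what must be true afterward
    label: str


@dataclass
class StepResult:
    label: str
    passed: bool
    attempts: int
    artifacts: List[State] = field(default_factory=list)


def run_step(step: Step, act: Callable[[str], None],
             observe: Callable[[], State], max_retries: int = 2) -> StepResult:
    """Act, observe, assert; retry a bounded number of times, then fail."""
    artifacts: List[State] = []
    for attempt in range(1, max_retries + 2):
        act(step.intent)             # executor performs one concrete action
        state = observe()            # fresh snapshot after the action
        artifacts.append(state)      # keep evidence for debugging
        if step.postcondition(state):
            return StepResult(step.label, True, attempt, artifacts)
    return StepResult(step.label, False, max_retries + 1, artifacts)


class FakeBrowser:
    """Toy stand-in: the first click is lost, the second one lands."""

    def __init__(self) -> None:
        self.clicks = 0
        self.url = "https://example.com/search"

    def act(self, intent: str) -> None:
        self.clicks += 1
        if self.clicks >= 2:
            self.url = "https://example.com/cart"

    def observe(self) -> State:
        return {"url": self.url}


browser = FakeBrowser()
step = Step("open the cart", lambda s: "cart" in s["url"], "cart_opened")
result = run_step(step, browser.act, browser.observe)
print(result.label, result.passed, result.attempts)
```

The point of the shape is that a step can only end in one of two ways: the postcondition held, or the retry budget ran out and the collected snapshots are returned as artifacts.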

Minimal shape:

    ok = await runtime.check(
        exists("role=textbox"),
        label="search_box_visible",
        required=True,
    ).eventually(timeout_s=10.0, poll_s=0.25, max_snapshot_attempts=3)
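For readers wondering what an `.eventually(...)` call like this might do under the hood, here is a simplified, synchronous sketch of poll-until-deadline. It is an assumption about the general technique, not the SDK's implementation: the `eventually` function below is hypothetical, and it leaves out `max_snapshot_attempts` because the post does not describe its semantics.

```python
import time
from typing import Callable


def eventually(check: Callable[[], bool], timeout_s: float = 10.0,
               poll_s: float = 0.25) -> bool:
    """Re-run `check` until it passes or the deadline expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True              # condition observed: stop polling
        if time.monotonic() >= deadline:
            return False             # out of time: report failure
        time.sleep(poll_s)           # wait before taking the next look


# Toy condition that only becomes true on the third poll,
# standing in for a search box that appears after hydration.
calls = {"n": 0}


def search_box_visible() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


ok = eventually(search_box_visible, timeout_s=1.0, poll_s=0.01)
print(ok, calls["n"])
```

Polling with a deadline is what lets an assertion tolerate slow rendering without turning into an unbounded wait.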

What changed between “agents that look smart” and agents that work

Two examples from the logs:

Deterministic override to enforce “first result” intent: “Executor decision … [override] first_product_link -> CLICK(1022)”

Drawer handling that verifies and forces the correct branch: “result: PASS | add_to_cart_verified_after_drawer”

The important point is that these are not post‑hoc analytics. They are inline gates: the system either proves it made progress or it stops and recovers.
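As an illustration of the first log line, a deterministic override can be as small as the sketch below. The post does not show how the override is implemented, so this is a guess at the idea: `resolve_first_product_click` is a hypothetical helper, and it assumes the snapshot already provides product link ids in result order. Only the id 1022 comes from the log; the other ids are made up.

```python
from typing import List


def resolve_first_product_click(executor_choice: int,
                                ordered_link_ids: List[int]) -> str:
    """Force the click onto the first product link, whatever the model chose."""
    if not ordered_link_ids:
        raise ValueError("no product links in snapshot")
    first_id = ordered_link_ids[0]
    if executor_choice == first_id:
        return f"CLICK({first_id})"  # model already agreed with the rule
    # Model picked something else: the rule wins, and the log says so.
    return f"[override] first_product_link -> CLICK({first_id})"


# The executor picked the second link; the "first result" intent overrides it.
print(resolve_first_product_click(1040, [1022, 1040, 1031]))
# The executor picked the first link; no override is needed.
print(resolve_first_product_click(1022, [1022, 1040, 1031]))
```

The model still proposes an action, but where the intent is unambiguous a plain rule decides, and the decision is visible in the log.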

Takeaway

If you’re trying to make browser agents reliable, the highest‑leverage move isn’t a bigger model. It’s constraining the state space and making success/failure explicit with per-step assertions.

Reliability in agents comes from verification (assertions on structured snapshots), not just scaling model size.
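To show what "constraining the state space" can mean in practice, here is a toy version of the pruning idea from step 1. The post describes a WASM pass; this is plain Python instead, and the role list, the element dictionary layout, and the `prune` function are all assumptions made for the example.

```python
from typing import Any, Dict, List

# Assumption: roles worth keeping for an agent that clicks and types.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


def prune(raw_elements: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep visible, interactive elements; reduce each to role/text/geometry."""
    snapshot = []
    for el in raw_elements:
        box = el.get("box", {})
        visible = box.get("w", 0) > 0 and box.get("h", 0) > 0
        if el.get("role") in INTERACTIVE_ROLES and visible:
            snapshot.append({
                "id": el["id"],
                "role": el["role"],
                "text": el.get("text", "").strip()[:80],  # cap text length
                "box": box,
            })
    return snapshot


# Made-up raw elements: two useful controls among layout and hidden nodes.
raw = [
    {"id": 1, "role": "textbox", "text": " Search ", "box": {"x": 200, "y": 10, "w": 600, "h": 40}},
    {"id": 2, "role": "generic", "text": "", "box": {"x": 0, "y": 0, "w": 1200, "h": 80}},
    {"id": 3, "role": "button", "text": "Add to Cart", "box": {"x": 900, "y": 400, "w": 180, "h": 36}},
    {"id": 4, "role": "link", "text": "hidden link", "box": {"x": 0, "y": 0, "w": 0, "h": 0}},
    {"id": 5, "role": "img", "text": "", "box": {"x": 50, "y": 200, "w": 300, "h": 300}},
]

snapshot = prune(raw)
print([el["id"] for el in snapshot], f"kept {len(snapshot)} of {len(raw)}")
```

A smaller snapshot means fewer tokens per step and fewer plausible-looking wrong targets for a small executor model to choose from.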

Comments

tonyww•14h ago
A quick clarification on intent, since “browser automation” means different things to different people:

This isn’t about making scripts smarter or replacing Playwright/Selenium. The problem I’m exploring is reliability: how to make agent-driven browser execution fail deterministically and explainably instead of half-working when layouts change.

Concretely, the agent doesn’t just “click and hope”. Each step is gated by explicit post-conditions, similar to how tests assert outcomes:

Python code example:

    ready = runtime.assert_(
        all_of(url_contains("checkout"), exists("role=button")),
        "checkout_ready",
        required=True,
    )

If the condition isn’t met, the run stops with artifacts instead of drifting forward. Vision models are optional fallbacks, not the primary control signal.
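For anyone curious how composable post-conditions like these could be built, here is a minimal sketch using closures over a snapshot dictionary. The names `all_of`, `url_contains`, and `exists` mirror the snippet above, but these bodies are my own assumption rather than the SDK's code, and this `exists` only understands selectors of the form `role=<name>`.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]               # assumed shape: {"url": str, "elements": [...]}
Predicate = Callable[[State], bool]


def url_contains(fragment: str) -> Predicate:
    """True when the current URL contains `fragment`."""
    return lambda state: fragment in str(state.get("url", ""))


def exists(selector: str) -> Predicate:
    """True when some element matches a simple "key=value" selector."""
    key, _, value = selector.partition("=")

    def check(state: State) -> bool:
        return any(el.get(key) == value for el in state.get("elements", []))

    return check


def all_of(*predicates: Predicate) -> Predicate:
    """True only when every given predicate holds for the same snapshot."""
    return lambda state: all(p(state) for p in predicates)


checkout_ready = all_of(url_contains("checkout"), exists("role=button"))

on_checkout = {"url": "https://example.com/checkout", "elements": [{"role": "button"}]}
on_search = {"url": "https://example.com/s?k=mouse", "elements": [{"role": "textbox"}]}

print(checkout_ready(on_checkout), checkout_ready(on_search))
```

Because each predicate is a plain function of the snapshot, the same condition can be evaluated once, polled, or logged alongside the artifacts when it fails.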

Happy to answer questions about the design tradeoffs or where this approach falls short.

joeframbach•2h ago
Does the browser expose its accessibility tree instead of the raw DOM element tree? The accessibility tree should be enough, I mean, it's all that's needed for vision-impaired customers, and technically the AI agent _is_ a vision-impaired customer. For a fair test, try the accessibility tree.
tonyww•1h ago
The accessibility tree is definitely useful, and we do look at it. The issue we ran into is that it’s optimized for assistive consumption, not for action verification or layout reasoning on dynamic SPAs.

In practice we’ve seen cases where AX is incomplete, lags hydration, or doesn’t reflect overlays / grouping accurately. It does not support ordinality queries well. That’s why we anchor on post-rendered DOM + geometry and then verify outcomes explicitly, rather than relying on any single representation.
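As a concrete picture of an "ordinality query" over geometry, here is a small sketch that answers "the nth item in reading order" from bounding-box positions. The tuple layout, the row tolerance, and the function names are assumptions for illustration, and the coordinates are made up.

```python
from typing import List, Tuple

Box = Tuple[int, float, float]       # (element id, x, y) of the top-left corner


def reading_order(boxes: List[Box], row_tolerance: float = 10.0) -> List[int]:
    """Order ids top-to-bottom, then left-to-right within each visual row."""
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[2]):
        # Nearly equal y values are treated as the same visual row.
        if rows and abs(box[2] - rows[-1][0][2]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [b[0] for row in rows for b in sorted(row, key=lambda b: b[1])]


def nth(boxes: List[Box], n: int) -> int:
    """1-based ordinal lookup, e.g. nth(boxes, 1) is the first result."""
    return reading_order(boxes)[n - 1]


# Three cards on one slightly uneven row, plus one card on the next row.
cards = [(3, 400.0, 102.0), (1, 40.0, 100.0), (2, 220.0, 98.0), (4, 40.0, 300.0)]
print(reading_order(cards), nth(cards, 1))
```

Answering "first" or "third" this way depends on rendered positions, which is the kind of information a tree-only representation may not carry.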

ewuhic•2h ago
Slop shit discussing slop shit.
asyncadventure•1h ago
Great point about the accessibility tree @joeframbach. The "vision impaired customer" analogy is spot on - if an interface works for screen readers, it should work for AI agents.

What I find most compelling about this approach is the explicit verification layer. Too many browser automation projects fail silently or drift into unexpected states. The Jest-style assertions create a clear contract: either the step definitively succeeded or it didn't, with artifacts for debugging.

This reminds me of property-based testing - instead of hoping the agent "gets it right," you're encoding what success actually looks like.

tonyww•57m ago
Thanks — that’s exactly our motivation. The key shift for us was moving from “did the agent probably do the right thing?” to “can we prove the state we expected actually holds.”

The property-based testing analogy is a good one — once you make success explicit, failures become actionable instead of mysterious.

vilecoyote•42m ago
I took a look at the quickstart with the aim of running this locally and found that an API key is needed for the importance ranking.

What exactly is importance ranking? Does the verification layer still exist without this ranking?
