A verification layer for browser agents: Amazon case study

https://sentienceapi.com/blog/verification-layer-amazon-case-study
1•tonyww•1h ago
A common approach to automating Amazon shopping, or similarly complex websites, is to reach for large cloud models (often vision-capable). I wanted to test the opposite premise: can a local LLM with ~3B parameters complete the flow using only structural page data (DOM) plus deterministic assertions?

This post summarizes four runs of the same task (search → first product → add to cart → checkout on Amazon). The key comparison is Demo 0 (cloud baseline) vs Demo 3 (local autonomy); Demos 1–2 are intermediate controls.

More technical detail (architecture, code excerpts, additional log snippets):

https://www.sentienceapi.com/blog/verification-layer-amazon-...

Demo 0 vs Demo 3:

Demo 0 (cloud, GLM‑4.6 + structured snapshots)
success: 1/1 run
tokens: 19,956 (~43% reduction vs ~35k estimate)
time: ~60,000ms
cost: cloud API (varies)
vision: not required

Demo 3 (local, DeepSeek R1 planner + Qwen ~3B executor)
success: 7/7 steps (re-run)
tokens: 11,114
time: 405,740ms
cost: $0.00 incremental (local inference)
vision: not required

Latency note: the local stack is slower end-to-end here largely because inference runs on local hardware (Mac Studio with M4); the cloud baseline benefits from hosted inference, but has per-token API cost.
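The ratios implied by these figures can be checked with a few lines of arithmetic. The inputs are the numbers reported above; the ~35k estimate and the ~60,000ms timing are approximate in the source, so the derived ratios are approximate as well.

```python
# Figures reported above; the 35k estimate and 60,000 ms are approximate in the source.
demo0_tokens = 19_956
demo0_estimate = 35_000
demo3_tokens = 11_114
demo0_ms = 60_000
demo3_ms = 405_740

# Fraction of tokens saved by Demo 0 relative to the ~35k estimate.
reduction_vs_estimate = 1 - demo0_tokens / demo0_estimate
# Fraction of tokens saved by Demo 3 relative to Demo 0.
demo3_vs_demo0_tokens = 1 - demo3_tokens / demo0_tokens
# How many times longer Demo 3 took end-to-end.
slowdown = demo3_ms / demo0_ms

print(f"Demo 0 vs estimate: {reduction_vs_estimate:.0%} fewer tokens")
print(f"Demo 3 vs Demo 0:   {demo3_vs_demo0_tokens:.0%} fewer tokens")
print(f"Demo 3 wall time:   {slowdown:.1f}x Demo 0")
```

So the local run used roughly 44% fewer tokens than the cloud baseline while taking a bit under 7x as long.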

Architecture

This worked because we changed the control plane and added a verification loop.

1) Constrain what the model sees (DOM pruning). We don’t feed the entire DOM or screenshots. We collect raw elements, then run a WASM pass to produce a compact “semantic snapshot” (roles/text/geometry) and prune the rest (often on the order of ~95% of nodes).
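A rough sketch of what such a pruning pass might look like, written in plain Python as a stand-in for the WASM pass. The `Element` fields, the `INTERACTIVE_ROLES` set, and the filtering rules are all assumptions for illustration; the actual criteria are not described here.

```python
from dataclasses import dataclass

# Hypothetical set of roles worth keeping; the real pass may use different criteria.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


@dataclass(frozen=True)
class Element:
    id: int
    role: str
    text: str
    x: float
    y: float
    width: float
    height: float


def prune(elements: list[Element], max_text: int = 80) -> list[dict]:
    """Keep visible, interactive elements as compact role/text/geometry records."""
    snapshot = []
    for el in elements:
        if el.role not in INTERACTIVE_ROLES:
            continue  # drop layout and decorative nodes
        if el.width <= 0 or el.height <= 0:
            continue  # drop zero-area nodes, which cannot be clicked
        snapshot.append({
            "id": el.id,
            "role": el.role,
            "text": el.text.strip()[:max_text],
            "bbox": (el.x, el.y, el.width, el.height),
        })
    return snapshot


raw = [
    Element(1, "div", "", 0, 0, 1280, 4000),
    Element(2, "textbox", "Search Amazon", 200, 10, 600, 40),
    Element(3, "button", "Go", 810, 10, 40, 40),
    Element(4, "link", "hidden tracking link", 0, 0, 0, 0),
]
snapshot = prune(raw)
print(snapshot)
```

The model then only ever sees the short list of records, each addressable by `id`.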

2) Split reasoning from acting (planner vs executor).

Planner (reasoning): DeepSeek R1 (local) generates step intent + what must be true afterward.
Executor (action): Qwen ~3B (local) selects concrete DOM actions like CLICK(id) / TYPE(text).

3) Gate every step with Jest‑style verification. After each action, we assert state changes (URL changed, element exists/doesn’t exist, modal/drawer appeared). If a required assertion fails, the step fails with artifacts and bounded retries.

Minimal shape:

  ok = await runtime.check(
      exists("role=textbox"),
      label="search_box_visible",
      required=True,
  ).eventually(timeout_s=10.0, poll_s=0.25, max_snapshot_attempts=3)
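To make the polling behavior concrete, here is a minimal standalone sketch of an "eventually" helper. It is not the runtime's implementation: the function below is hypothetical, takes a plain callable instead of a snapshot predicate, and does not model `max_snapshot_attempts`.

```python
import asyncio
import time
from typing import Callable


async def eventually(
    predicate: Callable[[], bool],
    timeout_s: float = 10.0,
    poll_s: float = 0.25,
) -> bool:
    """Poll predicate until it returns True or the timeout would be exceeded."""
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() + poll_s > deadline:
            return False  # another sleep would overrun the deadline
        await asyncio.sleep(poll_s)


# Simulated page: the search box becomes visible on the third poll.
calls = {"n": 0}


def search_box_visible() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


ok = asyncio.run(eventually(search_box_visible, timeout_s=1.0, poll_s=0.01))
never = asyncio.run(eventually(lambda: False, timeout_s=0.05, poll_s=0.01))
print(ok, never)
```

The point of the bounded timeout is that a missing element yields an explicit failure rather than an indefinite wait.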

What changed between “agents that look smart” and agents that work

Two examples from the logs:

Deterministic override to enforce “first result” intent: “Executor decision … [override] first_product_link -> CLICK(1022)”

Drawer handling that verifies and forces the correct branch: “result: PASS | add_to_cart_verified_after_drawer”
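As an illustration of the first example, a deterministic override could look roughly like the sketch below. The function name, the `is_product` flag, and the rule that "first" means top-most then left-most are assumptions; only the log format is taken from the excerpt above.

```python
def first_product_override(
    intent: str, proposed_id: int, snapshot: list[dict]
) -> tuple[int, bool]:
    """Return (element_id, overridden); only 'first_product_link' is enforced."""
    if intent != "first_product_link":
        return proposed_id, False
    products = [e for e in snapshot if e["role"] == "link" and e.get("is_product")]
    if not products:
        return proposed_id, False  # nothing to enforce, keep the executor's choice
    # Assumed definition of "first": top-most on the page, then left-most.
    first = min(products, key=lambda e: (e["y"], e["x"]))
    return first["id"], first["id"] != proposed_id


snapshot = [
    {"id": 2040, "role": "link", "is_product": True, "x": 300, "y": 900},
    {"id": 1022, "role": "link", "is_product": True, "x": 300, "y": 420},
    {"id": 77, "role": "link", "is_product": False, "x": 10, "y": 5},
]

# The executor proposed a lower result; the override swaps in the first one.
chosen, overridden = first_product_override("first_product_link", 2040, snapshot)
if overridden:
    print(f"[override] first_product_link -> CLICK({chosen})")
else:
    print(f"CLICK({chosen})")
```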

The important point is that these are not post‑hoc analytics. They are inline gates: the system either proves it made progress or it stops and recovers.
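A minimal sketch of an inline gate with bounded retries, assuming each step is a pair of callables (an action and a post-condition). `StepResult`, `run_step`, and `run_plan` are hypothetical names, and the string artifacts stand in for whatever the real system captures on failure.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class StepResult:
    label: str
    passed: bool
    attempts: int
    artifacts: list[str] = field(default_factory=list)


def run_step(
    label: str,
    act: Callable[[], None],
    verify: Callable[[], bool],
    max_retries: int = 2,
) -> StepResult:
    """Run act(), then verify(); retry up to max_retries times on failure."""
    artifacts: list[str] = []
    for attempt in range(1, max_retries + 2):
        act()
        if verify():
            return StepResult(label, True, attempt, artifacts)
        artifacts.append(f"{label}: assertion failed on attempt {attempt}")
    return StepResult(label, False, max_retries + 1, artifacts)


def run_plan(steps) -> list[StepResult]:
    """Execute steps in order and stop at the first step that fails its gate."""
    results = []
    for label, act, verify in steps:
        result = run_step(label, act, verify)
        results.append(result)
        if not result.passed:
            break  # stop instead of drifting forward
    return results


# Simulated page state: the drawer only opens on the second click.
state = {"clicks": 0, "drawer_open": False}


def click_add_to_cart() -> None:
    state["clicks"] += 1
    state["drawer_open"] = state["clicks"] >= 2


results = run_plan([
    ("add_to_cart_verified_after_drawer", click_add_to_cart, lambda: state["drawer_open"]),
    ("checkout_ready", lambda: None, lambda: False),
    ("never_reached", lambda: None, lambda: True),
])
for r in results:
    status = "PASS" if r.passed else "FAIL"
    print(f"result: {status} | {r.label} (attempts={r.attempts})")
```

In this toy run the first step recovers on retry, the second exhausts its retries, and the third never executes.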

Takeaway

If you’re trying to make browser agents reliable, the highest‑leverage move isn’t a bigger model. It’s constraining the state space and making success/failure explicit with per-step assertions.

Reliability in agents comes from verification (assertions on structured snapshots), not just scaling model size.

Comments

tonyww•54m ago
A quick clarification on intent, since “browser automation” means different things to different people:

This isn’t about making scripts smarter or replacing Playwright/Selenium. The problem I’m exploring is reliability: how to make agent-driven browser execution fail deterministically and explainably instead of half-working when layouts change.

Concretely, the agent doesn’t just “click and hope”. Each step is gated by explicit post-conditions, similar to how tests assert outcomes:

----
Python code example:

  ready = runtime.assert_(
      all_of(url_contains("checkout"), exists("role=button")),
      "checkout_ready",
      required=True,
  )

----

If the condition isn’t met, the run stops with artifacts instead of drifting forward. Vision models are optional fallbacks, not the primary control signal.
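For readers who want to see how composable post-conditions like these could be built, here is a self-contained sketch. The names mirror the example above, but the implementations, the `PageState` type, and the `AssertionFailed` exception are guesses for illustration, not the library's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PageState:
    url: str
    elements: list[dict]


Predicate = Callable[[PageState], bool]


def url_contains(fragment: str) -> Predicate:
    return lambda page: fragment in page.url


def exists(selector: str) -> Predicate:
    """Supports only the 'key=value' form, e.g. 'role=button'."""
    key, _, value = selector.partition("=")
    return lambda page: any(el.get(key) == value for el in page.elements)


def all_of(*predicates: Predicate) -> Predicate:
    return lambda page: all(p(page) for p in predicates)


class AssertionFailed(Exception):
    """Raised when a required post-condition does not hold."""


def assert_(page: PageState, predicate: Predicate, label: str, required: bool = True) -> bool:
    ok = predicate(page)
    if not ok and required:
        raise AssertionFailed(label)  # stop the run instead of continuing
    return ok


checkout = PageState("https://example.com/checkout", [{"role": "button", "text": "Place order"}])
cart = PageState("https://example.com/cart", [{"role": "button", "text": "Proceed"}])
condition = all_of(url_contains("checkout"), exists("role=button"))

ready = assert_(checkout, condition, "checkout_ready")
try:
    assert_(cart, condition, "checkout_ready")
    stopped = False
except AssertionFailed as exc:
    stopped = True
    print(f"stopped at: {exc}")
```

Because each predicate is a plain function of page state, a failure can be reported by label without any model call.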

Happy to answer questions about the design tradeoffs or where this approach falls short.

Health Insurers in Shock After Medicare Holds Line on 2027 Payments

https://www.wsj.com/health/healthcare/shock-and-dismay-among-health-insurers-after-medicare-holds...
1•petethomas•2m ago•0 comments

SQL Injection Cheat Sheet

https://www.invicti.com/blog/web-security/sql-injection-cheat-sheet
1•behnamoh•4m ago•0 comments

Peter H. Duesberg, 89, Renowned Biologist Turned HIV Denialist, Dies

https://www.nytimes.com/2026/01/27/science/peter-duesberg-dead.html
1•toomanyrichies•8m ago•1 comments

Modern Law of Leaky Abstractions

https://codecube.net/2026/1/modern-law-leaky-abstractions/
1•CodeCube•12m ago•0 comments

Ask HN: Why do people not like VimScript

1•cirnovsky•13m ago•0 comments

Google to pay $68M over allegations its voice assistant eavesdropped on users

https://www.cbsnews.com/news/google-voice-assistant-lawsuit-settlement-68-million/
1•iamnothere•14m ago•0 comments

Show HN: My AI tracks Polymarket whales with guardrails so it won't bankrupt me

https://predictor-dashboard.vercel.app
1•JackDavis720•15m ago•0 comments

Android's full desktop interface leaks: New status bar, Chrome Extensions

https://9to5google.com/2026/01/27/android-desktop-leak/
1•thunderbong•18m ago•0 comments

Amazon Discontinuing One Palm?

1•sshillo•19m ago•0 comments

Ups retires its fleet of MD-11 cargo aircraft

https://www.pbs.org/newshour/nation/ups-retires-its-fleet-of-md-11-cargo-aircraft-involved-in-dea...
2•canucker2016•28m ago•1 comments

Rye pollen's cancer-fighting structure revealed for first time

https://phys.org/news/2026-01-rye-pollen-cancer-revealed.html
1•PaulHoule•29m ago•0 comments

Book: "Beading with Algorithms: Cellular Automata in Peyote Stitch."

https://mathstodon.xyz/@gwenbeads/115968206227675487
1•sohkamyung•30m ago•0 comments

Where can I find startups looking for fractional product leads?

1•stulogy•31m ago•0 comments

Who Contributed to PostgreSQL Development in 2025?

http://rhaas.blogspot.com/2026/01/who-contributed-to-postgresql.html
1•pabs3•35m ago•0 comments

Ask HN: In agent/automation incidents, what slows recovery?

1•paulrekai•39m ago•0 comments

The Librarians Film

https://thelibrariansfilm.com/
1•JumpCrisscross•40m ago•0 comments

'A militia that kills': uproar in Italy over ICE security role in Italy

https://www.theguardian.com/us-news/2026/jan/27/italy-ice-security-role-winter-olympics
6•KnuthIsGod•40m ago•1 comments

Blur any element on webpage for safer demos, screenshots, and screen sharing

https://github.com/KD-MM2/BlurShot
1•kaotd•43m ago•0 comments

Pretend to work: China's novel solution to youth unemployment

https://www.theaustralian.com.au/world/social-pressure-in-china-drives-the-jobless-to-fake-it-til...
1•Anon84•44m ago•1 comments

He Leaked Secrets of a Southeast Asian Scam Compound. He Had to Get Out Alive

https://www.wired.com/story/he-leaked-the-secrets-southeast-asian-scam-compound-then-had-to-get-o...
1•YeGoblynQueenne•45m ago•0 comments

Writing a browser with half a developer and ELIZA in 1 hours, 76 lines of C

https://www.hgreer.com/QuoteBrowserUnquote/
2•QuadmasterXLII•47m ago•1 comments

Measuring US workers' capacity to adapt to AI-driven job displacement

https://www.brookings.edu/articles/measuring-us-workers-capacity-to-adapt-to-ai-driven-job-displa...
3•cebert•48m ago•1 comments

Alex Pretti broke rib in violent confrontation with ICE days before he was shot

https://www.dailymail.co.uk/news/article-15502789/alex-pretti-federal-agents-shot-dead-minneapoli...
4•Bender•48m ago•3 comments

Nvidia's New Voice AI – low latency

https://www.youtube.com/watch?v=n_m0fqp8xwQ
2•mdani•49m ago•0 comments

Pope Leo makes plea for men to stop talking to fake online girlfriends

https://www.dailymail.co.uk/sciencetech/article-15502247/Pope-Leo-affectionate-chatbots-AI.html
3•Bender•50m ago•0 comments

List of stories set in a future now in the past

https://en.wikipedia.org/wiki/List_of_stories_set_in_a_future_now_in_the_past
2•Jugurtha•50m ago•0 comments

The official source for MDN Web Docs content

https://github.com/mdn/content
1•imwally•51m ago•0 comments

Coffee pods urgently recalled over health risk posed to 120M Americans

https://www.dailymail.co.uk/health/article-15502769/keurig-mccafe-coffee-pods-decaf-recall-caffei...
1•Bender•52m ago•0 comments

Show HN: Infinijest, video scrolling experiment no login

1•hnthrowawaste•53m ago•0 comments

Agents Need a Map

https://www.intent-systems.com/learn/intent-layer
2•contextty•53m ago•1 comments