I built Sentinel after using browser-use and Stagehand on a client
project and hitting two recurring issues: flaky reliability on
multi-step flows, and token costs that ate the budget on anything
non-trivial. I suspected the root cause was architectural - both
lean on the LLM re-reading large portions of the page each step -
and tried Chrome's Accessibility Object Model (AOM) as the
observation layer instead.
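
To make the "observation layer" idea concrete, here is a minimal sketch of turning accessibility nodes into a compact text observation. This is not Sentinel's actual code: the node shape is modeled on what the Chrome DevTools Protocol returns from `Accessibility.getFullAXTree` (`nodeId`, `ignored`, `role.value`, `name.value`), and `compact_ax_nodes`, `SKIP_ROLES`, and the sample page are hypothetical names made up for illustration.

```python
from typing import Any

# Roles treated as noise for an agent. This list is an illustrative guess,
# not something taken from Sentinel.
SKIP_ROLES = {"none", "generic", "InlineTextBox"}


def compact_ax_nodes(nodes: list[dict[str, Any]]) -> list[str]:
    """Flatten CDP-style AX nodes into one short line per named, non-ignored node."""
    lines = []
    for node in nodes:
        if node.get("ignored"):
            continue  # the browser already marked this node as irrelevant
        role = node.get("role", {}).get("value", "")
        name = node.get("name", {}).get("value", "")
        if role in SKIP_ROLES or not name:
            continue  # unnamed or purely structural nodes add tokens, not signal
        lines.append(f'[{node["nodeId"]}] {role} "{name}"')
    return lines


# Hand-written sample standing in for a real page's accessibility tree.
sample_nodes = [
    {"nodeId": "1", "ignored": False, "role": {"value": "RootWebArea"}, "name": {"value": "Checkout"}},
    {"nodeId": "2", "ignored": True},
    {"nodeId": "3", "ignored": False, "role": {"value": "generic"}, "name": {"value": ""}},
    {"nodeId": "4", "ignored": False, "role": {"value": "textbox"}, "name": {"value": "Email"}},
    {"nodeId": "5", "ignored": False, "role": {"value": "button"}, "name": {"value": "Place order"}},
]

observation = "\n".join(compact_ax_nodes(sample_nodes))
print(observation)
```

The point of the sketch is only that the model sees a few short role/name lines per step rather than a large slice of the page, which is where I'd expect the token difference to come from.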
To check whether that architectural choice actually mattered, I
built a 9-task benchmark comparing Sentinel, Stagehand, and
browser-use on the same Gemini 3 Flash Preview model, with the same
prompts, the same programmatic validators, and 5 runs per task-tool
combination.
Raw per-run JSON is committed so you can recompute or challenge
every number.
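
If you want to recompute the aggregates yourself, something along these lines should be enough. The record fields (`tool`, `task`, `success`, `total_tokens`) are an assumed schema for illustration, so check the committed JSON for the real field names. The records below are invented placeholders with generic tool names, not benchmark results.

```python
import json
from statistics import mean


def success_rate(runs: list[dict], tool: str) -> tuple[int, int]:
    """Return (passed, total) over all runs of one tool."""
    outcomes = [r["success"] for r in runs if r["tool"] == tool]
    return sum(outcomes), len(outcomes)


def token_ratio(runs: list[dict], task: str, baseline: str, candidate: str) -> float:
    """How many times more tokens `baseline` used than `candidate` on one task (mean over runs)."""
    def avg_tokens(tool: str) -> float:
        return mean(r["total_tokens"] for r in runs if r["tool"] == tool and r["task"] == task)
    return avg_tokens(baseline) / avg_tokens(candidate)


# Placeholder records in the ASSUMED schema; values are made up.
runs = json.loads("""
[
  {"tool": "tool_a", "task": "login", "success": true,  "total_tokens": 1000},
  {"tool": "tool_a", "task": "login", "success": true,  "total_tokens": 1200},
  {"tool": "tool_b", "task": "login", "success": true,  "total_tokens": 3300},
  {"tool": "tool_b", "task": "login", "success": false, "total_tokens": 3300}
]
""")

passed, total = success_rate(runs, "tool_b")
print(f"tool_b: {passed}/{total} = {passed / total:.1%}")
print(f"tool_b uses {token_ratio(runs, 'login', 'tool_b', 'tool_a'):.1f}x the tokens of tool_a")

# The same percentage formatting applied to a 39-of-45 pass count.
print(f"{39 / 45:.1%}")
```

Whether you aggregate with a mean or a median per task is a choice; the sketch uses a mean only because it is the simplest thing to show.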
Headline numbers:
- Tokens: Sentinel uses 3.1x-56.9x fewer than browser-use,
1.4x-13.3x fewer than Stagehand.
- Reliability: Sentinel 100% (45/45), browser-use 100% (45/45),
Stagehand 86.7% (39/45).
- Speed: Sentinel is fastest on 5 of 9 tasks.
- The harder the task, the bigger the token gap.
Caveats up front:
- I built Sentinel - treat this as a starting point for your own
verification, not an impartial survey. README has a full
known-limitations section.
- Single model (Gemini 3 Flash Preview, which is also Stagehand's
documented recommendation).
- 9 tasks is small; raw JSON is there if you want to add tasks
or rerun on a different model.
- Each framework is used with its idiomatic API (Sentinel/Stagehand:
discrete act()/extract(); browser-use: agent-loop prompt).
Forcing them into the same call pattern would disadvantage
whichever is optimized for the other.
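
One way to picture the last caveat: the harness can stay identical across tools if each tool sits behind a thin adapter that calls it however it likes to be called, and the harness only sees the final output and the token count. This is a sketch of that shape, not the actual benchmark code. `RunResult`, `run_benchmark`, and the stub adapters are hypothetical, and the real adapters would wrap each framework's own API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    output: str        # whatever the tool extracted or ended up with
    total_tokens: int  # tokens the tool reported for this run


Adapter = Callable[[str], RunResult]  # prompt -> result, using the tool's own idiom
Validator = Callable[[str], bool]     # programmatic check on the output


def run_benchmark(
    adapters: dict[str, Adapter],
    tasks: dict[str, tuple[str, Validator]],
    runs_per_combo: int = 5,
) -> list[dict]:
    """Run every tool on every task with the same prompt and the same validator."""
    records = []
    for task_name, (prompt, validate) in tasks.items():
        for tool_name, adapter in adapters.items():
            for run_index in range(runs_per_combo):
                try:
                    result = adapter(prompt)
                    ok, tokens = validate(result.output), result.total_tokens
                except Exception:
                    ok, tokens = False, 0  # a crash counts as a failed run
                records.append({"tool": tool_name, "task": task_name,
                                "run": run_index, "success": ok, "total_tokens": tokens})
    return records


# Stub adapters standing in for real tools (hypothetical, for illustration only).
def discrete_style(prompt: str) -> RunResult:
    return RunResult(output="order-123", total_tokens=900)

def agent_loop_style(prompt: str) -> RunResult:
    return RunResult(output="order-123", total_tokens=2700)

records = run_benchmark(
    adapters={"discrete": discrete_style, "agent_loop": agent_loop_style},
    tasks={"checkout": ("Complete the checkout", lambda out: out.startswith("order-"))},
)
print(len(records), all(r["success"] for r in records))
```

The validator never looks at how the tool got there, only at the result, which is what lets the call patterns differ without changing the scoring.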
Sentinel is already in production with paying clients (all
self-hosted), which covers development costs. A managed offering is
on the table if there's real demand: you'd pay infra + model usage
at cost, no margin. Drop a comment if that would unblock you;
otherwise I'd rather not maintain hosting nobody needs.
isoldex•31m ago