Even with:

- accessibility snapshots
- element references (E1, E2)
- semantic locators
- session isolation

they still feel fundamentally fragile.
LLMs reason over DOM trees step by step. It works, but barely: small UI changes break everything.
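For contrast, this is roughly what the brittle status quo looks like when scripted directly, sketched here with Playwright (the URL and selectors are made up):

```ts
import { chromium } from "playwright";

// The brittle status quo: every step depends on markup details
// the site never promised to keep stable.
async function addToCartViaDom() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://shop.example.com/item/123"); // hypothetical URL

  // Guessed selector: breaks if the class name, nesting, or text changes.
  await page.locator("div.product-actions > button.add-to-cart").click();

  // "Reading state" means re-parsing the DOM and hoping the structure held.
  const cartCount = await page.locator("#cart-badge").innerText();
  console.log("Cart badge now reads:", cartCount);

  await browser.close();
}

addToCartViaDom();
```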
It feels like we’re missing an abstraction layer.
What if, instead of agents operating on markup, websites exposed structured “interaction surfaces”: something closer to tools or world models than to raw DOM nodes?
Instead of:

- parse DOM
- guess selector
- click element

it would be:

- request action
- receive structured state
- operate over stable semantic primitives
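Concretely, such a surface might look something like the TypeScript sketch below. Everything in it is hypothetical (the `InteractionSurface` interface, the action names, the mock site); it only illustrates the shape of "request action, receive structured state", not a real protocol.

```ts
// Hypothetical sketch only: names like InteractionSurface and
// "add_to_cart" are invented for illustration; no such standard exists.

// A stable semantic primitive the site commits to, independent of markup.
interface SemanticAction {
  name: string;        // e.g. "add_to_cart"
  description: string; // natural-language summary for the agent
}

// Structured state returned after every action: no DOM parsing needed.
interface ActionResult {
  ok: boolean;
  state: Record<string, unknown>;     // e.g. { cart: { items: 3 } }
  availableActions: SemanticAction[]; // what the agent can do next
  error?: string;
}

// What a site could expose (say, at a well-known endpoint).
interface InteractionSurface {
  describe(): Promise<SemanticAction[]>;
  invoke(action: string, params: Record<string, unknown>): Promise<ActionResult>;
}

// Minimal in-memory implementation so the sketch runs end to end.
const actions: SemanticAction[] = [
  { name: "add_to_cart", description: "Add an item to the shopping cart" },
];

const mockSurface: InteractionSurface = {
  async describe() {
    return actions;
  },
  async invoke(action, params) {
    if (action !== "add_to_cart") {
      return { ok: false, state: {}, availableActions: actions, error: `unknown action: ${action}` };
    }
    return {
      ok: true,
      state: { cart: { items: 1, lastSku: params.sku } },
      availableActions: actions,
    };
  },
};

// Agent side: request an action, receive structured state.
async function demo(surface: InteractionSurface) {
  const offered = await surface.describe();
  console.log("Site offers:", offered.map((a) => a.name));

  const result = await surface.invoke("add_to_cart", { sku: "ABC-123" });
  console.log(result.ok ? result.state : result.error);
}

demo(mockSurface);
```

The point of the shape: the site, not the agent, owns the mapping from semantic actions to markup, so the DOM can change freely without breaking agents.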
Is this already being explored somewhere beyond MCP experiments? Or is everyone still stuck in DOM-land?
Curious whether others see the same limitation, and whether a middleware “site-agent” layer makes sense.
Would love to hear your thoughts
andsoitis•1h ago
For instance, see “Computer use” in the recent Sonnet 4.6 announcement: https://www.anthropic.com/news/claude-sonnet-4-6
AS_YC•1h ago