Today's agents either scrape sites (no consent, no structure) or depend on the site shipping a separate API (expensive, and it never covers the long tail). The web's original protocols assumed a human looking at a screen. That assumption is breaking.
We wrote a whitepaper mapping the full protocol landscape - Cloudflare's Pay Per Crawl and Web Bot Auth (built on RFC 9421 HTTP Message Signatures), MCP, A2A, x402, llms.txt - and categorizing five distinct agent architectures (text-based, CUA/screenshot, DOM-based, API-calling, hybrid). Each needs different discovery, execution, and identity mechanisms. We think MCP, A2A, and execution protocols are complementary layers, not competitors. The paper draws parallels to TCP/HTTP design decisions.
Rover is our attempt at the execution layer. It's a DOM-native SDK the site owner installs. The Agent Task Protocol is one HTTP endpoint: POST /v1/tasks with { url, prompt }. Agents get back a task URL supporting JSON polling, SSE, or NDJSON. The site controls what agents can do and gets analytics on what they actually did. We're probably wrong about some of this -- would appreciate the feedback.
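To make the task flow concrete, here's a minimal client sketch in Python. The endpoint and request shape (POST /v1/tasks with { url, prompt }) come straight from the description above; everything else - the base URL, the response field names ("status", "result"), and the sample NDJSON events - is assumed for illustration, not taken from the Rover spec.

```python
"""Hypothetical client for the Agent Task Protocol sketched above.

Assumptions (not from the spec): the base URL, the event field names
("status", "step", "result"), and the sample NDJSON stream.
"""
import json

BASE = "https://example-shop.com"  # hypothetical Rover-enabled site


def build_task_request(url: str, prompt: str) -> tuple[str, bytes]:
    """Return (endpoint, JSON body) for creating a task.

    A real client would POST the body to the endpoint and get back a
    task URL it can poll as JSON or consume as SSE/NDJSON.
    """
    body = json.dumps({"url": url, "prompt": prompt}).encode()
    return f"{BASE}/v1/tasks", body


def parse_ndjson(stream: str) -> list[dict]:
    """Parse an NDJSON task-event stream: one JSON object per line."""
    return [json.loads(line) for line in stream.splitlines() if line.strip()]


endpoint, body = build_task_request(
    f"{BASE}/checkout",
    "Add two of SKU-123 to the cart and check out",
)

# Stand-in for a live NDJSON response from the task URL:
events = parse_ndjson(
    '{"status": "running", "step": "navigating"}\n'
    '{"status": "done", "result": {"order_id": "A-1"}}\n'
)
```

The same task URL serving JSON polling, SSE, and NDJSON means a dumb agent can poll while a streaming-capable one subscribes, without the site doing anything different.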
Paper: https://www.rtrvr.ai/blog/agent-web-protocol-stack
Code: https://github.com/rtrvr-ai/rover (FSL-1.1-Apache-2.0)