Hi HN! I built TextWeb because I was burning tokens on vision models just to let AI agents fill out job applications.
TextWeb renders pages as structured text grids (~2-5KB) instead of screenshots (~1MB). Any LLM can read the output natively, no vision model needed. Interactive elements get reference numbers like [3]Click me and [7:____] Search, so agents say "click 3" or "type 7 hello".
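To make that concrete, here's a rough mock-up of what a simple job-board page might come back as (made-up output, not a real capture):

    Acme Careers                                        [1] Sign in
    ----------------------------------------------------------------
    Open roles
    [2:____________] Search roles    [3] Search
    Senior Frontend Engineer   Remote   Full-time       [4] Apply
    Data Engineer              Berlin   Full-time       [5] Apply

An agent would fill the search box with "type 2 frontend" and submit with "click 3".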
How it works: Headless Chromium renders the page normally; TextWeb then extracts each visible element's position, text, and interactivity and maps them onto a character grid. Spatial layout is preserved: things next to each other on screen are next to each other in the text.
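For anyone curious about the mechanics, here's a minimal sketch of the mapping idea in Node/TypeScript. It assumes Playwright and leaves out everything the real extraction has to handle (proper visibility checks, overlap resolution, reference numbering, scrolling, iframes); it is not TextWeb's actual code:

    import { chromium } from "playwright";

    // Sketch only: scale each element's pixel bounding box down to
    // character cells and stamp its text into a 2-D array, so on-screen
    // neighbors stay adjacent in the text output.
    const CELL_W = 8;   // assumed px per character column
    const CELL_H = 18;  // assumed px per character row

    async function renderGrid(url: string): Promise<string> {
      const browser = await chromium.launch();
      const page = await browser.newPage();
      await page.goto(url);

      // Collect leaf elements with their positions and visible text.
      const elements = await page.$$eval("body *", (nodes) =>
        nodes
          .filter((n) => n.children.length === 0) // leaf nodes only (simplification)
          .map((n) => {
            const r = n.getBoundingClientRect();
            return { x: r.x, y: r.y, w: r.width, h: r.height,
                     text: ((n as HTMLElement).innerText ?? "").trim() };
          })
          .filter((e) => e.w > 0 && e.h > 0 && e.x >= 0 && e.y >= 0 && e.text.length > 0)
      );
      await browser.close();

      // Stamp each element's text onto the grid at its scaled position.
      const cols = 120, rows = 60;
      const grid = Array.from({ length: rows }, () => Array(cols).fill(" "));
      for (const e of elements) {
        const row = Math.floor(e.y / CELL_H);
        let col = Math.floor(e.x / CELL_W);
        if (row < 0 || row >= rows || col < 0 || col >= cols) continue;
        for (const ch of e.text.slice(0, cols - col)) grid[row][col++] = ch;
      }
      return grid.map((r) => r.join("").trimEnd()).join("\n");
    }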
Integrations: MCP server (Claude Desktop, Cursor, Windsurf), OpenAI/Anthropic function-calling tool definitions, LangChain, CrewAI, HTTP API, CLI, Node.js library.
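If it helps to picture the function-calling side, the tool definitions could look roughly like this (illustrative OpenAI-style schema; names and fields are hypothetical, not the shipped definitions):

    // Hypothetical tool definitions for the "click 3" / "type 7 hello" actions.
    const tools = [
      {
        type: "function",
        function: {
          name: "click",
          description: "Click the element with the given reference number, e.g. [3].",
          parameters: {
            type: "object",
            properties: { ref: { type: "integer", description: "Reference number from the grid" } },
            required: ["ref"],
          },
        },
      },
      {
        type: "function",
        function: {
          name: "type_text",
          description: "Type text into the input with the given reference number, e.g. [7:____].",
          parameters: {
            type: "object",
            properties: {
              ref: { type: "integer" },
              text: { type: "string" },
            },
            required: ["ref", "text"],
          },
        },
      },
    ];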
Would love feedback on the grid format and what integrations would be most useful.