Hi HN! I built TextWeb because I was burning tokens on vision models just to let AI agents fill out job applications.
TextWeb renders pages as structured text grids (~2-5KB) instead of screenshots (~1MB). Any LLM can read the output natively, no vision model needed. Interactive elements get reference numbers like [3]Click me and [7:____] Search, so agents say "click 3" or "type 7 hello".
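To make that concrete, here's a rough mock-up of what a simple job-board page might come back as (made-up output, not a real capture):

    Acme Careers                                        [1] Sign in
    ----------------------------------------------------------------
    Open roles
    [2:____________] Search roles    [3] Search
    Senior Frontend Engineer   Remote   Full-time       [4] Apply
    Data Engineer              Berlin   Full-time       [5] Apply

An agent would fill the search box with "type 2 frontend" and submit with "click 3".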
How it works: Headless Chromium renders the page normally; TextWeb then extracts each visible element's position, text, and interactivity and maps them onto a character grid. Spatial layout is preserved: things next to each other on screen are next to each other in the text.
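For anyone curious about the mechanics, here's a minimal sketch of the mapping idea in Node/TypeScript. It assumes Playwright and leaves out everything the real extraction has to handle (proper visibility checks, overlap resolution, reference numbering, scrolling, iframes); it is not TextWeb's actual code:

    import { chromium } from "playwright";

    // Sketch only: scale each element's pixel bounding box down to
    // character cells and stamp its text into a 2-D array, so on-screen
    // neighbors stay adjacent in the text output.
    const CELL_W = 8;   // assumed px per character column
    const CELL_H = 18;  // assumed px per character row

    async function renderGrid(url: string): Promise<string> {
      const browser = await chromium.launch();
      const page = await browser.newPage();
      await page.goto(url);

      // Collect leaf elements with their positions and visible text.
      const elements = await page.$$eval("body *", (nodes) =>
        nodes
          .filter((n) => n.children.length === 0) // leaf nodes only (simplification)
          .map((n) => {
            const r = n.getBoundingClientRect();
            return { x: r.x, y: r.y, w: r.width, h: r.height,
                     text: ((n as HTMLElement).innerText ?? "").trim() };
          })
          .filter((e) => e.w > 0 && e.h > 0 && e.x >= 0 && e.y >= 0 && e.text.length > 0)
      );
      await browser.close();

      // Stamp each element's text onto the grid at its scaled position.
      const cols = 120, rows = 60;
      const grid = Array.from({ length: rows }, () => Array(cols).fill(" "));
      for (const e of elements) {
        const row = Math.floor(e.y / CELL_H);
        let col = Math.floor(e.x / CELL_W);
        if (row < 0 || row >= rows || col < 0 || col >= cols) continue;
        for (const ch of e.text.slice(0, cols - col)) grid[row][col++] = ch;
      }
      return grid.map((r) => r.join("").trimEnd()).join("\n");
    }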
Integrations: MCP server (Claude Desktop, Cursor, Windsurf), OpenAI/Anthropic function-calling tool definitions, LangChain, CrewAI, HTTP API, CLI, Node.js library.
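If it helps to picture the function-calling side, the tool definitions could look roughly like this (illustrative OpenAI-style schema; names and fields are hypothetical, not the shipped definitions):

    // Hypothetical tool definitions for the "click 3" / "type 7 hello" actions.
    const tools = [
      {
        type: "function",
        function: {
          name: "click",
          description: "Click the element with the given reference number, e.g. [3].",
          parameters: {
            type: "object",
            properties: { ref: { type: "integer", description: "Reference number from the grid" } },
            required: ["ref"],
          },
        },
      },
      {
        type: "function",
        function: {
          name: "type_text",
          description: "Type text into the input with the given reference number, e.g. [7:____].",
          parameters: {
            type: "object",
            properties: {
              ref: { type: "integer" },
              text: { type: "string" },
            },
            required: ["ref", "text"],
          },
        },
      },
    ];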
Would love feedback on the grid format and what integrations would be most useful.