I kept solving this problem from scratch on different projects, so I packaged it up as Reader, hoping it saves others the same headaches...
Two primitives:
const reader = new ReaderClient();
// Scrape URLs → clean markdown
const result = await reader.scrape({ urls: ["https://example.com"] });
// Crawl a site → discover + scrape pages
const pages = await reader.crawl({ url: "https://example.com", depth: 2 });
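To give a sense of end-to-end usage, here's roughly how I wire the crawl output into files, continuing from the snippet above. The per-page fields shown (url, markdown) are illustrative rather than a spec:

import { writeFile } from "node:fs/promises";

// Persist each crawled page as a markdown file (field names are illustrative)
for (const page of pages) {
  const slug = new URL(page.url).pathname.replaceAll("/", "_").replace(/^_+|_+$/g, "") || "index";
  await writeFile(`${slug}.md`, page.markdown);
}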
Under the hood it's built on Ulixee Hero, a headless browser designed for anti-detection. The hard stuff (TLS fingerprinting, Cloudflare/Turnstile bypass, browser pool recycling, proxy rotation) is built in. The HTML-to-markdown conversion uses supermarkdown, a Rust engine I built specifically for messy real-world HTML: clean output, no artifacts.
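For anyone curious what Reader is abstracting away: this is roughly the bare Ulixee Hero ceremony you'd otherwise write per page (standard Hero quickstart API, a sketch without any of the pooling, rotation, or markdown conversion layered on top):

import Hero from "@ulixee/hero-playground";

// One-off Hero session: single page, no pool, no proxy rotation, raw HTML out
const hero = new Hero();
await hero.goto("https://example.com");
await hero.waitForPaintingStable();
const html = await hero.document.documentElement.outerHTML;
await hero.close();
// Reader manages these sessions in a recycled pool and converts the HTML to markdown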
TypeScript-first with full type safety, usable as a CLI or as a library. Apache 2.0 licensed.
GitHub: https://github.com/vakra-dev/reader
Happy to answer questions about the architecture, approach, or tradeoffs I made.
Would love feedback from anyone doing web scraping at scale, especially on edge cases where it breaks. That's how I can make this better.