Show HN: Robust LLM Extractor for Websites in TypeScript

24•andrew_zhong•2h ago

We've been building data pipelines that scrape websites and extract structured data for a while now. If you've done this, you know the drill: you write CSS selectors, the site changes its layout, everything breaks at 2am, and you spend your morning rewriting parsers.

LLMs seemed like the obvious fix — just throw the HTML at GPT and ask for JSON. Except in practice, it's more painful than that:

- Raw HTML is full of nav bars, footers, and tracking junk that eats your token budget. A typical product page is 80% noise.
- LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.
- Relative URLs, markdown-escaped links, tracking parameters: the "small" URL issues compound fast when you're processing thousands of pages.
- You end up writing the same boilerplate: HTML cleanup → markdown conversion → LLM call → JSON parsing → error recovery → schema validation. Over and over.
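To make the URL issues in the list above concrete, here is a minimal sketch of one cleanup step. It is not Lightfeed Extractor's code: the `cleanUrl` helper and the `TRACKING_PARAMS` list are hypothetical names chosen for illustration, and the set of parameters worth stripping is an assumption that will vary by site.

```typescript
// Hypothetical helper for illustration; not part of any library API.
// Assumed list of tracking parameters; real pipelines usually need a longer one.
const TRACKING_PARAMS = ["gclid", "fbclid", "mc_eid"];

function cleanUrl(rawUrl: string, pageUrl: string): string {
  // Undo markdown escaping such as "\_" or "\(" that HTML-to-markdown
  // conversion can introduce inside link targets.
  const unescaped = rawUrl.replace(/\\([_()\[\]*~])/g, "$1");

  // Resolve relative links against the URL of the page they came from.
  const url = new URL(unescaped, pageUrl);

  // Collect the keys first, then delete, so we never mutate while iterating.
  const keys: string[] = [];
  url.searchParams.forEach((_value, key) => {
    keys.push(key);
  });
  for (const key of keys) {
    if (/^utm_/.test(key) || TRACKING_PARAMS.indexOf(key) !== -1) {
      url.searchParams.delete(key);
    }
  }
  return url.toString();
}

// A relative, markdown-escaped link with a tracking parameter:
const cleaned = cleanUrl(
  "/p/red\\_shoe?utm_source=x&id=7",
  "https://shop.example.com/c/shoes"
);
// cleaned === "https://shop.example.com/p/red_shoe?id=7"
```

Unescaping before parsing matters here: the URL parser treats a backslash in an http(s) URL as a path separator, so a leftover `\_` would silently change the path.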

We got tired of rebuilding this stack for every project, so we extracted it into a library.

Lightfeed Extractor is a TypeScript library that handles the full pipeline from raw HTML to validated, structured data:

- Converts HTML to LLM-ready markdown with main content extraction (strips nav, headers, footers), optional image inclusion, and URL cleaning
- Works with any LangChain-compatible LLM (OpenAI, Gemini, Claude, Ollama, etc.)
- Uses Zod schemas for type-safe extraction with real validation
- Recovers partial data from malformed LLM output instead of failing entirely: if 19 out of 20 products parsed correctly, you get those 19
- Built-in browser automation via Playwright (local, serverless, or remote) with anti-bot patches
- Pairs with our browser agent (@lightfeed/browser-agent) for AI-driven page navigation before extraction
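The "you get those 19" behaviour can be approximated in a few lines. The sketch below is one possible approach, not the library's implementation: `salvageObjects` is a hypothetical name, and it assumes the model was asked for a top-level JSON array of objects whose output was cut off or lost a bracket partway through.

```typescript
// Hypothetical helper for illustration; not the library's recovery logic.
// Scans the text, tracks brace depth outside of string literals, and parses
// every outermost {...} object that closes properly.
function salvageObjects(text: string): unknown[] {
  const items: unknown[] = [];
  let depth = 0; // number of "{" currently open
  let start = -1; // index where the current outermost object began
  let inString = false;
  let escaped = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      // Inside a string literal, braces are data, not structure.
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') {
      inString = true;
    } else if (ch === "{") {
      if (depth === 0) start = i;
      depth++;
    } else if (ch === "}" && depth > 0) {
      depth--;
      if (depth === 0) {
        try {
          items.push(JSON.parse(text.slice(start, i + 1)));
        } catch {
          // Balanced braces but still invalid JSON: skip this object.
        }
      }
    }
  }
  return items;
}

// The third object is truncated mid-key, so only the first two are recovered.
const truncated =
  '[{"name":"A","price":1},{"name":"B","tags":["x","y"]},{"name":"C","pri';
const recovered = salvageObjects(truncated);
// recovered.length === 2
```

This only handles the array-of-objects case. If the array is wrapped in an outer object that never closes, nothing is recovered, so a real repair layer would need to handle more shapes than this.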

We use this ourselves in production at Lightfeed, and it's been solid enough that we decided to open-source it.

GitHub: https://github.com/lightfeed/extractor
npm: npm install @lightfeed/extractor
Apache 2.0 licensed.

Happy to answer questions or hear feedback.

Comments

plastic041•1h ago

> Avoid detection with built-in anti-bot patches and proxy configuration for reliable web scraping.

And it doesn't care about robots.txt.

andrew_zhong•53m ago

Good point. The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated — things like CDP leak fixes so Cloudflare doesn't block you mid-session. It's not about bypassing access restrictions.

Our main use case is retail price monitoring — comparing publicly listed product prices across e-commerce sites, which is pretty standard in the industry. But fair point, we should make that clearer in the README.

messe•18m ago

> It's not about bypassing access restrictions.

Yes. It is. You've just made an arbitrary choice not to define it as such.

Flux159•1h ago

This looks pretty interesting! I haven't used it yet, but I looked through the code a bit. It looks like it uses turndown to convert the HTML to markdown first and then passes that to the LLM, so I'm assuming the preprocessing gives a huge reduction in tokens. Do you have any data on how often this causes issues, i.e. tables or other information being lost?

Then it uses LangChain and structured schemas for the output, along with a specific system prompt for the LLM. Do you know which open source models work best, or do you just use Gemini in production?

Also, looking at the docs, Gemini 2.5 flash is getting deprecated by June 17th https://ai.google.dev/gemini-api/docs/deprecations#gemini-2.... (I keep getting emails from Google about it), so might want to update that to Gemini 3 Flash in the examples.

zx8080•1h ago

Robots.txt anyone?

andrew_zhong•53m ago

reyqn•30m ago

https://news.ycombinator.com/item?id=47340079

sheept•1h ago

> LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.

This might be one reason why Claude Code uses XML for tool calling: repeating the tag name in the closing bracket helps it keep track of where it is during inference, so it is less error prone.

andrew_zhong•51m ago

Yeah that's a good observation. XML's closing tags give the model structural anchors during generation — it knows where it is in the nesting. JSON doesn't have that, so the deeper the nesting the more likely the model loses track of brackets.

We see this especially with arrays of objects where each object has optional nested fields. The model will get 18 items right and then drop a closing bracket on item 19, or emit a field of the wrong type. That's why we put effort into the repair/recovery/sanitization layer: validate field-by-field and keep what's valid rather than throwing everything out.
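A minimal sketch of the "validate field-by-field and keep what's valid" idea, written without any schema library so it stays self-contained. The `Product` shape, `parseProduct`, and `keepValid` are hypothetical names for illustration, and the policy shown (reject an item on a bad required field, silently drop a bad optional field) is an assumption rather than the library's documented behaviour.

```typescript
// Hypothetical shape and helpers for illustration only.
interface Product {
  name: string;
  price: number;
  url?: string;
}

function parseProduct(value: unknown): Product | null {
  if (typeof value !== "object" || value === null) return null;
  const v = value as Record<string, unknown>;

  // Required fields: a wrong type rejects the whole item.
  if (typeof v.name !== "string" || typeof v.price !== "number") return null;

  const product: Product = { name: v.name, price: v.price };
  // Optional field: keep it when valid, drop it when the type is wrong,
  // instead of rejecting an otherwise good item.
  if (typeof v.url === "string") product.url = v.url;
  return product;
}

function keepValid(items: unknown[]): { valid: Product[]; rejected: number } {
  const valid: Product[] = [];
  let rejected = 0;
  for (const item of items) {
    const product = parseProduct(item);
    if (product) valid.push(product);
    else rejected++;
  }
  return { valid, rejected };
}

const llmItems: unknown[] = [
  { name: "A", price: 1 },
  { name: "B", price: "2" }, // price has the wrong type: rejected
  { name: "C", price: 3, url: 42 }, // bad optional url: item kept, url dropped
];
const result = keepValid(llmItems);
// result.valid.length === 2, result.rejected === 1
```

Counting rejections rather than discarding them silently makes it possible to alert when a site or model change pushes the failure rate up.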

dmos62•58m ago

What's your experience with not getting blocked by anti-bot systems? I see you've custom patches for that.

andrew_zhong•34m ago

The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated — fixing CDP leaks, removing automation flags, etc. For sites behind Cloudflare or Datadome, that alone usually isn't enough — you'll need residential proxies and proper browser fingerprints on top. The library supports connecting to remote scraping browsers via WebSocket and proxy configuration for those cases.

AirMax98•15m ago

This feels like slop to me.

It may or may not be, but if you want people to actually use this product I’d suggest improving your documentation and replies here to not look like raw Claude output.

I also doubt the premise about malformed JSON. I have never encountered anything like what you are describing with structured outputs.
