Show HN: Robust LLM Extractor for Websites in TypeScript

https://github.com/lightfeed/extractor
15•andrew_zhong•2h ago
We've been building data pipelines that scrape websites and extract structured data for a while now. If you've done this, you know the drill: you write CSS selectors, the site changes its layout, everything breaks at 2am, and you spend your morning rewriting parsers.

LLMs seemed like the obvious fix — just throw the HTML at GPT and ask for JSON. Except in practice, it's more painful than that:

- Raw HTML is full of nav bars, footers, and tracking junk that eats your token budget. A typical product page is 80% noise.
- LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.
- Relative URLs, markdown-escaped links, tracking parameters: the "small" URL issues compound fast when you're processing thousands of pages.
- You end up writing the same boilerplate: HTML cleanup → markdown conversion → LLM call → JSON parsing → error recovery → schema validation. Over and over.
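
To make the URL point above concrete, here is a minimal sketch of that kind of cleanup using only the standard `URL` API. The helper name `cleanUrl` and the tracking-parameter list are assumptions for illustration; this is not the library's code or API.

```typescript
// Hypothetical helper for illustration; not part of @lightfeed/extractor.
// The list of tracking parameters is an assumption, not an exhaustive set.
const TRACKING_PARAMS: string[] = [
  "utm_source",
  "utm_medium",
  "utm_campaign",
  "utm_term",
  "utm_content",
  "gclid",
  "fbclid",
];

function cleanUrl(href: string, baseUrl: string): string {
  // Undo markdown escaping (e.g. "\_" or "\&") that can leak into link targets.
  const unescaped = href.replace(/\\([_&()\[\]*~#-])/g, "$1");

  // Resolve relative links against the URL of the page they came from.
  const url = new URL(unescaped, baseUrl);

  // Drop known tracking parameters and keep everything else.
  for (const name of TRACKING_PARAMS) {
    url.searchParams.delete(name);
  }
  return url.toString();
}

console.log(cleanUrl("/products/42?utm_source=news&color=red", "https://example.com/shop/"));
console.log(cleanUrl("item\\_1.html?fbclid=abc", "https://example.com/shop/"));
```

Resolving against the page URL before stripping parameters keeps the function usable on both relative and absolute links.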

We got tired of rebuilding this stack for every project, so we extracted it into a library.

Lightfeed Extractor is a TypeScript library that handles the full pipeline from raw HTML to validated, structured data:

- Converts HTML to LLM-ready markdown with main content extraction (strips nav, headers, footers), optional image inclusion, and URL cleaning
- Works with any LangChain-compatible LLM (OpenAI, Gemini, Claude, Ollama, etc.)
- Uses Zod schemas for type-safe extraction with real validation
- Recovers partial data from malformed LLM output instead of failing entirely: if 19 out of 20 products parsed correctly, you get those 19
- Built-in browser automation via Playwright (local, serverless, or remote) with anti-bot patches
- Pairs with our browser agent (@lightfeed/browser-agent) for AI-driven page navigation before extraction
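
The partial-recovery idea in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the library's implementation: `Product`, `isProduct`, and `recoverItems` are hypothetical names, and a hand-written type guard stands in for the Zod schema so the snippet has no dependencies. It only covers output that parses as JSON but contains some invalid items; it does not try to repair broken brackets.

```typescript
// Hypothetical types and helpers for illustration; not the library's API.
interface Product {
  name: string;
  price: number;
}

// Stand-in for a schema check; a Zod schema would play this role in practice.
function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.name === "string" &&
    typeof v.price === "number" &&
    Number.isFinite(v.price)
  );
}

interface Recovered<T> {
  items: T[]; // entries that passed validation
  rejected: unknown[]; // entries that failed, kept for logging or retries
}

function recoverItems<T>(
  raw: string,
  isValid: (value: unknown) => value is T
): Recovered<T> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // Unparseable output: this simple sketch recovers nothing.
    return { items: [], rejected: [raw] };
  }

  // Validate item by item so one bad entry does not discard the rest.
  const candidates: unknown[] = Array.isArray(parsed) ? parsed : [parsed];
  const items: T[] = [];
  const rejected: unknown[] = [];
  for (const candidate of candidates) {
    if (isValid(candidate)) {
      items.push(candidate);
    } else {
      rejected.push(candidate);
    }
  }
  return { items, rejected };
}

const llmOutput = JSON.stringify([
  { name: "Kettle", price: 24.5 },
  { name: "Toaster", price: "n/a" }, // invalid: price is not a number
  { name: "Blender", price: 59 },
]);
const result = recoverItems(llmOutput, isProduct);
console.log(result.items.length, result.rejected.length);
```

Returning the rejected entries alongside the valid ones lets a caller decide whether to retry, log, or ignore the failures.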

We use this ourselves in production at Lightfeed, and it's been solid enough that we decided to open-source it.

GitHub: https://github.com/lightfeed/extractor
npm: npm install @lightfeed/extractor
Apache 2.0 licensed.

Happy to answer questions or hear feedback.

Comments

plastic041•1h ago
> Avoid detection with built-in anti-bot patches and proxy configuration for reliable web scraping.

And it doesn't care about robots.txt.

Flux159•51m ago
This looks pretty interesting! I haven't used it yet, but I looked through the code a bit. It looks like it uses turndown to convert the HTML to markdown first and then passes that to the LLM, so I assume the preprocessing gives a huge reduction in tokens. Do you have any data on how often this causes issues, i.e. tables or other information being lost?

Then it uses LangChain and structured schemas for the output, along with a specific system prompt for the LLM. Do you know which open source models work best, or do you just use Gemini in production?

Also, looking at the docs, Gemini 2.5 Flash is getting deprecated by June 17th https://ai.google.dev/gemini-api/docs/deprecations#gemini-2.... (I keep getting emails from Google about it), so you might want to update that to Gemini 3 Flash in the examples.

zx8080•32m ago
Robots.txt anyone?

sheept•17m ago
> LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.

This might be one reason why Claude Code uses XML for tool calling: repeating the tag name in the closing bracket helps it keep track of where it is during inference, so it is less error prone.

dmos62•3m ago
What's your experience with not getting blocked by anti-bot systems? I see you've custom patches for that.
