Show HN: Robust LLM Extractor for Websites in TypeScript

https://github.com/lightfeed/extractor
14•andrew_zhong•2h ago
We've been building data pipelines that scrape websites and extract structured data for a while now. If you've done this, you know the drill: you write CSS selectors, the site changes its layout, everything breaks at 2am, and you spend your morning rewriting parsers.

LLMs seemed like the obvious fix — just throw the HTML at GPT and ask for JSON. Except in practice, it's more painful than that:

- Raw HTML is full of nav bars, footers, and tracking junk that eats your token budget. A typical product page is 80% noise.
- LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.
- Relative URLs, markdown-escaped links, tracking parameters: the "small" URL issues compound fast when you're processing thousands of pages.
- You end up writing the same boilerplate: HTML cleanup → markdown conversion → LLM call → JSON parsing → error recovery → schema validation. Over and over.
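
To make the URL point concrete, here is a minimal sketch of the kind of cleanup involved. It is not the library's code: `cleanUrl` and the `TRACKING_PARAMS` list are hypothetical names made up for this illustration, and it relies only on the standard `URL` class.

```typescript
// Illustrative list of tracking parameters to drop (an assumption, not the
// library's actual list). Anything starting with "utm_" is dropped as well.
const TRACKING_PARAMS = new Set(["fbclid", "gclid"]);

// Hypothetical helper: resolve a link found in page content against the page
// URL, undo markdown escaping, and strip tracking parameters.
function cleanUrl(href: string, pageUrl: string): string | null {
  // Undo markdown escapes such as "\_" or "\(". Left in place, the URL parser
  // would treat the backslash as a path separator.
  const unescaped = href.replace(/\\([_()[\]*~])/g, "$1");

  let url: URL;
  try {
    // The second argument resolves relative links against the page URL.
    url = new URL(unescaped, pageUrl);
  } catch {
    return null; // not a parseable URL
  }

  // Collect the keys first so we do not mutate while iterating.
  const keys: string[] = [];
  url.searchParams.forEach((_value, key) => keys.push(key));
  for (const key of keys) {
    if (key.startsWith("utm_") || TRACKING_PARAMS.has(key)) {
      url.searchParams.delete(key);
    }
  }
  return url.toString();
}

console.log(cleanUrl("/products/42?utm_source=x&color=red", "https://example.com/shop/"));
// prints "https://example.com/products/42?color=red"
```

Returning `null` instead of throwing keeps one bad link from aborting a batch of thousands of pages.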

We got tired of rebuilding this stack for every project, so we extracted it into a library.

Lightfeed Extractor is a TypeScript library that handles the full pipeline from raw HTML to validated, structured data:

- Converts HTML to LLM-ready markdown with main content extraction (strips nav, headers, footers), optional image inclusion, and URL cleaning
- Works with any LangChain-compatible LLM (OpenAI, Gemini, Claude, Ollama, etc.)
- Uses Zod schemas for type-safe extraction with real validation
- Recovers partial data from malformed LLM output instead of failing entirely: if 19 out of 20 products parsed correctly, you get those 19
- Built-in browser automation via Playwright (local, serverless, or remote) with anti-bot patches
- Pairs with our browser agent (@lightfeed/browser-agent) for AI-driven page navigation before extraction
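
The partial-recovery idea can be sketched roughly as follows. This is an illustration under stated assumptions, not how the library implements it: the `Product` shape, `isProduct`, `extractObjectSpans`, and `recoverProducts` are all hypothetical, a hand-written type guard stands in for a Zod schema, and the LLM output is assumed to be a flat JSON array of objects that may be truncated or contain a broken item.

```typescript
// Hypothetical item shape, used only for this illustration.
interface Product {
  name: string;
  price: number;
}

// Plain type guard standing in for a real schema validator such as Zod.
function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.name === "string" && typeof v.price === "number";
}

// Find balanced top-level {...} spans in raw text. String literals are
// tracked so that braces inside strings do not affect the depth count.
function extractObjectSpans(raw: string): string[] {
  const spans: string[] = [];
  let depth = 0;
  let start = -1;
  let inString = false;
  let escaped = false;
  for (let i = 0; i < raw.length; i++) {
    const ch = raw[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') {
      inString = true;
    } else if (ch === "{") {
      if (depth === 0) start = i;
      depth++;
    } else if (ch === "}" && depth > 0) {
      depth--;
      if (depth === 0) spans.push(raw.slice(start, i + 1));
    }
  }
  return spans; // an unclosed trailing object is never pushed
}

// Parse and validate each span on its own, keeping only the items that pass.
function recoverProducts(raw: string): Product[] {
  const products: Product[] = [];
  for (const span of extractObjectSpans(raw)) {
    try {
      const parsed: unknown = JSON.parse(span);
      if (isProduct(parsed)) products.push(parsed);
    } catch {
      // This item is malformed; skip it and keep the rest.
    }
  }
  return products;
}
```

Given an output with one unquoted key and a truncated tail, such as `[{"name":"A","price":1},{"name":"B",price:2},{"name":"C","price":3},{"name":"D","pri`, `recoverProducts` returns the two valid items (A and C) rather than throwing on the whole payload.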

We use this ourselves in production at Lightfeed, and it's been solid enough that we decided to open-source it.

GitHub: https://github.com/lightfeed/extractor
npm: npm install @lightfeed/extractor
Apache 2.0 licensed.

Happy to answer questions or hear feedback.

Comments

plastic041•59m ago
> Avoid detection with built-in anti-bot patches and proxy configuration for reliable web scraping.

And it doesn't care about robots.txt.

Flux159•47m ago
This looks pretty interesting! I haven't used it yet, but I looked through the code a bit. It looks like it uses turndown to convert the HTML to markdown first and then passes that to the LLM, so I'm assuming the preprocessing gives a huge reduction in tokens. Do you have any data on how often this causes issues, i.e. tables or other information being lost?

Then it uses LangChain and structured schemas for the output, along with a specific system prompt for the LLM. Do you know which open source models work best, or do you just use Gemini in production?

Also, looking at the docs, Gemini 2.5 flash is getting deprecated by June 17th https://ai.google.dev/gemini-api/docs/deprecations#gemini-2.... (I keep getting emails from Google about it), so might want to update that to Gemini 3 Flash in the examples.

zx8080•28m ago
Robots.txt anyone?

sheept•13m ago
> LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.

This might be one reason why Claude Code uses XML for tool calling: repeating the tag name in the closing bracket helps it keep track of where it is during inference, so it is less error prone.

A Geometric Resolution of the Vacuum Catastrophe via 3-Torus Topology

https://drive.google.com/file/d/1NUxRyGn7P72ptlCYsoZcxRdI3Xa0e6Gd/view?usp=sharing
1•avonmach•33s ago•0 comments

How are teachers handling writing feedback at scale?

1•uuuAA•1m ago•0 comments

LiteLLM Supply Chain Attack: Defense in Depth Is the Only AI Security Strategy

https://www.runtimeai.io/blog-litellm-attack.html
2•roshanshaik•18m ago•0 comments

Zipcode specific inflation to understand local price changes

https://whatchanged.us/
1•ryan_j_naughton•22m ago•0 comments

Show HN: Spectator – A programming language for Cybersecurity and Hacking

1•CzaxTanmay•22m ago•0 comments

Spotting issues in DeFi with dimensional analysis

https://blog.trailofbits.com/2026/03/24/spotting-issues-in-defi-with-dimensional-analysis/
1•anitil•25m ago•1 comments

Iran rejects US proposal, lays out five conditions for ending war

https://www.presstv.ir/Detail/2026/03/25/765835/iran-rejects-us-proposal-lays-out-five-conditions...
1•Jimmc414•27m ago•0 comments

OmniWM – Niri and Dwindle tiling window manager for macOS

https://github.com/BarutSRB/OmniWM
1•gedy•38m ago•0 comments

Should Investors Demand Better Liquidation Terms for SAFEs?

https://natlawreview.com/article/should-investors-demand-better-liquidation-terms-safes
1•petethomas•40m ago•0 comments

Injecting Tracing the Hot Way

https://underjord.io/injecting-tracing-the-hot-way.html
1•lawik•43m ago•0 comments

The coming PLG to SLG apocalypse

https://www.withsahel.com/blog/plg-to-enterprise-timeline-compression
1•iajiboye•45m ago•1 comments

Ask HN: Can I somehow exit HN desktop view on mobile?

1•hxugufjfjf•47m ago•1 comments

Show HN: AutoSW-Like AutoResearch but for software:SW Systems that Builds itself

https://pub.towardsai.net/the-software-that-built-itself-well-defined-intents-are-all-you-need-06...
1•alexcpn•48m ago•1 comments

Show HN: Orloj – agent infrastructure as code (YAML and GitOps)

https://github.com/OrlojHQ/orloj
1•An0n_Jon•48m ago•0 comments

AI Arbitrator

https://www.adr.org/ai-arbitrator/
1•dqv•52m ago•0 comments

Show HN: Scope – a beautiful open-source web client for Stremio

https://github.com/scope-player/scope
1•judekim•55m ago•1 comments

Show HN: Agent Ruler new update v0.1.9

1•steadeepanda•58m ago•1 comments

Show HN: I built a lawyer game with AI

https://legalarena.app
1•divyanthj•59m ago•0 comments

Proactively Parasocial

https://nicklandolfi.com/posts/proactively-parasocial.html
1•jxmorris12•1h ago•0 comments

One man company is possible

https://mdalpha.ai/
1•sam890306•1h ago•1 comments

Show HN: Prompt Guard–MitM proxy that blocks secrets before they reach AI APIs

https://github.com/chaudharydeepak/prompt-guard
2•chaudharydeepak•1h ago•2 comments

I didn't understand TurboQuant, so I made this explainer

https://cyrusradfar.com/thoughts/turboquant-explainer
3•cyrusradfar•1h ago•0 comments

Terafab

https://en.wikipedia.org/wiki/Terafab
1•gnarlouse•1h ago•1 comments

My 1k Favorite Books

https://kylebenzle.com/bookreviews.php
1•hilliardfarmer•1h ago•0 comments

Show HN: Agent Kernel – Three Markdown files that make any AI agent stateful

https://agent-kernel.dev/?v=1.0.0
1•obilgic•1h ago•0 comments

HN: Surviving the litellm supply chain attack with a pure ctypes OS Vault

https://github.com/MACCRE-2026/MACCRE-Sovereign-Auth
2•MACCRE•1h ago•0 comments

Creating Readerly

https://nicoverbruggen.be/blog/creating-readerly
3•ganksalot•1h ago•0 comments

Squirrel seen 'vaping' in London park

https://www.telegraph.co.uk/news/2026/03/23/squirrel-seen-vaping-in-london-park/
18•walterbell•1h ago•3 comments

Research Shows Verbatim Recall of Copyrighted Books in LLMs

https://cauchy221.github.io/Alignment-Whack-a-Mole/
2•nsagent•1h ago•0 comments

Show HN: Send Love, Send Letters (at cost)

https://mappymail.com
2•pruetj•1h ago•0 comments