- Clean HTML conversion: transforms HTML into LLM-friendly markdown, with an option to extract just the main content
- LLM structured output: uses Gemini 2.5 Flash or GPT-4o mini to balance accuracy and cost; a custom prompt can also be supplied
- JSON sanitization: if the LLM's structured output fails or doesn't fully match your schema, a sanitization step attempts to recover and repair the data, which is especially useful for deeply nested objects and arrays
- URL validation: all extracted URLs are validated; relative URLs are resolved, invalid ones are removed, and markdown-escaped links are repaired
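To make the JSON sanitization step concrete, here is a minimal sketch of the kind of recovery it can perform, assuming Python; the function name and the specific repairs (stripping markdown code fences, removing trailing commas) are illustrative, not the actual implementation:

```python
import json
import re

def sanitize_json(raw: str):
    """Try to recover a JSON object from imperfect LLM output."""
    # LLMs sometimes wrap JSON in markdown code fences; strip them first
    raw = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Common failure mode: trailing commas before } or ] in nested
        # objects and arrays. Remove them and retry.
        repaired = re.sub(r",\s*([}\]])", r"\1", raw)
        return json.loads(repaired)

# Fenced output with trailing commas still parses after sanitization
data = sanitize_json('```json\n{"items": [1, 2,], }\n```')
```

A real sanitizer would also handle truncated output and schema mismatches, but the fence-stripping and trailing-comma repairs above cover two of the most frequent failure modes.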
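The URL validation step can be sketched along these lines, again in Python; the function name and the exact set of repairs are hypothetical, shown only to illustrate resolving relative URLs, dropping invalid ones, and un-escaping markdown backslashes:

```python
from urllib.parse import urljoin, urlparse

def clean_url(url: str, base: str):
    """Return an absolute, valid http(s) URL, or None if unrecoverable."""
    # Repair markdown escaping the LLM may have emitted (e.g. "\_" for "_")
    url = url.replace("\\_", "_").replace("\\-", "-")
    # Resolve relative URLs against the page's base URL
    absolute = urljoin(base, url)
    # Keep only URLs with an http(s) scheme and a host
    parts = urlparse(absolute)
    if parts.scheme in ("http", "https") and parts.netloc:
        return absolute
    return None

clean_url("/docs/intro", "https://example.com/page")      # resolved to absolute
clean_url("javascript:void(0)", "https://example.com/")   # rejected
```

This keeps the validation cheap and deterministic, so no extra LLM call is needed to fix link formatting.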
I'd love to hear if anyone else has experimented with LLMs for data extraction or if you have any questions about this approach!