I built smelt, a CLI tool in Go that extracts structured data (JSON, CSV, Parquet) from messy PDFs and HTML pages.

The core idea: LLMs are great at understanding structure but wasteful for bulk data extraction. So smelt uses a two-pass architecture:

1. A fast Go capture layer parses the document and detects table-like regions
2. Those regions (not the whole document) get sent to Claude for schema inference: column names, types, nesting
3. The Go layer then does deterministic extraction using the inferred schema
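To make step 1 concrete, here is a rough sketch of one way to detect table-like regions in already-extracted text lines. This is an illustration only, not smelt's actual code: the names `Region`, `columnCount`, and `detectRegions` are hypothetical, and the heuristic (consecutive lines that split into the same number of columns on tabs or runs of two or more spaces) is an assumption about what "table-like" could mean.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Region is a run of consecutive lines that look like table rows.
// Start is inclusive and End is exclusive, both indexes into the input.
// (Hypothetical type, for illustration only.)
type Region struct {
	Start, End int
}

// colSep treats a tab, or a run of two or more spaces, as a column gap.
var colSep = regexp.MustCompile(`\t| {2,}`)

// columnCount reports how many columns a line splits into.
// Blank lines count as zero columns.
func columnCount(line string) int {
	line = strings.TrimSpace(line)
	if line == "" {
		return 0
	}
	return len(colSep.Split(line, -1))
}

// detectRegions returns every run of at least minRows consecutive lines
// that share the same column count, where that count is 2 or more.
func detectRegions(lines []string, minRows int) []Region {
	var regions []Region
	start, cols := -1, 0

	// flush closes the current run (if any) at index end.
	flush := func(end int) {
		if start >= 0 && end-start >= minRows {
			regions = append(regions, Region{start, end})
		}
		start, cols = -1, 0
	}

	for i, line := range lines {
		n := columnCount(line)
		switch {
		case n < 2:
			// Prose or blank line: ends any run in progress.
			flush(i)
		case start >= 0 && n == cols:
			// Same shape as the current run: keep going.
		default:
			// Column count changed, or no run is open: start a new one.
			flush(i)
			start, cols = i, n
		}
	}
	flush(len(lines))
	return regions
}

func main() {
	lines := []string{
		"Invoice #1042",
		"",
		"Item      Qty   Price",
		"Widget    2     9.50",
		"Gadget    1     24.00",
		"",
		"Thank you for your business.",
	}
	for _, r := range detectRegions(lines, 2) {
		fmt.Printf("table-like region: lines %d to %d\n", r.Start, r.End-1)
	}
}
```

A real PDF parser would most likely work from glyph positions rather than whitespace in flattened text, so treat this as a stand-in for whatever the capture layer actually does.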

This means the LLM is never in the hot path of actual data processing. It figures out "what is this data?" once, and then Go handles the "extract 10,000 rows" part efficiently.
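The "infer once, extract many" split can be sketched roughly as below. Again this is an illustration under assumptions, not smelt's implementation: `Schema`, `Column`, `SchemaInferrer`, `fixedInferrer`, `convert`, and `extract` are hypothetical names, the three type labels are made up for the example, and the LLM call is replaced by a stub so the code runs offline.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strconv"
)

// Column describes one inferred field.
// Type is one of "string", "int", or "float" (labels assumed for this sketch).
type Column struct {
	Name string
	Type string
}

// Schema is what the inference step is assumed to return for a region.
type Schema struct {
	Columns []Column
}

// SchemaInferrer stands in for the LLM call. It only sees a small sample.
type SchemaInferrer interface {
	Infer(sample [][]string) (Schema, error)
}

// fixedInferrer is a stub so the sketch runs without a network call.
type fixedInferrer struct{ schema Schema }

func (f fixedInferrer) Infer([][]string) (Schema, error) { return f.schema, nil }

// convert coerces one cell to the column's declared type.
func convert(cell, typ string) (any, error) {
	switch typ {
	case "int":
		n, err := strconv.Atoi(cell)
		if err != nil {
			return nil, err
		}
		return n, nil
	case "float":
		f, err := strconv.ParseFloat(cell, 64)
		if err != nil {
			return nil, err
		}
		return f, nil
	default:
		return cell, nil
	}
}

// extract applies the schema to every row. It never calls the inferrer,
// so its cost does not depend on the LLM at all.
func extract(schema Schema, rows [][]string) ([]map[string]any, error) {
	out := make([]map[string]any, 0, len(rows))
	for i, row := range rows {
		if len(row) != len(schema.Columns) {
			return nil, fmt.Errorf("row %d: got %d cells, want %d",
				i, len(row), len(schema.Columns))
		}
		rec := make(map[string]any, len(row))
		for j, col := range schema.Columns {
			v, err := convert(row[j], col.Type)
			if err != nil {
				return nil, fmt.Errorf("row %d, column %q: %w", i, col.Name, err)
			}
			rec[col.Name] = v
		}
		out = append(out, rec)
	}
	return out, nil
}

func main() {
	rows := [][]string{{"Widget", "2", "9.50"}, {"Gadget", "1", "24.00"}}
	var inferrer SchemaInferrer = fixedInferrer{Schema{[]Column{
		{"item", "string"}, {"qty", "int"}, {"price", "float"},
	}}}

	// Pass 1: infer the schema once, from a small sample of the region.
	schema, err := inferrer.Infer(rows[:1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Pass 2: deterministic extraction over all rows.
	records, err := extract(schema, rows)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := json.NewEncoder(os.Stdout).Encode(records); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Putting the inferrer behind an interface keeps the extraction path testable without the model, and it makes the cost profile explicit: one inference call per region, then plain Go over however many rows there are.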

Usage is simple:

  smelt invoice.pdf --format json
  smelt https://example.com/pricing --format csv
  smelt report.pdf --schema   # just show the inferred structure

You can also pass --query "extract the revenue table" to focus extraction when a document has multiple tables.

Still early (no OCR yet, and HTML support is limited to <table> elements), but it handles the common cases well. Would love feedback on the architecture, especially from anyone who's dealt with PDF table extraction at scale.
