The core idea: LLMs are great at understanding structure but wasteful for bulk data extraction. So smelt uses a two-pass architecture:
1. A fast Go capture layer parses the document and detects table-like regions 2. Those regions (not the whole document) get sent to Claude for schema inference — column names, types, nesting 3. The Go layer then does deterministic extraction using the inferred schema
This means the LLM is never in the hot path of actual data processing. It figures out "what is this data?" once, and then Go handles the "extract 10,000 rows" part efficiently.
Usage is simple:
smelt invoice.pdf --format json
smelt https://example.com/pricing --format csv
smelt report.pdf --schema # just show the inferred structure
You can also pass --query "extract the revenue table" to focus extraction when a document has multiple tables.Still early (no OCR yet, HTML is limited to <table> elements), but it handles the common cases well. Would love feedback on the architecture — especially from anyone who's dealt with PDF table extraction at scale.