Traditional databases only work over data that's already loaded and cleaned. But in the real world, data lives everywhere — in files, PDFs, web pages, APIs. To query it, we usually need custom ETL pipelines: extract, clean, transform, load. It’s slow, brittle, and different every time.
SwellDB flips that model: you define a table as a schema plus a natural-language description, and it generates the table just-in-time using LLMs over the connected data sources (files, databases, LLMs, web). Think of it as querying a DataFrame that materializes itself from raw input, without you writing the ingestion logic.
It supports:
- Structured + unstructured sources: CSV files, SQL databases, web search results (PDF support coming soon)
- Declarative table definitions in Python (a rough sketch follows this list)
- Output compatible with any SQL query engine (DuckDB, Apache DataFusion) or ingestible into any database
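To make the workflow concrete, here is a minimal, self-contained sketch of the idea. It is not SwellDB's actual API: the `TableDefinition` class, the mocked `generate_rows` step, and the column names are illustrative placeholders for the schema-plus-prompt definition and the LLM generation step; DuckDB handles the downstream SQL query.

```python
import duckdb
import pandas as pd
from dataclasses import dataclass

# Hypothetical stand-in for a SwellDB-style declarative table definition:
# a schema plus a natural-language prompt. Not SwellDB's real API.
@dataclass
class TableDefinition:
    name: str
    schema: dict   # column name -> SQL type
    prompt: str    # natural-language description of the table

def generate_rows(table: TableDefinition) -> pd.DataFrame:
    """Placeholder for the just-in-time generation step.

    In SwellDB this is where the LLM and the connected sources
    (files, databases, web search) would fill the table; here we
    return canned rows so the sketch runs without any API keys.
    """
    rows = [
        {"country": "France", "capital": "Paris", "population_m": 68.2},
        {"country": "Japan", "capital": "Tokyo", "population_m": 124.5},
    ]
    return pd.DataFrame(rows, columns=list(table.schema))

capitals = TableDefinition(
    name="capitals",
    schema={"country": "VARCHAR", "capital": "VARCHAR", "population_m": "DOUBLE"},
    prompt="Each row is a country, its capital city, and its population in millions.",
)

df = generate_rows(capitals)

# The generated table is an ordinary DataFrame, so any SQL engine can query it.
# DuckDB resolves local variables like `df` by name.
print(duckdb.sql("SELECT capital FROM df WHERE population_m > 100").df())
```

The point of the sketch is the shape of the workflow: a declarative definition in Python, generation just-in-time, and querying with standard SQL.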
Repo: https://github.com/SwellDB/SwellDB
Short paper (4 pages): https://github.com/gsvic/gsvic.github.io/blob/gh-pages/paper...
Would love feedback if you get a chance to try it out, especially from folks dealing with hybrid or messy data sources.
lisa_coicadan•6mo ago
We’re building something in a similar space at Retab.com, but with a different philosophy: instead of querying live across unstructured sources, we focus on reliably turning raw inputs (PDFs, scanned docs, images, etc.) into clean, structured outputs, using schema-guided LLM generation, multi-model consensus, and an evaluation dashboard. So it’s less about on-the-fly queries, and more about building robust pipelines where you can trust the output and audit how it was produced. Curious if you’ve thought about integrating evaluation or schema validation layers downstream, or if SwellDB is mainly about exploration? Excited to follow the project either way!
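On the schema-validation question: one common pattern is to validate LLM-generated rows against a typed model before they reach the query engine. A minimal sketch using Pydantic follows; the library choice, row model, and field constraints are illustrative assumptions, not part of SwellDB or Retab.

```python
from pydantic import BaseModel, Field, ValidationError

# Hypothetical row model for a generated "capitals" table;
# field names and constraints are made up for illustration.
class CapitalRow(BaseModel):
    country: str
    capital: str
    population_m: float = Field(gt=0)

def validate_rows(records: list[dict]) -> tuple[list[CapitalRow], list[str]]:
    """Split LLM-generated records into validated rows and error messages."""
    ok, errors = [], []
    for i, record in enumerate(records):
        try:
            ok.append(CapitalRow(**record))
        except ValidationError as exc:
            errors.append(f"row {i}: {exc.errors()}")
    return ok, errors

rows, problems = validate_rows([
    {"country": "France", "capital": "Paris", "population_m": 68.2},
    {"country": "Atlantis", "capital": "", "population_m": -1},  # fails the gt=0 check
])
print(len(rows), "valid rows;", len(problems), "rejected")
```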