Traditional databases only work over data that's already loaded and cleaned. But in the real world, data lives everywhere — in files, PDFs, web pages, APIs. To query it, we usually need custom ETL pipelines: extract, clean, transform, load. It’s slow, brittle, and different every time.
SwellDB flips that model: you define a table declaratively, as a schema plus a natural-language description, and it generates the table just-in-time, using LLMs to pull from the connected data sources (files, databases, the web, or the LLM itself). Think of it as querying a DataFrame that materializes itself from raw input, without you writing the ingestion logic.
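To make that concrete, here's a rough sketch of what a table definition could look like. The names below (SwellDB, create_table, the schema/prompt arguments) are illustrative placeholders rather than the library's actual API; see the repo for the real interface.

    # Hypothetical sketch -- class and argument names are placeholders,
    # not SwellDB's real API.
    from swelldb import SwellDB  # illustrative import

    db = SwellDB()
    cities = db.create_table(
        name="european_cities",
        schema={"city": "VARCHAR", "country": "VARCHAR", "population": "BIGINT"},
        prompt="The ten most populous cities in Europe, with country and population.",
    )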
It supports:
- Structured + unstructured sources: CSV files, SQL databases, and web search results (PDF support coming soon)
- Declarative table definitions in Python
- Output compatible with any SQL query engine (DuckDB, Apache DataFusion) or ingestible into any database
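As an example of that last point, the materialized table can be handed straight to an engine like DuckDB. The duckdb calls below are its standard Python API; the to_pandas() conversion on the SwellDB side is again a placeholder name.

    import duckdb

    # Assuming `cities` from the sketch above can be converted to a pandas
    # DataFrame (placeholder method). DuckDB can then query the local
    # DataFrame by its variable name.
    df = cities.to_pandas()
    print(
        duckdb.sql(
            "SELECT country, COUNT(*) AS n FROM df GROUP BY country ORDER BY n DESC"
        ).df()
    )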
Repo: https://github.com/SwellDB/SwellDB
Short paper (4 pages): https://github.com/gsvic/gsvic.github.io/blob/gh-pages/paper...
Would love feedback if you get a chance to try it out, especially from folks dealing with hybrid or messy data sources.