I started with Google Gemini 3 Flash but switched to Ollama + Ministral 3:3b. The extraction is not exhaustive and there is much to improve, but it is working.
dwata runs locally: a web backend, with the GUI running in the browser. It connects to your email accounts and downloads the messages. Then we can run the financial template detection: it looks for similar-looking emails, grouped by sender, and sends a sample from each cluster to an LLM agent. The LLM is asked to point out the parts of the text that look like the data we are after. dwata then searches the email for the variables/values the LLM returned, creates a template by replacing those values with template tags, and saves the template to the DB. When extracting, dwata parses the data from each email using its saved template.
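The template step above can be sketched in a few lines. This is my own minimal illustration, not dwata's actual code: the function names, the regex-based tags, and the sample email are all assumptions. The idea is that once the LLM has identified the values in one sample, you can turn that sample into a reusable pattern for every other email in the cluster.

```python
import re

def make_template(email_text, llm_values):
    """Replace each LLM-identified value with a named template tag (hypothetical helper)."""
    template = re.escape(email_text)
    for name, value in llm_values.items():
        template = template.replace(re.escape(value), f"(?P<{name}>.+?)")
    return template

def extract(template, email_text):
    """Parse a new email from the same cluster using the saved template."""
    m = re.search(template, email_text, re.DOTALL)
    return m.groupdict() if m else None

# A made-up sample email plus the values an LLM agent might return for it.
sample = "Your payment of $42.50 to Acme Corp was received on Jan 5."
values = {"amount": "$42.50", "vendor": "Acme Corp", "date": "Jan 5"}
tpl = make_template(sample, values)

# The same template now parses a different email from the same sender cluster.
other = "Your payment of $9.99 to Foo Inc was received on Feb 2."
print(extract(tpl, other))  # {'amount': '$9.99', 'vendor': 'Foo Inc', 'date': 'Feb 2'}
```

The point of the design is that the LLM runs once per cluster, while the cheap template match runs per email.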
Roadmap: there is a long way to go; the extractor needs to work much, much better. dwata will also work on files soon (bank/CC statements).
I want to extract vendors, businesses, contacts, events, places, etc., connect to different APIs, and process everything locally.
dwata will also be able to download and process data from the Hacker News API (or other similar sources) and extract the entities you care about.
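For context, the Hacker News API is a plain JSON-over-HTTPS Firebase API, so the download side is simple. A sketch (the helper names are mine; entity extraction would then run locally on the fetched items):

```python
import json
import urllib.request

# Official Hacker News API base (Firebase-hosted, read-only, no auth).
HN_API = "https://hacker-news.firebaseio.com/v0"

def item_url(item_id):
    # Every HN item (story, comment, job) lives at /v0/item/<id>.json
    return f"{HN_API}/item/{item_id}.json"

def fetch(url):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Grab the current top story IDs, then the first few items.
    for item_id in fetch(f"{HN_API}/topstories.json")[:3]:
        item = fetch(item_url(item_id))
        print(item.get("title"), "-", item.get("url"))
```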
Eventually, dwata will use only Ollama/llama.cpp with models that fit 6-8 GB graphics cards or 16 GB unified memory!
yubainu•2h ago