- no labelled data at all
- lack coverage and diversity in existing data
- Data collection and annotation processes are slow and boring
- Not enough examples to fine-tune or evaluate LLMs…
So I built datafast, an open-source library for synthetic text datasets generation.
Right now it supports 5 datasets types:
- Text Classification Dataset
- Raw Text Generation Dataset
- Instruction Dataset (Ultrachat-like)
- Multiple Choice Question (MCQ) Dataset
- Preference Dataset
And more to come.
Currently supported LLM providers for generation are:
- OpenAI
- Anthropic
- Google Gemini
- Ollama (local LLM server)
There is more to come but I am not in a rush for features. I seek data quality, data diversity and reliability over quantity. I don't measure success by shipping more features: I succeed if it works when you try it out, and if you actually use it.
Hope you like that!