I built QGen, a tool that extracts structured Q&A datasets from documents using RAG (retrieval-augmented generation). I’d love feedback from anyone working with ML, document processing, or AI pipelines.
Problem
Turning PDFs, Word docs, and other unstructured files into high-quality Q&A pairs for model training is slow and error-prone. QGen automates this process, making it fast and scalable.
How It Works
- Document ingestion – PDF, Word, Excel, PPT, OCR
- Embedding & retrieval – semantic search over chunks
- Q&A generation – LLM generates and filters candidate pairs
- Quality scoring – four-dimensional metrics, including relevance, coverage, and consistency
- Export / API – JSON, CSV, SQL, XML; on-prem or cloud deployment
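To make the steps above concrete, here is a minimal, stdlib-only Python sketch of the same flow: chunk, retrieve, generate, score, export. It is not QGen's actual code. Every function name here is hypothetical, the bag-of-words vector is a toy stand-in for a real embedding model, `generate_pair` is a placeholder for the LLM call, and the single `relevance` score only illustrates one possible quality dimension.

```python
import json
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def bow_vector(text):
    """Toy stand-in for an embedding: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=1):
    """Return the top_k chunks most similar to the query."""
    q = bow_vector(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bow_vector(c)), reverse=True)
    return ranked[:top_k]


def generate_pair(question, context):
    """Placeholder for the LLM step: the 'answer' is just the retrieved chunk."""
    return {"question": question, "answer": context}


def relevance_score(pair):
    """One illustrative quality dimension: lexical overlap of question and answer."""
    return cosine(bow_vector(pair["question"]), bow_vector(pair["answer"]))


def build_dataset(text, questions, max_words=40, min_score=0.1):
    """Run the whole sketch: chunk, retrieve, generate, score, filter."""
    chunks = chunk_text(text, max_words)
    dataset = []
    for question in questions:
        best = retrieve(question, chunks, top_k=1)
        if not best:
            continue  # empty document, nothing to retrieve
        pair = generate_pair(question, best[0])
        pair["relevance"] = round(relevance_score(pair), 3)
        if pair["relevance"] >= min_score:  # drop pairs with no real overlap
            dataset.append(pair)
    return dataset


if __name__ == "__main__":
    doc = ("The warranty period for the device is two years. "
           "Returns are accepted within thirty days of purchase.")
    pairs = build_dataset(doc, ["How long is the warranty period?"], max_words=9)
    print(json.dumps(pairs, indent=2))  # JSON export of the filtered pairs
```

In a real pipeline the interesting trade-offs sit in the parts stubbed out here: which embedding model backs retrieval, how the LLM is prompted, and where the filtering threshold is set.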
Who It’s For
- Startups prototyping AI
- Data scientists training domain-specific models
- Enterprises processing large document sets
Early Feedback & Limitations
- Sometimes questions are too shallow
- Domain adaptation (legal, medical, research) needs tuning
- Runtime can be high for large batches
I’m especially curious about what features you’d want, what trade-offs matter most, and how you’d integrate this into your workflow.
Try It / Feedback
Comment below or email contact@qelab.org to try QGen or share your thoughts.
Disclaimer: Parts of this post were drafted and formatted with AI assistance.