Most existing benchmarks focus on synthetic or short-form QA data. That didn’t reflect what we were seeing in production, so we built our own to test our hallucination detectors and decided to open-source it.
The dataset includes 6,500 examples across QA, summarization, and NLI tasks. We added distractor documents, shuffled the context, and removed assumptions about format (like requiring a question) to better reflect real-world conditions.
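If you want to poke around, the dataset loads straight from the Hugging Face Hub with the `datasets` library. This is just a quick inspection sketch; the split and column names are whatever the dataset card defines, so the code prints them rather than assuming a schema:

    from datasets import load_dataset

    # Pull HalluMix from the Hugging Face Hub; printing the DatasetDict shows
    # the available splits, example counts, and column names
    ds = load_dataset("quotient-ai/hallumix")
    print(ds)

    # Peek at one example from the first split and list its fields,
    # rather than hard-coding field names that may differ
    first_split = next(iter(ds.values()))
    print(first_split[0].keys())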
We ran 7 detection systems on it, covering both open-source models and commercial APIs. While some performed well on shorter examples, even the best struggled with long-form content and multi-document grounding -- precisely where hallucinations tend to be most harmful.
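For anyone who wants to benchmark their own detector, the scoring itself is just binary classification against the dataset's labels. A minimal sketch, assuming each example carries the context documents, a candidate response, and a binary hallucination label (the field names "documents", "response", and "is_hallucination" below are placeholders, as is `my_detector` -- swap in your own system and the real schema from the dataset card):

    from datasets import load_dataset
    from sklearn.metrics import precision_recall_fscore_support

    def my_detector(documents, response) -> bool:
        # Hypothetical stand-in for an actual detection system:
        # return True if the response is judged hallucinated
        return False

    ds = load_dataset("quotient-ai/hallumix")
    split = next(iter(ds.values()))

    y_true, y_pred = [], []
    for ex in split:
        # Assumed field names, for illustration only
        y_true.append(bool(ex["is_hallucination"]))
        y_pred.append(my_detector(ex["documents"], ex["response"]))

    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0
    )
    print(f"precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")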
Would love feedback, especially from anyone working on evals, hallucination detection, or RAG.
Links:
- HF Dataset: https://huggingface.co/datasets/quotient-ai/hallumix
- HF Blog: https://huggingface.co/blog/quotientai/hallumix
- Internal Blog: https://blog.quotientai.co/introducing-hallumix-a-task-agnos...
- Paper: https://arxiv.org/abs/2505.00506