Hi HN, we are the authors of MiRAGE.
We built this because standard RAG benchmarks (like Natural Questions) rely on text-only Wikipedia-like data, which doesn't reflect the reality of enterprise RAG. In the real world, "truth" is often locked in a chart, a complex table, or a diagram deep inside a PDF.
MiRAGE is an open-source framework that uses a swarm of specialized agents to reverse-engineer evaluation datasets from your own documents.
How it works:
1. Ingest: It uses vision models to describe charts/tables and "semantically chunk" the PDF.
2. Generate: An agent swarm (Generator, Retriever, Persona-Injector) creates multi-hop questions.
3. Verify: An adversarial "Verifier Agent" fact-checks the answers against the source to prevent hallucinated ground truth.
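To make the middle step concrete, here is a rough sketch of what multi-hop question generation over two chunks looks like. This is illustrative Python, not MiRAGE's actual API; the model choice, prompts, and the llm() helper are all placeholders.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set; any chat model works

    def llm(prompt: str) -> str:
        # Single chat-completion call; a stand-in, not MiRAGE's internals.
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def generate_qa(chunks: list[str]) -> dict:
        # "Multi-hop" here means the question must combine facts from more
        # than one chunk (e.g. a chart description plus a paragraph).
        context = "\n\n".join(chunks)
        question = llm(
            "Write one question that can only be answered by combining "
            "facts from BOTH of these passages:\n\n" + context
        )
        answer = llm(
            "Answer strictly from these passages:\n\n" + context
            + "\n\nQuestion: " + question
        )
        return {"question": question, "answer": answer, "source": chunks}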
Key Finding: In our ablation studies, removing the adversarial verifier dropped the faithfulness of the generated dataset from 97% to 74%. Synthetic data needs self-verification.
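The verifier itself is conceptually a second, adversarial pass over each generated pair, along the lines of the sketch below (again illustrative, reusing the llm() helper from the snippet above; the real prompts and filtering logic live in the repo).

    def verify(qa: dict) -> bool:
        # Adversarial check: a separate call tries to find any claim in the
        # answer that is NOT supported by the source chunks.
        passages = "\n\n".join(qa["source"])
        verdict = llm(
            "You are a strict fact-checker. Reply SUPPORTED only if every "
            "claim in the answer appears in the passages; otherwise reply "
            "UNSUPPORTED.\n\nPassages:\n" + passages
            + "\n\nQ: " + qa["question"] + "\nA: " + qa["answer"]
        )
        return verdict.strip().upper().startswith("SUPPORTED")

    def filter_dataset(candidates: list[dict]) -> list[dict]:
        # Drop any QA pair the verifier can't ground in the source chunks.
        return [qa for qa in candidates if verify(qa)]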
Resources:
- Paper (arXiv): https://arxiv.org/abs/2601.15487
- Install: pip install mirage-benchmark
- Demo: see the terminal video in the repo
We’d love your feedback, especially on the "Visual Grounding" challenge, which is still the hardest part of multimodal RAG. Happy to answer any questions!