Traditional RAG works fine until you ask questions like:

- "Who was born before 1800?"
- "How many are mathematicians?"
- "List names and birthdays for mathematicians"

These questions get incomplete answers because retrieval returns only the top-k chunks, and nothing in the answer signals that it is incomplete.

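
The truncation can be reproduced with a toy example. The sketch below is only an illustration: the five-person corpus, the `top_k` helper, and the term-overlap scoring are assumptions standing in for a real embedding search, not part of DuoRAG.

```python
# Toy corpus (illustrative data): one short "document" per person.
PEOPLE = [
    ("Leonhard Euler", "mathematician", 1707),
    ("Joseph-Louis Lagrange", "mathematician", 1736),
    ("Carl Friedrich Gauss", "mathematician", 1777),
    ("Marie Curie", "physicist", 1867),
    ("Emmy Noether", "mathematician", 1882),
]
docs = [
    {"name": n, "field": f, "born": b, "text": f"{n} was a {f} born in {b}."}
    for n, f, b in PEOPLE
]


def top_k(query_terms, documents, k):
    """Hypothetical stand-in for vector search: rank by term overlap, keep k."""
    def score(doc):
        words = set(doc["text"].lower().replace(".", "").split())
        return len(words & query_terms)
    return sorted(documents, key=score, reverse=True)[:k]


# "How many are mathematicians?" answered from the retrieved chunks only.
retrieved = top_k({"mathematician"}, docs, k=2)
rag_count = sum("mathematician" in d["text"] for d in retrieved)

# The same question answered from a structured field over the whole corpus.
true_count = sum(d["field"] == "mathematician" for d in docs)

print(rag_count)   # 2: capped by k, with nothing to indicate the cap
print(true_count)  # 4
```

Raising `k` only moves the cutoff; an aggregate question stays wrong whenever the number of matching documents exceeds `k`.
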
For an initial corpus, this problem can be mitigated by extracting metadata for a predetermined set of fields. This approach has two problems:

- One has to predict upfront every question that can be asked against the corpus.
- That prediction has to be revised constantly as the documents change, e.g. when Nobel prizes are added later or the document set is extended to include artists.

DuoRAG aims to solve both problems with:

- An initial metadata (schema) discovery step before the first ingestion
- A schema that updates itself with candidate fields whenever it fails to answer a question
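
The self-update idea can be sketched in a few lines. Everything below is hypothetical and is not DuoRAG's actual implementation: the `Schema` class, `try_answer`, and `promote` are illustrative names, and the set of fields a question needs is passed in by hand, whereas a real system would have to infer it from the question.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    # Fields already extracted for every document at ingestion time.
    fields: set = field(default_factory=set)
    # Fields proposed after a question could not be answered.
    candidates: set = field(default_factory=set)


def try_answer(schema, needed_fields):
    """Return True if the schema covers the question; otherwise record the gap."""
    gap = set(needed_fields) - schema.fields
    if gap:
        schema.candidates |= gap  # remember what was missing
        return False
    return True


def promote(schema):
    """Accept all candidate fields (a real system would re-extract them here)."""
    schema.fields |= schema.candidates
    schema.candidates.clear()


# Schema found by the initial discovery step (assumed field names).
schema = Schema(fields={"name", "birth_year", "profession"})

first_try = try_answer(schema, {"name", "nobel_prize"})   # fails, records gap
pending = set(schema.candidates)                          # {"nobel_prize"}
promote(schema)
second_try = try_answer(schema, {"name", "nobel_prize"})  # now covered
```

The point of the sketch is the control flow: a failed question leaves a trace in the schema instead of silently returning a partial answer.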