This post argues that production RAG should be evaluated as set consumption, not as a user scrolling a ranked list. Classic IR metrics (nDCG / MAP / MRR) assume a human eyeball stepping through ranked positions with a monotone position discount, which doesn't match how an LLM ingests a fixed top-K context all at once.
I propose a small family of set-based metrics:
• RA-nWG@K – “How good is the actual top-K set we fed the LLM vs the global oracle on the labeled corpus?”
• PROC@K – pool-restricted oracle ceiling: “How good could we have done with this retrieval pool if selection were perfect?”
• %PROC@K – reranker/selection efficiency: “Given that ceiling, how much did our actual top-K realize?”
The goal is to cleanly separate retrieval quality from reranking headroom instead of squinting at one nDCG number.
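Roughly, in code (a minimal sketch of how I compute these, not a polished implementation): gain here is just the graded relevance label of a chunk, the global oracle is the best K labels over the whole labeled corpus, and names like rel, pool_ids, and topk_ids are illustrative placeholders for a single query's labels, the first-stage retrieval pool, and the set actually fed to the LLM.

    # Sketch of RA-nWG@K, PROC@K, %PROC@K for one query.
    # rel: dict of chunk_id -> graded relevance label (the labeled corpus for this query).
    # pool_ids: candidates surfaced by first-stage retrieval.
    # topk_ids: the K chunks actually placed in the LLM context.

    def set_gain(ids, rel):
        # Total gain of a set of chunk ids (order-free, no position discount).
        return sum(rel.get(i, 0.0) for i in ids)

    def oracle_gain(candidate_ids, rel, k):
        # Best achievable gain when picking k items from candidate_ids.
        top = sorted((rel.get(i, 0.0) for i in candidate_ids), reverse=True)[:k]
        return sum(top)

    def ra_nwg_at_k(topk_ids, rel, k):
        # Actual top-K set vs. the global oracle over the labeled corpus.
        ideal = oracle_gain(rel.keys(), rel, k)
        return set_gain(list(topk_ids)[:k], rel) / ideal if ideal > 0 else 0.0

    def proc_at_k(pool_ids, rel, k):
        # Pool-restricted oracle ceiling: perfect selection from this pool,
        # normalized by the same global oracle.
        ideal = oracle_gain(rel.keys(), rel, k)
        return oracle_gain(pool_ids, rel, k) / ideal if ideal > 0 else 0.0

    def pct_proc_at_k(topk_ids, pool_ids, rel, k):
        # Selection/reranker efficiency: fraction of the pool ceiling realized.
        ceiling = proc_at_k(pool_ids, rel, k)
        return ra_nwg_at_k(topk_ids, rel, k) / ceiling if ceiling > 0 else 0.0

    # Toy example: retrieval missed the best chunk (d1), and the reranker
    # then picked the weaker half of the pool.
    rel = {"d1": 3, "d2": 2, "d3": 1, "d4": 0}
    pool = ["d2", "d3", "d4"]
    topk = ["d3", "d4"]
    print(ra_nwg_at_k(topk, rel, 2))              # 0.2  -> overall set quality
    print(proc_at_k(pool, rel, 2))                # 0.6  -> retrieval ceiling
    print(pct_proc_at_k(topk, pool, rel, 2))      # 0.33 -> selection efficiency

Because both RA-nWG@K and PROC@K share the same global normalizer, %PROC@K cancels it and reduces to actual set gain over the pool-oracle gain, which is why it isolates reranking/selection loss from retrieval loss.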
I’m actively refining this; if you see flaws, better decompositions, or edge cases where this breaks, I’d really like to hear them.