Evaluating the Utility of Grounding Documents with Reference-Free LLM-based Metrics
About
The success of Retrieval-Augmented Generation (RAG) depends on the utility the LLM derives from the content used for grounding. Content utility has no definitive specification, and existing metrics either ignore model-specific capabilities or rely on costly annotations. In this paper, we propose Grounding Generation Utility (GroGU), a model-specific and reference-free metric that defines utility as a function of the downstream LLM's entropy-based generation confidence. Despite requiring no annotations, GroGU is largely faithful in distinguishing ground-truth documents while capturing nuances ignored by LLM-agnostic metrics. We apply GroGU to train a query rewriter for RAG by identifying high-utility preference data for Direct Preference Optimization. Experiments show improvements of up to 18.2 points in Mean Reciprocal Rank and up to 9.4 points in answer accuracy.
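The core idea of an entropy-based confidence metric can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`token_entropy`, `grounding_utility`) and the exact aggregation (mean per-token entropy, negated so that higher means more useful) are assumptions for illustration only.

```python
import math

def token_entropy(logits):
    # Softmax over the vocabulary, then Shannon entropy (in nats).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def grounding_utility(per_token_logits):
    # Hypothetical GroGU-style score: lower mean entropy across the
    # generated tokens means higher model confidence, which we take
    # as higher utility of the grounding document.
    entropies = [token_entropy(step) for step in per_token_logits]
    return -sum(entropies) / len(entropies)

# A document that leaves the model confident (peaked logits) should
# score higher than one that leaves it uncertain (flat logits).
confident = [[8.0, 0.0, 0.0], [9.0, 1.0, 0.0]]
uncertain = [[1.0, 1.0, 1.0], [0.5, 0.4, 0.6]]
```

In practice the per-token logits would come from the downstream LLM conditioned on the query plus the candidate grounding document, so the score is inherently model-specific.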
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Conversational Query Retrieval | TopiOCQA | MRR | 38.1 | 20 |
| Conversational Query Retrieval | QReCC | MRR | 45.9 | 20 |
| Conversational Information Retrieval | TopiOCQA (test) | R@10 | 61.7 | 13 |
| Conversational Information Retrieval | QReCC (test) | R@10 | 67.2 | 13 |
| Conversational Question Answering | QReCC (test) | EM (%) | 12.0 | 12 |
| Conversational Question Answering | TopiOCQA (test) | EM | 20.8 | 12 |