RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs
About
Large language models (LLMs) typically utilize the top-k contexts from a retriever in retrieval-augmented generation (RAG). In this work, we propose a novel instruction fine-tuning framework, RankRAG, which instruction-tunes a single LLM for the dual purpose of context ranking and answer generation in RAG. In particular, the instruction-tuned LLMs work surprisingly well when a small fraction of ranking data is added to the training blend, outperforming existing expert ranking models, including the same LLM exclusively fine-tuned on a large amount of ranking data. For generation, we compare our model with many strong baselines, including GPT-4-0613, GPT-4-turbo-2024-04-09, and ChatQA-1.5, an open-source model with state-of-the-art performance on RAG benchmarks. Specifically, our Llama3-RankRAG significantly outperforms Llama3-ChatQA-1.5 and the GPT-4 models on nine knowledge-intensive benchmarks. In addition, it performs comparably to GPT-4 on five RAG benchmarks in the biomedical domain without instruction fine-tuning on biomedical data, demonstrating its strong generalization to new domains.
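The inference flow described above (retrieve top-k contexts, rerank them with the same instruction-tuned LLM, then generate from the highest-ranked few) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `llm_score` and `llm_generate` are hypothetical stand-ins (token overlap and a template) for the actual LLM ranking and generation calls.

```python
def llm_score(question: str, context: str) -> float:
    """Stand-in for the LLM's relevance judgment on a (question, context) pair.
    In RankRAG this would be the same instruction-tuned LLM scoring relevance."""
    q_tokens = set(question.lower().split())
    c_tokens = set(context.lower().split())
    return len(q_tokens & c_tokens) / max(len(q_tokens), 1)

def llm_generate(question: str, contexts: list[str]) -> str:
    """Stand-in for answer generation conditioned on the reranked contexts."""
    return f"Answer to {question!r} using {len(contexts)} contexts"

def rank_rag(question: str, retrieved: list[str], rerank_top: int = 2) -> str:
    # Step 1: rerank the retrieved contexts with the (same) LLM's scores.
    reranked = sorted(retrieved, key=lambda c: llm_score(question, c), reverse=True)
    # Step 2: generate the answer from only the top-ranked contexts.
    return llm_generate(question, reranked[:rerank_top])

docs = ["paris is the capital of france",
        "bananas are yellow",
        "france is in western europe"]
print(rank_rag("what is the capital of france", docs))
```

The key design point is that ranking and generation share one model, so adding a small amount of ranking data to the instruction-tuning blend teaches both behaviors at once.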
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Question Answering | ARC Challenge | Accuracy | 70.6 | 749 |
| Multi-hop Question Answering | 2WikiMultihopQA | EM | 38.2 | 278 |
| Question Answering | OBQA | Accuracy | 87.5 | 276 |
| Multi-hop Question Answering | HotpotQA | F1 Score | 63.6 | 221 |
| Question Answering | PopQA | Accuracy | 66.1 | 186 |
| Question Answering | PubMedQA | Accuracy | 79.8 | 145 |
| Question Answering | TriviaQA | Accuracy | 92.3 | 85 |
| Question Answering | 2Wiki | F1 | 60 | 75 |
| Question Answering | ARC-C | Accuracy | 0.696 | 68 |
| Fact Verification | FEVER | Accuracy | 0.938 | 67 |