
RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs

About

Large language models (LLMs) typically utilize the top-k contexts from a retriever in retrieval-augmented generation (RAG). In this work, we propose RankRAG, a novel instruction fine-tuning framework that instruction-tunes a single LLM for the dual purpose of context ranking and answer generation in RAG. In particular, the instruction-tuned LLMs work surprisingly well once a small fraction of ranking data is added to the training blend, outperforming existing expert ranking models, including the same LLM fine-tuned exclusively on a large amount of ranking data. For generation, we compare our model with many strong baselines, including GPT-4-0613, GPT-4-turbo-2024-04-09, and ChatQA-1.5, an open-source model with state-of-the-art performance on RAG benchmarks. Specifically, our Llama3-RankRAG significantly outperforms Llama3-ChatQA-1.5 and the GPT-4 models on nine knowledge-intensive benchmarks. In addition, it performs comparably to GPT-4 on five RAG benchmarks in the biomedical domain without instruction fine-tuning on biomedical data, demonstrating its superb capability for generalization to new domains.
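The pipeline the abstract describes, retrieve candidate contexts, rerank them, then generate an answer conditioned on the survivors, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy corpus, word-overlap scoring, and prompt format are stand-ins, and in RankRAG the relevance scoring and answer generation are both performed by the same instruction-tuned LLM rather than by the mock functions used here.

```python
def tokens(text):
    """Crude tokenizer used by both mock scoring functions."""
    return set(text.lower().replace(".", " ").replace("?", " ").split())

def retrieve(query, corpus, top_k=4):
    """Stage 1: a retriever returns the top-k candidate contexts."""
    def overlap(doc):
        return len(tokens(query) & tokens(doc))
    return sorted(corpus, key=overlap, reverse=True)[:top_k]

def rerank(query, contexts, keep=2):
    """Stage 2: score each candidate's relevance to the query and keep the best.
    In RankRAG this scoring is done by the instruction-tuned LLM itself;
    here it is mocked with normalized word overlap."""
    def relevance(context):
        c = tokens(context)
        return len(tokens(query) & c) / max(len(c), 1)
    return sorted(contexts, key=relevance, reverse=True)[:keep]

def build_generation_prompt(query, contexts):
    """Stage 3: the generator answers conditioned on the reranked contexts."""
    ctx = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return f"Contexts:\n{ctx}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "Paris is the capital of France.",
    "The Eiffel Tower is in Paris.",
    "Berlin is the capital of Germany.",
    "France borders Spain and Italy.",
]
query = "What is the capital of France?"
candidates = retrieve(query, corpus)
best = rerank(query, candidates)
prompt = build_generation_prompt(query, best)
print(prompt)
```

The point of the design is that reranking narrows the generator's input to the few most relevant passages, rather than conditioning on everything the retriever returns.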

Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, Bryan Catanzaro • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Question Answering | ARC Challenge | Accuracy | 70.6 | 906 |
| Multi-hop Question Answering | 2WikiMultihopQA | EM | 38.2 | 387 |
| Question Answering | OBQA | Accuracy | 87.5 | 300 |
| Multi-hop Question Answering | HotpotQA | F1 Score | 63.6 | 294 |
| Question Answering | PopQA | Accuracy | 66.1 | 186 |
| Question Answering | 2Wiki | F1 | 60 | 152 |
| Multi-hop Question Answering | 2Wiki | -- | -- | 152 |
| Question Answering | PubMedQA | Accuracy | 79.8 | 145 |
| Question Answering | HotpotQA | F1 | 55.4 | 128 |
| Question Answering | PubMedQA (test) | Accuracy | 65 | 128 |

Showing 10 of 24 rows.
