RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs

About

Large language models (LLMs) typically utilize the top-k contexts from a retriever in retrieval-augmented generation (RAG). In this work, we propose a novel instruction fine-tuning framework RankRAG, which instruction-tunes a single LLM for the dual purpose of context ranking and answer generation in RAG. In particular, the instruction-tuned LLMs work surprisingly well by adding a small fraction of ranking data into the training blend, and outperform existing expert ranking models, including the same LLM exclusively fine-tuned on a large amount of ranking data. For generation, we compare our model with many strong baselines, including GPT-4-0613, GPT-4-turbo-2024-0409, and ChatQA-1.5, an open-sourced model with the state-of-the-art performance on RAG benchmarks. Specifically, our Llama3-RankRAG significantly outperforms Llama3-ChatQA-1.5 and GPT-4 models on nine knowledge-intensive benchmarks. In addition, it also performs comparably to GPT-4 on five RAG benchmarks in the biomedical domain without instruction fine-tuning on biomedical data, demonstrating its superb capability for generalization to new domains.

Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, Bryan Catanzaro• 2024

Related benchmarks

Task	Dataset	Result
Question Answering	ARC Challenge	Accuracy70.6	906
Multi-hop Question Answering	2WikiMultihopQA	EM38.2	559
Question Answering	OBQA	Accuracy87.5	347
Multi-hop Question Answering	HotpotQA	F1 Score63.6	294
Question Answering	2Wiki	--	241
Multi-hop Question Answering	2Wiki	--	215
Question Answering	PopQA	Accuracy66.1	186
Question Answering	PubMedQA (test)	Accuracy65	170
Question Answering	PubMedQA	Accuracy79.8	145
Question Answering	HotpotQA	F155.4	132

Showing 10 of 24 rows

Other info

Follow for update

@wizwand_team Discord