RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions

About

Retrieval-Augmented Generation (RAG) has emerged as a key paradigm for enhancing large language models (LLMs) by incorporating external knowledge. However, current RAG methods face two limitations: (1) they cover only a limited range of RAG scenarios, and (2) they suffer from limited task diversity due to the lack of a general RAG dataset. To address these limitations, we propose RAG-Instruct, a general method for synthesizing diverse, high-quality RAG instruction data from any source corpus. Our approach leverages (1) five RAG paradigms, which encompass diverse query-document relationships, and (2) instruction simulation, which enhances instruction diversity and quality by drawing on the strengths of existing instruction datasets. Using this method, we construct a 40K instruction dataset from Wikipedia that comprehensively covers diverse RAG scenarios and tasks. Experiments demonstrate that RAG-Instruct effectively enhances LLMs' RAG capabilities, achieving strong zero-shot performance and significantly outperforming various RAG baselines across a diverse set of tasks. RAG-Instruct is publicly available at https://github.com/FreedomIntelligence/RAG-Instruct.
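The two ingredients named in the abstract (a query-document relationship paradigm and an exemplar drawn from an existing instruction dataset) could be combined into a synthesis prompt roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the paradigm labels are placeholders because the abstract does not name the five paradigms, and `SynthesisRequest`, `build_request`, and the prompt wording are all hypothetical.

```python
import random
from dataclasses import dataclass

# Placeholder labels (assumption): the abstract states that there are five
# RAG paradigms but does not name them, so no real names are used here.
PARADIGMS = [f"paradigm_{i}" for i in range(1, 6)]


@dataclass
class SynthesisRequest:
    """One hypothetical request to an LLM that writes a RAG instruction."""
    paradigm: str          # which query-document relationship to target
    documents: list[str]   # passages sampled from the source corpus
    exemplar: str          # instruction from an existing dataset to imitate

    def to_prompt(self) -> str:
        # Number the documents so the generated instruction can refer to them.
        docs = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(self.documents))
        return (
            f"Query-document relationship: {self.paradigm}\n"
            f"Documents:\n{docs}\n"
            f"Exemplar instruction (imitate its style and task type):\n"
            f"{self.exemplar}\n"
            "Write a new instruction and answer grounded in the documents."
        )


def build_request(corpus, exemplars, rng, num_docs=2):
    """Sample a paradigm, some documents, and an exemplar instruction."""
    return SynthesisRequest(
        paradigm=rng.choice(PARADIGMS),
        documents=rng.sample(corpus, k=min(num_docs, len(corpus))),
        exemplar=rng.choice(exemplars),
    )
```

Calling `build_request(corpus, exemplars, random.Random(0)).to_prompt()` yields one prompt string; repeating the call over a corpus would give a pool of varied synthesis prompts, which is the diversity mechanism the abstract describes at a high level.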

Wanlong Liu, Junying Chen, Ke Ji, Li Zhou, Wenyu Chen, Benyou Wang • 2024

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Question Answering	ARC	Accuracy	85.1	230
Question Answering	HotpotQA	F1	62.9	132
Question Answering	TQA	Accuracy	83.8	80
Question Answering	PopQA	Accuracy	70.4	26
Question Answering	Pub	Accuracy	82.8	22
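For readers unfamiliar with the F1 metric reported for HotpotQA, the sketch below shows the token-overlap F1 commonly used for question answering. It assumes the listed score follows that convention, which this page does not confirm, and it uses simplified normalization (lowercasing and whitespace splitting only). The function name `token_f1` is illustrative.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A dataset-level score is then typically the mean of `token_f1` over all question-answer pairs, reported as a percentage.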

