Inference Scaling for Long-Context Retrieval Augmented Generation

About

The scaling of inference computation has unlocked the potential of long-context large language models (LLMs) across diverse settings. For knowledge-intensive tasks, the increased compute is often allocated to incorporate more external knowledge. However, without effectively utilizing such knowledge, solely expanding context does not always enhance performance. In this work, we investigate inference scaling for retrieval augmented generation (RAG), exploring the combination of multiple strategies beyond simply increasing the quantity of knowledge, including in-context learning and iterative prompting. These strategies provide additional flexibility to scale test-time computation (e.g., by increasing retrieved documents or generation steps), thereby enhancing LLMs' ability to effectively acquire and utilize contextual information. We address two key questions: (1) How does RAG performance benefit from the scaling of inference computation when optimally configured? (2) Can we predict the optimal test-time compute allocation for a given budget by modeling the relationship between RAG performance and inference parameters? Our observations reveal that increasing inference computation leads to nearly linear gains in RAG performance when optimally allocated, a relationship we describe as the inference scaling laws for RAG. Building on this, we further develop the computation allocation model to estimate RAG performance across different inference configurations. The model predicts optimal inference parameters under various computation constraints, which align closely with the experimental results. By applying these optimal configurations, we demonstrate that scaling inference compute on long-context LLMs achieves up to 58.9% gains on benchmark datasets compared to standard RAG.

Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky• 2024

Related benchmarks

Task	Dataset	Result
Question Answering	ARC Challenge	Accuracy71.3	906
Question Answering	OBQA	Accuracy86.6	347
Multi-hop Question Answering	HotpotQA (test)	F134.83	311
Multi-hop Question Answering	HotpotQA	F1 Score64.3	294
Question Answering	2Wiki	--	241
Multi-hop Question Answering	2Wiki	Exact Match44.3	215
Question Answering	PopQA	Accuracy66.5	186
Question Answering	ARC-C	Accuracy0.704	116
Multi-hop Question Answering	Bamboogle (test)	EM23.44	98
Question Answering	TQA	Accuracy71.9	80

Showing 10 of 20 rows

Other info

Follow for update

@wizwand_team Discord