Large Language Models Can Self-Improve in Long-context Reasoning
About
Large language models (LLMs) have achieved substantial progress in processing long contexts but still struggle with long-context reasoning. Existing approaches typically fine-tune LLMs on synthetic data, which depends on annotations from human experts or advanced models such as GPT-4, thus restricting further advancement. To address this issue, we investigate the potential for LLMs to self-improve in long-context reasoning and propose SEALONG, an approach specifically designed for this purpose. The approach is straightforward: we sample multiple outputs for each question, score them with Minimum Bayes Risk (MBR), and then apply supervised fine-tuning or preference optimization based on these outputs. Extensive experiments on several leading LLMs demonstrate the effectiveness of SEALONG, with an absolute improvement of 4.2 points for Llama-3.1-8B-Instruct. Furthermore, SEALONG outperforms prior approaches that depend on data produced by human experts or advanced models. We anticipate that this work will open new avenues for self-improvement techniques in long-context scenarios, which are essential for the continual advancement of LLMs.
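The sampling-and-scoring step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the token-level Jaccard similarity is an assumed stand-in for whatever consensus metric the authors use, and the sampled strings are toy data. Each candidate is scored by its average similarity to the other samples (outputs agreeing with the consensus score higher), and the top- and bottom-scored candidates form a preference pair for later optimization.

```python
# Sketch of Minimum Bayes Risk (MBR) scoring over sampled model outputs.
# NOTE: jaccard() is an illustrative similarity function, an assumption on
# our part; the paper's actual scorer may differ.

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two outputs."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def mbr_scores(candidates: list[str]) -> list[float]:
    """Score each candidate by its mean similarity to all other candidates."""
    scores = []
    for i, ci in enumerate(candidates):
        sims = [jaccard(ci, cj) for j, cj in enumerate(candidates) if j != i]
        scores.append(sum(sims) / len(sims) if sims else 0.0)
    return scores

# Toy samples standing in for multiple generations from the same prompt.
samples = [
    "the answer is paris",
    "paris is the answer",
    "the answer is london",
]
scores = mbr_scores(samples)

# Highest-scored output can seed supervised fine-tuning; the
# (chosen, rejected) pair can feed preference optimization.
chosen = samples[scores.index(max(scores))]
rejected = samples[scores.index(min(scores))]
```

With these toy samples, the two paraphrases of "paris" reinforce each other, so the outlier "london" receives the lowest score and becomes the rejected candidate.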
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-modal Question Answering | MedXpertQA-MM | Accuracy | 6.51 | 38 |
| Long-context Question Answering | En.QA | SubEM | 36.47 | 36 |
| Long-context Question Answering | MFQA en | SubEM | 27.33 | 36 |
| Long-context Question Answering | 2WikiMQA | SubEM | 77 | 36 |
| Long-context Question Answering | NarrativeQA | SubEM | 19 | 36 |
| Long-context Understanding | MuSiQue | SubEM | 46.5 | 27 |
| General Knowledge | MMLU | Pass@1 | 69.36 | 22 |
| Long-context Question Answering | MuSiQue | F1 Score | 44.11 | 19 |
| Long-context Understanding | Average Overall | SubEM | 36.71 | 18 |
| Expert Knowledge QA | GPQA | Pass@1 | 18.69 | 12 |