
Large Language Models Can Self-Improve in Long-context Reasoning

About

Large language models (LLMs) have achieved substantial progress in processing long contexts but still struggle with long-context reasoning. Existing approaches typically involve fine-tuning LLMs with synthetic data, which depends on annotations from human experts or advanced models like GPT-4, thus restricting further advancements. To address this issue, we investigate the potential for LLMs to self-improve in long-context reasoning and propose an approach specifically designed for this purpose. The approach is straightforward: we sample multiple outputs for each question, score them with Minimum Bayes Risk, and then apply supervised fine-tuning or preference optimization based on these outputs. Extensive experiments on several leading LLMs demonstrate the effectiveness of our approach, with an absolute improvement of 4.2 points for Llama-3.1-8B-Instruct. Furthermore, our approach achieves superior performance compared to prior approaches that depend on data produced by human experts or advanced models. We anticipate that this work will open new avenues for self-improvement techniques in long-context scenarios, which are essential for the continual advancement of LLMs.

Siheng Li, Cheng Yang, Zesen Cheng, Lemao Liu, Mo Yu, Yujiu Yang, Wai Lam• 2024
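The sample-then-score step described in the abstract can be sketched in a few lines: each sampled output is scored by its average similarity to the other samples (the Minimum Bayes Risk criterion), and the highest-scoring output becomes the positive example for fine-tuning or preference optimization. This is a minimal illustration, not the paper's implementation; the `token_overlap` similarity below is a toy stand-in for whatever consistency metric the method actually uses.

```python
def mbr_score(candidates, similarity):
    """Score each candidate by its mean similarity to all other candidates
    (Minimum Bayes Risk selection: prefer outputs most other samples agree with)."""
    scores = []
    for i, ci in enumerate(candidates):
        others = [similarity(ci, cj) for j, cj in enumerate(candidates) if j != i]
        scores.append(sum(others) / len(others))
    return scores

def token_overlap(a, b):
    # Toy similarity: Jaccard overlap of token sets (an assumption for
    # illustration; the paper may use a different similarity function).
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Three sampled answers to one question; the two agreeing ones score higher.
candidates = ["the answer is 42", "the answer is 42 indeed", "it is 7"]
scores = mbr_score(candidates, token_overlap)
best = candidates[max(range(len(scores)), key=scores.__getitem__)]
worst = candidates[min(range(len(scores)), key=scores.__getitem__)]
# `best` would serve as the SFT target, or (`best`, `worst`) as a preference pair.
```

In the preference-optimization variant, the highest- and lowest-scoring outputs can form a chosen/rejected pair, so the model improves using only its own samples.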

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Multi-modal Question Answering | MedXpertQA-MM | Accuracy | 6.51 | 38 |
| Long-context Question Answering | En.QA | SubEM | 36.47 | 36 |
| Long-context Question Answering | MFQA en | SubEM | 27.33 | 36 |
| Long-context Question Answering | 2WikiMQA | SubEM | 77 | 36 |
| Long-context Question Answering | NarrativeQA | SubEM | 19 | 36 |
| Long-context Understanding | MuSiQue | SubEM | 46.5 | 27 |
| General Knowledge | MMLU | pass@1 | 69.36 | 22 |
| Long-context Question Answering | MuSiQue | F1 Score | 44.11 | 19 |
| Long-context Understanding | Average Overall | SubEM | 36.71 | 18 |
| Expert knowledge QA | GPQA | Pass@1 | 18.69 | 12 |

Showing 10 of 17 rows.
