LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data

About

Despite the rapid development of long-context large language models (LLMs), data-centric approaches that rely on synthetic data have been hindered by faithfulness issues, limiting their effectiveness at improving model performance on tasks such as long-context reasoning and question answering (QA). These challenges are often exacerbated by misinformation arising from a lack of verification, reasoning without attribution, and potential knowledge conflicts. We propose LongFaith, a novel pipeline for synthesizing faithful long-context reasoning instruction datasets. By integrating ground truth and citation-based reasoning prompts, we eliminate distractions and improve the accuracy of reasoning chains, avoiding the need for costly verification processes. We open-source two synthesized datasets, LongFaith-SFT and LongFaith-PO, which systematically address multiple dimensions of faithfulness, including verified reasoning, attribution, and contextual grounding. Extensive experiments on multi-hop reasoning datasets and LongBench demonstrate that models fine-tuned on these datasets achieve significant performance gains. Our ablation studies highlight the scalability and adaptability of the LongFaith pipeline, demonstrating its broad applicability to the development of long-context LLMs.

Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Shengjie Ma, Aofan Liu, Hui Xiong, Jian Guo • 2025
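
The abstract's core idea, feeding the ground-truth answer and a citation-requiring instruction into the synthesis prompt so generated reasoning chains stay grounded and attributable, can be sketched roughly as below. Everything here (the Passage type, build_faithful_prompt, the [pid] citation format, and the prompt wording) is an illustrative assumption, not the released LongFaith pipeline.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    pid: int   # passage identifier the model must cite
    text: str


def build_faithful_prompt(question: str, passages: list[Passage], gold_answer: str) -> str:
    """Assemble a synthesis prompt that (a) includes the ground-truth answer,
    keeping the generated reasoning chain on target, and (b) demands a
    [pid] citation for every claim, making each step attributable to context.
    """
    context = "\n".join(f"[{p.pid}] {p.text}" for p in passages)
    return (
        f"Context passages:\n{context}\n\n"
        f"Question: {question}\n"
        f"Ground-truth answer: {gold_answer}\n\n"
        "Write a step-by-step reasoning chain that derives the ground-truth "
        "answer strictly from the passages, citing the supporting passage "
        "as [pid] after every claim, then state the final answer."
    )


# Hypothetical two-hop example in the spirit of multi-hop QA data:
prompt = build_faithful_prompt(
    question="Where was the director of Inception born?",
    passages=[
        Passage(1, "Inception was directed by Christopher Nolan."),
        Passage(2, "Christopher Nolan was born in London."),
    ],
    gold_answer="London",
)
print(prompt)
```

A preference dataset like LongFaith-PO could then contrast chains produced with and without such grounding, though the exact pairing scheme is the paper's and is not shown here.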

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Long-context Question Answering | 2WikiMQA | SubEM | 78 | 36 |
| Long-context Question Answering | NarrativeQA | SubEM | 20 | 36 |
| Long-context Question Answering | En.QA | SubEM | 34.76 | 36 |
| Long-context Question Answering | MFQA-en | SubEM | 24.67 | 36 |
| Long-context Understanding | MuSiQue | SubEM | 49.33 | 27 |
| Long-context Question Answering | MuSiQue | F1 Score | 46.35 | 19 |
| Long-context Understanding | Average Overall | SubEM | 34.85 | 18 |
| Long-context Understanding | LV-Eval 16k | SubEM | 38.44 | 9 |
| Long-context Understanding | Average (MuSiQue, 2WikiMQA, MFQA-en, NarrativeQA, En.QA) | SubEM Score | 40.89 | 9 |
| Long-context Understanding | LV-Eval 32k | SubEM | 35.83 | 9 |
(Showing 10 of 12 rows.)
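
SubEM in the table above is commonly computed as substring exact match: a prediction scores 1 if a gold answer appears as a substring of the (normalized) model output, and 0 otherwise. The helper below is a minimal sketch under that assumption; the normalization steps are placeholders, and each benchmark's official scorer may differ in detail.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace.

    Note: these normalization choices are assumptions, not any
    benchmark's official preprocessing.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def sub_em(prediction: str, gold_answers: list[str]) -> float:
    """Substring exact match: 1.0 if any gold answer occurs as a
    substring of the normalized prediction, else 0.0."""
    pred = normalize(prediction)
    return float(any(normalize(g) in pred for g in gold_answers))


# A verbose but correct answer still scores 1.0 under SubEM:
print(sub_em("The film was directed by Christopher Nolan.", ["Christopher Nolan"]))  # 1.0
```

This leniency toward verbose answers is why SubEM is often preferred over strict exact match for long-context QA, where models tend to answer in full sentences.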
