LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data
About
Despite the growing development of long-context large language models (LLMs), data-centric approaches relying on synthetic data have been hindered by issues related to faithfulness, which limit their effectiveness in enhancing model performance on tasks such as long-context reasoning and question answering (QA). These challenges are often exacerbated by misinformation caused by a lack of verification, reasoning without attribution, and potential knowledge conflicts. We propose LongFaith, a novel pipeline for synthesizing faithful long-context reasoning instruction datasets. By integrating ground truth and citation-based reasoning prompts, we eliminate distractions and improve the accuracy of reasoning chains, thus mitigating the need for costly verification processes. We open-source two synthesized datasets, LongFaith-SFT and LongFaith-PO, which systematically address multiple dimensions of faithfulness, including verified reasoning, attribution, and contextual grounding. Extensive experiments on multi-hop reasoning datasets and LongBench demonstrate that models fine-tuned on these datasets significantly improve performance. Our ablation studies highlight the scalability and adaptability of the LongFaith pipeline, showcasing its broad applicability in developing long-context LLMs.
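The citation-based reasoning idea above can be illustrated with a minimal sketch: number each context passage and instruct the model to cite passage IDs in every reasoning step, so that attribution is grounded in the provided context. The function name and template wording below are illustrative assumptions, not the authors' exact pipeline.

```python
# Hypothetical sketch of a citation-grounded prompt builder in the spirit of
# LongFaith; the template text and function name are assumptions.

def build_citation_prompt(passages, question):
    """Number each passage so reasoning steps can cite sources like [2]."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the passages below. "
        "After each reasoning step, cite the supporting passage, e.g. [2].\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Reasoning:"
    )

prompt = build_citation_prompt(
    ["Paris is the capital of France.", "France is in Europe."],
    "On which continent is the capital of France located?",
)
print(prompt)
```

Responses generated from such prompts carry explicit attributions, which is what lets the pipeline keep accurate reasoning chains without a separate, costly verification pass.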
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long-context Question Answering | 2WikiMQA | SubEM | 78 | 36 |
| Long-context Question Answering | NarrativeQA | SubEM | 20 | 36 |
| Long-context Question Answering | En.QA | SubEM | 34.76 | 36 |
| Long-context Question Answering | MFQA-En | SubEM | 24.67 | 36 |
| Long-context Understanding | MuSiQue | SubEM | 49.33 | 27 |
| Long-context Question Answering | MuSiQue | F1 Score | 46.35 | 19 |
| Long-context Understanding | Average (overall) | SubEM | 34.85 | 18 |
| Long-context Understanding | LV-Eval 16k | SubEM | 38.44 | 9 |
| Long-context Understanding | Average (MuSiQue, 2WikiMQA, MFQA-En, NarrativeQA, En.QA) | SubEM | 40.89 | 9 |
| Long-context Understanding | LV-Eval 32k | SubEM | 35.83 | 9 |