LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data
About
Despite the growing development of long-context large language models (LLMs), data-centric approaches relying on synthetic data have been hindered by issues related to faithfulness, which limit their effectiveness in enhancing model performance on tasks such as long-context reasoning and question answering (QA). These challenges are often exacerbated by misinformation caused by a lack of verification, reasoning without attribution, and potential knowledge conflicts. We propose LongFaith, a novel pipeline for synthesizing faithful long-context reasoning instruction datasets. By integrating ground truth and citation-based reasoning prompts, we eliminate distractions and improve the accuracy of reasoning chains, thus mitigating the need for costly verification processes. We open-source two synthesized datasets, LongFaith-SFT and LongFaith-PO, which systematically address multiple dimensions of faithfulness, including verified reasoning, attribution, and contextual grounding. Extensive experiments on multi-hop reasoning datasets and LongBench demonstrate that models fine-tuned on these datasets significantly improve performance. Our ablation studies highlight the scalability and adaptability of the LongFaith pipeline, showcasing its broad applicability in developing long-context LLMs.
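The citation-based reasoning idea above can be illustrated with a minimal sketch: number each context passage and instruct the model to cite passage IDs in every reasoning step, so that attribution is grounded in the provided context. The function name and template wording below are illustrative assumptions, not the authors' exact pipeline.

```python
# Hypothetical sketch of a citation-grounded prompt builder in the spirit of
# LongFaith; the template text and function name are assumptions.

def build_citation_prompt(passages, question):
    """Number each passage so reasoning steps can cite sources like [2]."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the passages below. "
        "After each reasoning step, cite the supporting passage, e.g. [2].\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Reasoning:"
    )

prompt = build_citation_prompt(
    ["Paris is the capital of France.", "France is in Europe."],
    "On which continent is the capital of France located?",
)
print(prompt)
```

Responses generated from such prompts carry explicit attributions, which is what lets the pipeline keep accurate reasoning chains without a separate, costly verification pass.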
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long-context Question Answering | 2WikiMQA | SubEM | 78 | 36 |
| Long-context Question Answering | NarrativeQA | SubEM | 20 | 36 |
| Long-context Question Answering | En.QA | SubEM | 34.76 | 36 |
| Long-context Question Answering | MFQA-En | SubEM | 24.67 | 36 |
| Long-context Understanding | MuSiQue | SubEM | 49.33 | 27 |
| Long-context Question Answering | MuSiQue | F1 Score | 46.35 | 19 |
| Long-context Understanding | Average (overall) | SubEM | 34.85 | 18 |
| Long-context Understanding | LV-Eval 16k | SubEM | 38.44 | 9 |
| Long-context Understanding | Average (MuSiQue, 2WikiMQA, MFQA-En, NarrativeQA, En.QA) | SubEM | 40.89 | 9 |
| Long-context Understanding | LV-Eval 32k | SubEM | 35.83 | 9 |