LongAlign: A Recipe for Long Context Alignment of Large Language Models
About
Extending large language models to effectively handle long contexts requires instruction fine-tuning on input sequences of similar length. To address this, we present LongAlign -- a recipe covering instruction data, training, and evaluation for long context alignment. First, we construct a long instruction-following dataset using Self-Instruct. To ensure data diversity, it covers a broad range of tasks drawn from various long context sources. Second, we adopt packing and sorted batching strategies to speed up supervised fine-tuning on data with varied length distributions. Additionally, we develop a loss weighting method to balance the contribution to the loss across different sequences during packed training. Third, we introduce the LongBench-Chat benchmark for evaluating instruction-following capabilities on queries of 10k-100k in length. Experiments show that LongAlign outperforms existing recipes for LLMs on long context tasks by up to 30%, while also maintaining proficiency on short, generic tasks. The code, data, and long-aligned models are open-sourced at https://github.com/THUDM/LongAlign.
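The loss-weighting idea can be sketched as follows: when several sequences are packed into one training batch, a plain mean over all tokens lets long sequences dominate the gradient; averaging each sequence's own mean token loss instead gives every packed sequence an equal contribution. The helper below is a minimal illustrative sketch of that balancing step, not the released implementation -- the function name and the flat list-based inputs are assumptions for clarity:

```python
from collections import defaultdict


def weighted_packed_loss(token_losses, seq_ids):
    """Balance per-sequence loss contributions in a packed batch.

    token_losses: per-token loss values for the packed batch (flat list).
    seq_ids: for each token, the id of the sequence it belongs to.
    Returns the mean over sequences of each sequence's mean token loss,
    so short and long sequences contribute equally.
    """
    per_seq = defaultdict(list)
    for loss, sid in zip(token_losses, seq_ids):
        per_seq[sid].append(loss)
    seq_means = [sum(v) / len(v) for v in per_seq.values()]
    return sum(seq_means) / len(seq_means)


# Example: one 4-token sequence with loss 1.0/token packed with a
# 1-token sequence with loss 3.0. A naive token mean gives 1.4,
# biased toward the longer sequence; the weighted loss gives 2.0.
losses = [1.0, 1.0, 1.0, 1.0, 3.0]
ids = [0, 0, 0, 0, 1]
naive = sum(losses) / len(losses)
balanced = weighted_packed_loss(losses, ids)
```

In a real training loop the same reweighting is typically expressed as a per-token weight tensor multiplied into the unreduced loss, which is equivalent but avoids Python-level grouping.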
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long-context Understanding | LongBench (test) | Avg Score | 27.5 | 136 |
| Long-context Understanding | LongBench | Overall Average Score | 56.6 | 115 |
| Long-context Question Answering | MFQA en | SubEM | 27.33 | 36 |
| Long-context Question Answering | En.QA | SubEM | 32.19 | 36 |
| Long-context Question Answering | NarrativeQA | SubEM | 18.5 | 36 |
| Long-context Question Answering | 2WikiMQA | SubEM | 69 | 36 |
| Long-context Understanding | MuSiQue | SubEM | 37.5 | 27 |
| Long-context Question Answering | MuSiQue | F1 Score | 28.76 | 19 |
| Long-context Understanding | Average Overall | SubEM | 33.99 | 18 |
| Multi-Task | LongBench-Chat | Point-wise Rate | 69.8 | 10 |