Data Engineering for Scaling Language Models to 128K Context
About
We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular the ability to utilize information at arbitrary input locations, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than those seen during training (e.g., 4K to 128K) through lightweight continual pretraining on an appropriate data mixture. We investigate the quantity and quality of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results equally emphasize domain balance and length upsampling. Concretely, we find that naively upsampling longer data in certain domains like books, a common practice in existing work, gives suboptimal performance, and that a balanced domain mixture is important. We demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.
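The contrast between naive length upsampling and a balanced domain mixture can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `per_domain_length_upsample`, the `(domain, n_tokens)` document representation, the 4096-token threshold, and the 5x weight are all assumptions chosen for illustration. The sketch first picks a domain in proportion to its original share of documents, then upsamples long documents only within that domain, so longer data is favored without shifting the mixture toward domains such as books.

```python
import random
from collections import defaultdict


def per_domain_length_upsample(docs, long_threshold=4096, long_weight=5.0,
                               n_samples=1000, seed=0):
    """Sample document indices, favoring long documents within each domain.

    docs: list of (domain, n_tokens) tuples (hypothetical representation).
    long_threshold, long_weight: illustrative values, not taken from the paper.
    """
    rng = random.Random(seed)

    # Group documents by domain so the domain shares can be held fixed.
    by_domain = defaultdict(list)
    for idx, (domain, n_tokens) in enumerate(docs):
        by_domain[domain].append((idx, n_tokens))

    domains = sorted(by_domain)
    # Assumption: a domain's share is its original share of documents.
    domain_weights = [len(by_domain[d]) for d in domains]

    sampled = []
    for _ in range(n_samples):
        # Step 1: choose the domain according to the original mixture.
        domain = rng.choices(domains, weights=domain_weights, k=1)[0]
        members = by_domain[domain]
        # Step 2: within that domain only, give long documents extra weight.
        weights = [long_weight if n >= long_threshold else 1.0
                   for _, n in members]
        idx, _ = rng.choices(members, weights=weights, k=1)[0]
        sampled.append(idx)
    return sampled


if __name__ == "__main__":
    toy_docs = [("web", 800), ("web", 9000), ("books", 1200), ("books", 60000)]
    picks = per_domain_length_upsample(toy_docs, n_samples=10)
    print(picks)
```

A naive global variant would weight every long document equally regardless of domain; because long documents are concentrated in a few domains, that would change the domain mixture, which is the effect the abstract reports as suboptimal.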
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long-context Language Understanding | RULER 32k context length | FWE | 34 | 39 |
| Long-context Language Understanding | RULER 16k context length | FWE Score | 47.5 | 21 |
| Long-context Language Understanding | RULER 4k context length | FWE Rate | 48 | 16 |
| Long-context Understanding | RULER 8k context | CWE | 39.25 | 13 |
| Long-context Language Understanding | LongBench-E 2024 (test) | Short Context QA Score | 6.89 | 12 |
| Long-context Information Extraction | RULER 4K-32K Average | CWE Score | 32.94 | 6 |
| Long-context Language Understanding | LongBench (standard) | NQA | 2.07 | 6 |