
Data Engineering for Scaling Language Models to 128K Context

About

We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular the ability to utilize information at arbitrary input locations, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than seen during training (e.g., 4K to 128K) through lightweight continual pretraining on an appropriate data mixture. We investigate the quantity and quality of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results equally emphasize domain balance and length upsampling. Concretely, we find that naively upsampling longer data on certain domains like books, a common practice of existing work, gives suboptimal performance, and that a balanced domain mixture is important. We demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.
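
The contrast between naive length upsampling and a balanced domain mixture can be illustrated with a small sketch. This is not the authors' implementation: the function name, the 4096-token threshold, and the boost factor of 5 are hypothetical values chosen only for illustration. The idea shown is to upweight long documents within each domain while rescaling so that every domain keeps its original share of tokens.

```python
from collections import defaultdict


def per_domain_length_upsample(docs, long_threshold=4096, long_boost=5.0):
    """Return one sampling weight per document (weights sum to 1).

    docs: list of (domain, num_tokens) tuples with num_tokens > 0.
    long_threshold and long_boost are illustrative assumptions, not
    values taken from the paper.
    """
    total_tokens = sum(n for _, n in docs)
    domain_tokens = defaultdict(float)  # original token count per domain
    domain_raw = defaultdict(float)     # boosted token count per domain
    raw = []
    for domain, n in docs:
        # Long documents get a larger raw weight than short ones.
        w = n * (long_boost if n >= long_threshold else 1.0)
        raw.append(w)
        domain_tokens[domain] += n
        domain_raw[domain] += w

    weights = []
    for (domain, _), w in zip(docs, raw):
        # Rescale so each domain's total weight equals its original
        # token share; only the within-domain length profile changes.
        share = domain_tokens[domain] / total_tokens
        weights.append(share * w / domain_raw[domain])
    return weights


docs = [("books", 100000), ("books", 2000), ("web", 1000),
        ("web", 8000), ("code", 3000)]
weights = per_domain_length_upsample(docs)
```

The resulting weights could be passed to a sampler such as `random.choices(docs, weights=weights, k=...)`. A naive scheme would skip the rescaling step, which lets domains rich in long documents (books, in this toy data) dominate the mixture.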

Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, Hao Peng • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Long-context Language Understanding | RULER 32k context length | FWE 34 | 39 |
| Long-context Language Understanding | RULER 16k context length | FWE Score 47.5 | 21 |
| Long-context Language Understanding | RULER 4k context length | FWE Rate 48 | 16 |
| Long-context Understanding | RULER 8k context | CWE 39.25 | 13 |
| Long-context Language Understanding | LongBench-E 2024 (test) | Short Context QA Score 6.89 | 12 |
| Long-context Information Extraction | RULER 4K-32K Average | CWE Score 32.94 | 6 |
| Long-context Language Understanding | LongBench (standard) | NQA 2.07 | 6 |