Data Engineering for Scaling Language Models to 128K Context

About

We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular the ability to utilize information at arbitrary input locations, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than seen during training (e.g., 4K to 128K) through lightweight continual pretraining on an appropriate data mixture. We investigate the quantity and quality of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results equally emphasize domain balance and length upsampling. Concretely, we find that naively upsampling longer data on certain domains like books, a common practice in existing work, gives suboptimal performance, and that a balanced domain mixture is important. We demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.
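One possible reading of "domain balance" combined with "length upsampling" is to boost long documents inside each domain while rescaling so that every domain keeps its original share of tokens. The Python sketch below illustrates only that reading. The function name, the `(domain, num_tokens)` input format, the threshold, and the boost factor are all hypothetical choices for illustration and are not taken from the paper.

```python
from collections import defaultdict


def per_domain_length_upsampling(docs, long_threshold, long_boost):
    """Return one sampling weight per document.

    docs: list of (domain, num_tokens) pairs (hypothetical input format).
    Documents with num_tokens >= long_threshold are boosted by long_boost,
    then weights are rescaled inside each domain so that the domain's
    expected token mass is the same as before boosting.
    """
    original = defaultdict(float)  # token mass per domain before boosting
    boosted = defaultdict(float)   # token mass per domain after boosting
    raw = []
    for domain, num_tokens in docs:
        w = long_boost if num_tokens >= long_threshold else 1.0
        raw.append(w)
        original[domain] += num_tokens
        boosted[domain] += w * num_tokens
    # Rescale per domain: the domain mixture stays fixed, only the
    # length distribution inside each domain shifts toward long documents.
    return [w * original[d] / boosted[d] for w, (d, _) in zip(raw, docs)]


# Illustrative data only: two domains, one long document in "books".
docs = [("books", 100_000), ("books", 1_000), ("web", 2_000), ("web", 3_000)]
weights = per_domain_length_upsampling(docs, long_threshold=32_000, long_boost=5.0)
```

By contrast, the naive strategy the abstract warns against would boost long documents globally, which shifts the overall mixture toward whichever domains happen to contain long texts, such as books.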

Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, Hao Peng • 2024

Related benchmarks

Task	Dataset	Result
Long-context Language Understanding	RULER 32k context length	FWE: 34	39
Long-context Language Understanding	RULER 16k context length	FWE Score: 47.5	21
Long-context Language Understanding	RULER 4k context length	FWE Rate: 48	16
Long-context Understanding	RULER 8k context	CWE: 39.25	13
Long-context Language Understanding	LongBench-E 2024 (test)	Short Context QA Score: 6.89	12
Long-context Information Extraction	RULER 4K-32K Average	CWE Score: 32.94	6
Long-context Language Understanding	LongBench (standard)	NQA: 2.07	6

