Lost in the Middle: How Language Models Use Long Contexts

About

While recent language models have the ability to take long contexts as input, relatively little is known about how well they use longer context. We analyze the performance of language models on two tasks that require identifying relevant information in their input contexts: multi-document question answering and key-value retrieval. We find that performance can degrade significantly when changing the position of relevant information, indicating that current language models do not robustly make use of information in long input contexts. In particular, we observe that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models. Our analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context language models.
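The position-sensitivity analysis described above can be illustrated with a small sketch of the key-value retrieval setup: build a JSON object of random key-value pairs, place the queried pair at a chosen index, and measure accuracy as that index varies. The prompt wording, the function names `make_kv_prompt` and `accuracy_by_position`, and the `model` callable (any function mapping a prompt string to a response string) are assumptions for illustration, not the authors' code.

```python
import json
import random
import uuid


def make_kv_prompt(num_pairs, gold_position, seed=0):
    """Build a synthetic retrieval prompt with the queried pair at gold_position.

    Hypothetical helper: keys and values are random UUID strings, and the
    instruction text is an assumed wording.
    """
    rng = random.Random(seed)
    pairs = [
        (str(uuid.UUID(int=rng.getrandbits(128))),
         str(uuid.UUID(int=rng.getrandbits(128))))
        for _ in range(num_pairs)
    ]
    gold_key, gold_value = pairs[0]
    distractors = pairs[1:]
    # Insert the queried pair at the requested index among the distractors.
    ordered = (distractors[:gold_position]
               + [(gold_key, gold_value)]
               + distractors[gold_position:])
    prompt = (
        "Extract the value corresponding to the specified key "
        "in the JSON object below.\n\n"
        + json.dumps(dict(ordered), indent=1)
        + f'\n\nKey: "{gold_key}"\nCorresponding value:'
    )
    return prompt, gold_key, gold_value


def accuracy_by_position(model, num_pairs, positions, trials=20):
    """Return {position: accuracy} for a model callable (prompt -> text)."""
    results = {}
    for pos in positions:
        correct = 0
        for trial in range(trials):
            prompt, _, gold_value = make_kv_prompt(num_pairs, pos, seed=trial)
            # Count a response as correct if it contains the gold value.
            if gold_value in model(prompt):
                correct += 1
        results[pos] = correct / trials
    return results
```

Plotting the returned dictionary against position would show whether accuracy dips when the queried pair sits in the middle of the context; a model that used its context uniformly would give a flat curve.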

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang • 2023

Related benchmarks

Task	Dataset	Result
Question Answering	SQuAD	Accuracy 75.21	32
Open-ended Information Seeking	Infinity-Chat (test)	DSem 0.119	20
Question Answering	AstroQA (Hard)	Accuracy 62.6	10
Question Answering	AstroQA Easy	Accuracy 62.06	10
Question Answering	AstroQA Medium	Accuracy 58.87	10
Anemia prediction	EHRSHOT (test)	Accuracy 48.42	6
30-day Readmission	EHRSHOT 30-day Readmission	Accuracy 52.67	6
Acute Myocardial Infarction prediction	EHRSHOT (test)	Accuracy 89.21	6
Long Length of Stay	EHRSHOT Long Length of Stay	Accuracy 65.1	6
Question Answering	AstroQA 1.0 (Total)	Accuracy 61.3	4
