Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

About

Key-value (KV) caching plays an essential role in accelerating decoding for transformer-based autoregressive large language models (LLMs). However, the amount of memory required to store the KV cache can become prohibitive at long sequence lengths and large batch sizes. Since the invention of the transformer, two of the most effective interventions discovered for reducing the size of the KV cache have been Multi-Query Attention (MQA) and its generalization, Grouped-Query Attention (GQA). MQA and GQA both modify the design of the attention block so that multiple query heads can share a single key/value head, reducing the number of distinct key/value heads by a large factor while only minimally degrading accuracy. In this paper, we show that it is possible to take Multi-Query Attention a step further by also sharing key and value heads between adjacent layers, yielding a new attention design we call Cross-Layer Attention (CLA). With CLA, we find that it is possible to reduce the size of the KV cache by another 2x while maintaining nearly the same accuracy as unmodified MQA. In experiments training 1B- and 3B-parameter models from scratch, we demonstrate that CLA provides a Pareto improvement over the memory/accuracy tradeoffs which are possible with traditional MQA, enabling inference with longer sequence lengths and larger batch sizes than would otherwise be possible

William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly• 2024

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	WinoGrande	Accuracy57.46	1442
Commonsense Reasoning	PIQA	Accuracy74.32	757
Physical Commonsense Reasoning	PIQA	Accuracy74.54	696
Question Answering	ARC-E	Accuracy65.53	523
Question Answering	OpenBookQA	Accuracy25.2	465
Language Modeling	LAMBADA	Accuracy49.2	412
Sentence Completion	HellaSwag	Accuracy47.12	364
Boolean Question Answering	BoolQ	Accuracy53.21	350
Question Answering	OBQA	Accuracy25.2	347
Question Answering	SciQ	Accuracy89.2	283

Showing 10 of 27 rows

Other info

Follow for update

@wizwand_team Discord