
SparQ Attention: Bandwidth-Efficient LLM Inference

About

The computational difficulties of large language model (LLM) inference remain a significant obstacle to their widespread deployment. The need for many applications to support long input sequences and process them in large batches typically causes token-generation to be bottlenecked by data transfer. For this reason, we introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by utilising memory bandwidth more efficiently within the attention layers, through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show that SparQ Attention brings up to 8x savings in attention data transfers without substantial drops in accuracy, by evaluating Llama 2 and 3, Mistral, Gemma and Pythia models on a wide range of downstream tasks.
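The core idea above — approximating attention scores cheaply, then fetching the full cached keys and values only for the most relevant history positions — can be sketched in NumPy. This is a simplified single-query illustration, not the paper's exact implementation: the function name, the plain `1/sqrt(d)` scaling in the approximation step, and the mean-value compensation weighting are assumptions for readability.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparq_attention(q, K, V, r=16, k=64):
    """Bandwidth-efficient approximate attention for one query (sketch).

    q: (d,) query vector;  K, V: (S, d) cached keys/values.
    r: number of query components used to approximate scores.
    k: number of history positions whose K/V rows are fetched in full.
    """
    S, d = K.shape
    # Step 1: approximate scores using only the r largest-magnitude
    # query components, so only r columns of K need to be read.
    idx_r = np.argsort(np.abs(q))[-r:]
    approx_scores = K[:, idx_r] @ q[idx_r] / np.sqrt(d)  # (S,)
    # Step 2: fetch full K/V rows only for the top-k scoring positions
    # and run exact attention over that subset.
    top_k = np.argsort(approx_scores)[-k:]
    scores = K[top_k] @ q / np.sqrt(d)
    out = softmax(scores) @ V[top_k]
    # Compensate for the dropped positions with the mean cached value,
    # weighted by the approximate softmax mass assigned to the kept set
    # (a simplified version of the paper's reallocation idea).
    alpha = softmax(approx_scores)[top_k].sum()
    return alpha * out + (1.0 - alpha) * V.mean(axis=0)
```

With `r = d` and `k = S` the sketch reduces to exact dense attention; shrinking `r` and `k` trades a controlled amount of accuracy for proportionally less data transferred from the key/value cache.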

Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr • 2023

Related benchmarks

Task                        Dataset     Metric         Result   Rank
Long-context Understanding  LongBench   Accuracy       92.2     60
Long-context evaluation     RULER 16k   Total Score    56.02    59
Long-context evaluation     RULER 32k   Overall Score  36.74    41
Long-context evaluation     RULER 4k    Score          87.93    35
Long-context evaluation     RULER 8k    Score          68.97    35
Mathematical Reasoning      MATH 500    Flex Match     84.8     27
