Memory-efficient Transformers via Top-$k$ Attention

About

Following the success of dot-product attention in Transformers, numerous approximations have recently been proposed to address its quadratic complexity with respect to the input length. While these variants are memory- and compute-efficient, they cannot be used directly with popular pre-trained language models trained using vanilla attention without an expensive corrective pre-training stage. In this work, we propose a simple yet highly accurate approximation for vanilla attention. We process the queries in chunks and, for each query, compute the top-$k$ scores with respect to the keys. Our approach offers several advantages: (a) its memory usage is linear in the input size, similar to linear attention variants such as Performer and RFA; (b) it is a drop-in replacement for vanilla attention that does not require any corrective pre-training; and (c) it can also lead to significant memory savings in the feed-forward layers after casting them into the familiar query-key-value framework. We evaluate the quality of the top-$k$ approximation for multi-head attention layers on the Long Range Arena benchmark, and for the feed-forward layers of T5 and UnifiedQA on multiple QA datasets. We show that our approach leads to accuracy that is nearly identical to vanilla attention in multiple setups, including training from scratch, fine-tuning, and zero-shot inference.
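The chunked top-$k$ procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `topk_attention`, the `chunk_size` parameter, the $1/\sqrt{d}$ score scaling, and the use of plain Python lists are all assumptions made for illustration. A practical version would use a tensor library and batched top-$k$ selection.

```python
import heapq
import math


def topk_attention(queries, keys, values, k, chunk_size=2):
    """Softmax attention restricted to the k highest-scoring keys per query.

    queries, keys, values are lists of float vectors (hypothetical layout).
    The 1/sqrt(d) scaling is the usual scaled dot-product convention and is
    an assumption here, not something stated in the abstract.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for start in range(0, len(queries), chunk_size):
        chunk = queries[start:start + chunk_size]
        # Score matrix for this chunk only: at most chunk_size x len(keys)
        # entries exist at once, instead of len(queries) x len(keys).
        chunk_scores = [
            [scale * sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
            for q in chunk
        ]
        for scores in chunk_scores:
            # Indices of the k largest scores for this query.
            top = heapq.nlargest(k, range(len(keys)), key=scores.__getitem__)
            # Softmax over the selected scores, max-subtracted for stability.
            m = max(scores[j] for j in top)
            weights = [math.exp(scores[j] - m) for j in top]
            total = sum(weights)
            out = [0.0] * len(values[0])
            for w, j in zip(weights, top):
                for c, v in enumerate(values[j]):
                    out[c] += (w / total) * v
            outputs.append(out)
    return outputs
```

With `k` equal to the number of keys, this sketch reduces to ordinary softmax attention. Smaller `k` drops the low-scoring keys, so each query only needs `k` weights and `k` value vectors once its top scores have been selected.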

Ankit Gupta, Guy Dar, Shaya Goodman, David Ciprut, Jonathan Berant • 2021

Related benchmarks

Task	Dataset	Result
Language Modeling	WikiText-2	--	2320
Image Classification	CIFAR-10	--	564
MQMTAR	MQMTAR OOD lengths 2x 4x 16x 64x 256x 1024x	Exact Match Accuracy: 99.9	30
Copy	Copy OOD lengths: 2x, 4x, 8x, 16x, 32x, 64x	Exact Match Accuracy: 99.7	30
Reverse	Reverse OOD lengths: 1.5x, 2x, 4x, 8x	Exact Match Accuracy: 100	20
Sort	Sort OOD lengths: 2x, 4x, 8x	Exact Match Accuracy: 92.5	15
Image Classification	ImageNet-1K	Top-1 Accuracy: 73.4	9
Image Classification	Tiny-ImageNet	Top-1 Accuracy: 72.9	9
Image Classification	CIFAR-100	Top-1 Accuracy: 57.1	9
Copy	Copy ID, n=64	Exact Match Accuracy: 100	5

Showing 10 of 15 rows
