Linformer: Self-Attention with Linear Complexity
About
Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences, as the standard self-attention mechanism of the Transformer uses $O(n^2)$ time and space with respect to sequence length. In this paper, we demonstrate that the self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding to propose a new self-attention mechanism, which reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space. The resulting linear transformer, the *Linformer*, performs on par with standard Transformer models, while being much more memory- and time-efficient.
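The abstract's core idea can be sketched concretely: instead of forming the full $n \times n$ attention matrix, Linformer projects the keys and values along the sequence axis from length $n$ down to a fixed $k \ll n$, so attention costs $O(nk)$. The sketch below is a minimal single-head NumPy illustration, not the paper's implementation; the random projection matrices `E` and `F` stand in for the learned projections described in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Single-head Linformer-style attention.

    Q, K, V: (n, d) query/key/value matrices.
    E, F:    (k, n) projections that shrink the sequence axis,
             learned in the paper but random in this sketch.
    """
    K_proj = E @ K  # (k, d): keys compressed from n rows to k
    V_proj = F @ V  # (k, d): values compressed likewise
    # Attention scores are (n, k) rather than (n, n).
    scores = Q @ K_proj.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V_proj  # (n, d)

n, d, k = 512, 64, 32  # hypothetical sizes for illustration
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E = rng.standard_normal((k, n)) / np.sqrt(n)
F = rng.standard_normal((k, n)) / np.sqrt(n)

out = linformer_attention(Q, K, V, E, F)
print(out.shape)  # (512, 64)
```

The output retains the full sequence length $n$, but no intermediate tensor ever has an $n \times n$ shape, which is the source of the linear time and memory claim.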
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | CIFAR-100 (test) | Accuracy | 70.87 | 3518 |
| Image Classification | CIFAR-10 (test) | Accuracy | 92.45 | 3381 |
| Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy | 78.7 | 1866 |
| Image Classification | ImageNet (val) | Top-1 Accuracy | 77.6 | 1206 |
| Language Modeling | PTB | Perplexity | 48.9 | 650 |
| Language Modeling | WikiText-103 (test) | Perplexity | 26.1 | 524 |
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc) | 93.1 | 504 |
| Language Modeling | PTB (test) | Perplexity | 48.9 | 471 |
| Image Classification | CIFAR-10 | -- | -- | 471 |
| Long-range sequence modeling | Long Range Arena (LRA) | Text Accuracy | 57.29 | 164 |