Synthesizer: Rethinking Self-Attention in Transformer Models

About

The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all. To this end, we propose Synthesizer, a model that learns synthetic attention weights without token-token interactions. In our experiments, we first show that simple Synthesizers achieve highly competitive performance when compared against vanilla Transformer models across a range of tasks, including machine translation, language modeling, text generation and GLUE/SuperGLUE benchmarks. When composed with dot product attention, we find that Synthesizers consistently outperform Transformers. Moreover, we conduct additional comparisons of Synthesizers against Dynamic Convolutions, showing that simple Random Synthesizer is not only 60% faster but also improves perplexity by a relative 3.5%. Finally, we show that simple factorized Synthesizers can outperform Linformers on encoding-only tasks.
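To make the contrast in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes single-head attention with no masking, and it assumes the "random alignment matrix" is a (seq_len, seq_len) matrix R that is independent of the input and is passed through a softmax before mixing the value projections. All names (`dot_product_attention`, `random_synthesizer_attention`, `R`, `W_v`) are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(X, W_q, W_k, W_v):
    # Baseline: attention weights come from query-key (token-token) interactions.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

def random_synthesizer_attention(X, R, W_v):
    # Assumed form of the random variant: R does not depend on X at all,
    # so no token-token interaction is computed. R could be trained or
    # kept fixed; here it is only randomly initialised.
    return softmax(R) @ (X @ W_v)

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
R = rng.normal(size=(seq_len, seq_len))

Y_dot = dot_product_attention(X, W_q, W_k, W_v)
Y_syn = random_synthesizer_attention(X, R, W_v)
```

Both functions return an output of shape (seq_len, d_model). In this sketch the only difference is where the mixing weights come from: the input itself in the baseline, a free parameter matrix in the synthetic variant.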

Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng • 2020

Related benchmarks

Task	Dataset	Result
Language Modeling	WikiText-103 (test)	Perplexity 32.43	703
Machine Translation	WMT En-De 2014 (test)	BLEU 28.47	379
Language Modeling	WikiText-103 (val)	PPL 31.31	261
Character-level Language Modeling	enwik8 (test)	BPC 1.298	195
Long-range sequence modeling	Long Range Arena (LRA)	Text Accuracy 61.68	177
Long-range sequence modeling	Long Range Arena (LRA) (test)	Accuracy (Avg) 51.1	163
Text Classification	AGNews	Accuracy 89.1	119
Text Classification	IMDB	Accuracy 84.6	119
Long sequence classification	LRA (Long Range Arena) (test)	Average Accuracy 52.88	92
Efficiency Analysis	Long Range Arena (LRA)	Steps per second 65.44	84

Showing 10 of 20 rows
