
Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

About

Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that self-attention possesses a strong inductive bias towards "token uniformity". Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs prevent the output from degenerating. Our experiments verify the identified convergence phenomena on different variants of standard transformer architectures.
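
The rank-collapse claim can be illustrated with a small numerical sketch. The code below is not the authors' experimental setup: it stacks single-head softmax attention layers with freshly sampled Gaussian weights (an assumption), and it tracks a simplified collapse measure, the Frobenius norm of the token matrix minus its column-mean rows, divided by the Frobenius norm of the token matrix. That measure is chosen here for convenience and is not necessarily the one analysed in the paper. All function names, sizes, and the seed are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax, so every row sums to one."""
    out = []
    for row in m:
        top = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rng, rows, cols, scale):
    """Matrix with independent Gaussian entries (assumed initialisation)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def attention_layer(x, rng, skip):
    """One single-head self-attention layer with newly sampled weights."""
    d = len(x[0])
    scale = 1.0 / math.sqrt(d)
    w_q, w_k, w_v = (random_matrix(rng, d, d, scale) for _ in range(3))
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    logits = [[sum(a * b for a, b in zip(qi, kj)) * scale for kj in k] for qi in q]
    out = matmul(softmax_rows(logits), v)
    if skip:
        # Optional skip connection: add the layer input back to the output.
        out = [[xi + oi for xi, oi in zip(xr, orow)] for xr, orow in zip(x, out)]
    return out


def relative_residual(x):
    """Distance of X from the nearest matrix with identical rows, relative to ||X||."""
    n = len(x)
    mean = [sum(col) / n for col in zip(*x)]
    res = math.sqrt(sum((v - m) ** 2 for row in x for v, m in zip(row, mean)))
    norm = math.sqrt(sum(v * v for row in x for v in row))
    return res / norm


def run(depth=8, n_tokens=6, d_model=4, skip=False, seed=0):
    """Return the relative residual at the input and after each layer."""
    rng = random.Random(seed)
    x = random_matrix(rng, n_tokens, d_model, 1.0)
    history = [relative_residual(x)]
    for _ in range(depth):
        x = attention_layer(x, rng, skip)
        history.append(relative_residual(x))
    return history


pure = run(skip=False)
with_skip = run(skip=True)
for layer, (p, s) in enumerate(zip(pure, with_skip)):
    print(f"layer {layer}: pure attention {p:.3e}   with skip {s:.3e}")
```

In this toy setting the pure-attention column is expected to shrink far faster than the skip-connection column, which is the qualitative behaviour the abstract describes. The sketch omits MLPs and multiple heads, so it says nothing about their effect.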

Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas • 2021

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Commonsense Reasoning | HellaSwag | Accuracy | 33.1 | 1896 |
| Commonsense Reasoning | WinoGrande | Accuracy | 49.5 | 1442 |
| Commonsense Reasoning | PIQA | Accuracy | 65.1 | 757 |
| Language Modeling | LAMBADA | Accuracy | 30.9 | 412 |
| Language Modeling | Pubmed | Perplexity | 18.05 | 59 |
| Language Modeling | Arxiv Proof-pile | Perplexity | 17.1 | 40 |
| Copy | Copy OOD lengths: 2x, 4x, 8x, 16x, 32x, 64x | Exact Match Accuracy | 100 | 30 |
| MQMTAR | MQMTAR OOD lengths 2x 4x 16x 64x 256x 1024x | Exact Match Accuracy | 100 | 30 |
| Reverse | Reverse OOD lengths: 1.5x, 2x, 4x, 8x | Exact Match Accuracy | 36 | 20 |
| Sort | Sort OOD lengths: 2x, 4x, 8x | Exact Match Accuracy | 0.00e+0 | 15 |
Showing 10 of 20 rows
