
Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals

About

Transformers have achieved remarkable success in a wide range of natural language processing and computer vision applications. However, the representation capacity of a deep transformer model is degraded due to the over-smoothing issue in which the token representations become identical when the model's depth grows. In this work, we show that self-attention layers in transformers minimize a functional which promotes smoothness, thereby causing token uniformity. We then propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens from self-attention and the input tokens to preserve the fidelity of the tokens. Minimizing the resulting regularized energy functional, we derive the Neural Transformer with a Regularized Nonlocal Functional (NeuTRENO), a novel class of transformer models that can mitigate the over-smoothing issue. We empirically demonstrate the advantages of NeuTRENO over the baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations on various practical tasks, including object classification, image segmentation, and language modeling.
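Based on the abstract, the core idea can be sketched as standard self-attention plus a fidelity term that pulls the output back toward earlier token representations, weighted by a coefficient λ. The sketch below is an illustrative reconstruction, not the authors' code: the function names, the choice of the first-layer values `V_first` as the fidelity anchor, and the λ value are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def neutreno_attention(Q, K, V, V_first, lam=0.6):
    """Sketch of a NeuTRENO-style attention layer.

    Standard scaled dot-product attention produces smoothed tokens;
    the added term lam * (V_first - V) penalizes drift away from the
    input-token values, counteracting over-smoothing. `V_first` (the
    value matrix of an early layer) and lam=0.6 are assumptions for
    illustration.
    """
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))  # (n_tokens, n_tokens) weights
    return attn @ V + lam * (V_first - V)
```

With `lam=0` the layer reduces to vanilla scaled dot-product attention; larger λ trades smoothness for fidelity to the anchor tokens.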

Tam Nguyen, Tan M. Nguyen, Richard G. Baraniuk • 2023

Related benchmarks

Task                   Dataset             Metric          Result  Rank
Image Classification   CIFAR-10 (test)     Accuracy        76.75   3381
Image Classification   ImageNet-1k (val)   Top-1 Accuracy  80.68   1453
Image Classification   ImageNet-A          Top-1 Accuracy  8.2     553
Language Modeling      WikiText-103 (test) Perplexity      33.7    524
Image Classification   ImageNet-R          Top-1 Accuracy  33.82   474
Language Modeling      WikiText-103 (val)  Perplexity      32.6    180
Image Classification   ImageNet-C          mCE             70.1    103
Segmentation           ADE20K              --              --      52
Object Classification  ImageNet (val)      Top-1 Accuracy  73.23   5
Object Classification  CIFAR-10            Accuracy        76.75   2
