Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals
About
Transformers have achieved remarkable success in a wide range of natural language processing and computer vision applications. However, the representation capacity of a deep transformer model degrades due to the over-smoothing issue, in which token representations become nearly identical as the model's depth grows. In this work, we show that self-attention layers in transformers minimize a functional that promotes smoothness, thereby causing token uniformity. We then propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens from self-attention and the input tokens, preserving the fidelity of the tokens. Minimizing the resulting regularized energy functional, we derive the Neural Transformer with a Regularized Nonlocal Functional (NeuTRENO), a novel class of transformer models that can mitigate the over-smoothing issue. We empirically demonstrate the advantages of NeuTRENO over baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations on various practical tasks, including object classification, image segmentation, and language modeling.
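The abstract describes adding a fidelity term to the usual softmax-attention output so that tokens do not collapse toward a common average. A minimal NumPy sketch of one way such a correction could look is below, assuming the regularizer enters additively as λ·(v⁰ − vˡ), i.e. a scaled difference between the value vectors of the first layer and the current layer; the function names, the additive form, and the λ value are illustrative assumptions, not an exact reproduction of the paper's derivation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def neutreno_style_attention(q, k, v, v0, lam=0.6):
    """Softmax self-attention with a hypothetical fidelity correction.

    q, k, v : (n, d) query/key/value matrices for the current layer.
    v0      : (n, d) value matrix from the first layer (the "input" tokens
              whose fidelity the regularizer is meant to preserve).
    lam     : illustrative regularization weight; lam=0 recovers
              standard softmax attention.
    """
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))   # (n, n) attention matrix
    smoothed = weights @ v                    # standard (smoothing) output
    # Fidelity term pulls the output back toward the first-layer values,
    # counteracting the averaging that drives token uniformity.
    return smoothed + lam * (v0 - v)
```

With `lam=0` the function reduces exactly to baseline softmax attention, which makes the role of the correction term easy to isolate in experiments.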
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | CIFAR-10 (test) | Accuracy | 76.75 | 3381 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 80.68 | 1453 |
| Image Classification | ImageNet-A | Top-1 Accuracy | 8.2 | 553 |
| Language Modeling | WikiText-103 (test) | Perplexity | 33.7 | 524 |
| Image Classification | ImageNet-R | Top-1 Accuracy | 33.82 | 474 |
| Language Modeling | WikiText-103 (val) | Perplexity | 32.6 | 180 |
| Image Classification | ImageNet-C | mCE | 70.1 | 103 |
| Segmentation | ADE20K | -- | -- | 52 |
| Object Classification | ImageNet (val) | Top-1 Accuracy | 73.23 | 5 |
| Object Classification | CIFAR-10 | Accuracy | 76.75 | 2 |