Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals
About
Transformers have achieved remarkable success in a wide range of natural language processing and computer vision applications. However, the representation capacity of a deep transformer model degrades due to the over-smoothing issue, in which token representations become nearly identical as the model's depth grows. In this work, we show that self-attention layers in transformers minimize a functional that promotes smoothness, thereby causing token uniformity. We then propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens from self-attention and the input tokens, preserving the fidelity of the tokens. Minimizing the resulting regularized energy functional, we derive the Neural Transformer with a Regularized Nonlocal Functional (NeuTRENO), a novel class of transformer models that can mitigate the over-smoothing issue. We empirically demonstrate the advantages of NeuTRENO over baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations on various practical tasks, including object classification, image segmentation, and language modeling.
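The abstract describes adding a fidelity term to the usual softmax-attention output so that tokens do not collapse toward a common average. A minimal NumPy sketch of one way such a correction could look is below, assuming the regularizer enters additively as λ·(v⁰ − vˡ), i.e. a scaled difference between the value vectors of the first layer and the current layer; the function names, the additive form, and the λ value are illustrative assumptions, not an exact reproduction of the paper's derivation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def neutreno_style_attention(q, k, v, v0, lam=0.6):
    """Softmax self-attention with a hypothetical fidelity correction.

    q, k, v : (n, d) query/key/value matrices for the current layer.
    v0      : (n, d) value matrix from the first layer (the "input" tokens
              whose fidelity the regularizer is meant to preserve).
    lam     : illustrative regularization weight; lam=0 recovers
              standard softmax attention.
    """
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))   # (n, n) attention matrix
    smoothed = weights @ v                    # standard (smoothing) output
    # Fidelity term pulls the output back toward the first-layer values,
    # counteracting the averaging that drives token uniformity.
    return smoothed + lam * (v0 - v)
```

With `lam=0` the function reduces exactly to baseline softmax attention, which makes the role of the correction term easy to isolate in experiments.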
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | CIFAR-10 (test) | Accuracy | 76.75 | 3381 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 80.68 | 1453 |
| Image Classification | ImageNet-A | Top-1 Accuracy | 8.2 | 553 |
| Language Modeling | WikiText-103 (test) | Perplexity | 33.7 | 524 |
| Image Classification | ImageNet-R | Top-1 Accuracy | 33.82 | 474 |
| Language Modeling | WikiText-103 (val) | Perplexity | 32.6 | 180 |
| Image Classification | ImageNet-C | mCE | 70.1 | 103 |
| Segmentation | ADE20K | -- | -- | 52 |
| Object Classification | ImageNet (val) | Top-1 Accuracy | 73.23 | 5 |
| Object Classification | CIFAR-10 | Accuracy | 76.75 | 2 |