Graph Convolutions Enrich the Self-Attention in Transformers!

About

Transformers, renowned for their self-attention mechanism, have achieved state-of-the-art performance across tasks in natural language processing, computer vision, and time-series modeling, among others. However, deep Transformer models suffer from the oversmoothing problem: representations converge to indistinguishable values as they pass through successive layers, leading to significant performance degradation. We interpret the original self-attention as a simple graph filter and redesign it from a graph signal processing (GSP) perspective. We propose graph-filter-based self-attention (GFSA), which learns a more general yet effective graph filter at a complexity only slightly higher than that of the original self-attention mechanism. We demonstrate that GFSA improves the performance of Transformers in various fields, including computer vision, natural language processing, graph-level tasks, speech recognition, and code classification.
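The graph-filter reading of self-attention can be made concrete with a small NumPy sketch. It treats the softmax attention matrix as a row-stochastic adjacency matrix, shows that applying it repeatedly collapses token representations (oversmoothing), and builds a generic polynomial filter over it. The helper names (`attention_matrix`, `polynomial_filter`, `row_spread`), the random toy weights, and the coefficients `[0.5, 1.0, -0.5]` are illustrative assumptions; the abstract does not specify the exact form of the GFSA filter, so this should not be read as the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_matrix(X, Wq, Wk):
    # Scaled dot-product attention weights; each row sums to 1, so the
    # matrix can be read as a weighted adjacency matrix over the tokens.
    Q, K = X @ Wq, X @ Wk
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]))

def polynomial_filter(A, coeffs):
    # Generic polynomial graph filter: sum_k coeffs[k] * A^k.
    # coeffs = [0, 1] gives back plain self-attention (the filter A itself).
    H = np.zeros_like(A)
    power = np.eye(A.shape[0])
    for w in coeffs:
        H += w * power
        power = power @ A
    return H

def row_spread(X):
    # Mean distance of the token vectors from their centroid;
    # a value near 0 means the tokens have become indistinguishable.
    return float(np.linalg.norm(X - X.mean(axis=0, keepdims=True), axis=1).mean())

rng = np.random.default_rng(0)
n_tokens, d = 6, 4
X = rng.normal(size=(n_tokens, d))
# Small toy projection weights (assumption) keep the attention matrix well mixed.
Wq = 0.1 * rng.normal(size=(d, d))
Wk = 0.1 * rng.normal(size=(d, d))
A = attention_matrix(X, Wq, Wk)

# Oversmoothing: applying the same low-pass filter A many times
# drives every token toward a common vector.
spread_before = row_spread(X)
spread_after = row_spread(np.linalg.matrix_power(A, 50) @ X)

# Plain self-attention as a first-order filter, and a hypothetical
# second-order filter whose coefficients would be learned in practice.
H_plain = polynomial_filter(A, [0.0, 1.0])
H_general = polynomial_filter(A, [0.5, 1.0, -0.5])

print("spread before:", spread_before)
print("spread after 50 applications of A:", spread_after)
```

Because the illustrative coefficients sum to 1, the rows of `H_general` still sum to 1, but the identity and negative second-order terms let the filter retain or sharpen differences between tokens instead of only averaging them.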

Jeongwhan Choi, Hyowon Wi, Jayoung Kim, Yehjin Shin, Kookjin Lee, Nathaniel Trask, Noseong Park • 2023

Related benchmarks

Task	Dataset	Metric	Result
Language Modeling	WikiText-2 (test)	PPL	20.923	2333
Language Modeling	WikiText-103 (test)	Perplexity	15.919	703
Image Classification	ImageNet-1K	Top-1 Acc	83	600
Image Classification	ImageNet 1k (test)	Top-1 Accuracy	83	456
Natural Language Understanding	GLUE (val)	SST-2	95.41	201
Graph Regression	ZINC	MAE	0.069	144
Graph Regression	Peptides-struct	MAE	0.2461	134
Language Modeling	Penn Treebank (PTB) (test)	Perplexity	19.45	130
Graph Classification	CIFAR10	Accuracy	72.44	118
Graph Classification	MNIST	Accuracy	98.26	103

Showing 10 of 25 rows
