DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks

About

Various dropout methods have been designed for the fully-connected, convolutional, and recurrent layers of neural networks, and have been shown to be effective at avoiding overfitting. As an appealing alternative to recurrent and convolutional layers, the fully-connected self-attention layer surprisingly lacks a dedicated dropout method. This paper explores regularizing the attention weights in Transformers to prevent different contextualized feature vectors from co-adapting. Experiments on a wide range of tasks show that DropAttention can improve performance and reduce overfitting.
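The abstract does not spell out how the attention weights are regularized, so the following is only a minimal sketch of one plausible reading: randomly zero individual attention weights in a row and renormalize the survivors so they still sum to 1. The function name `drop_attention`, the drop probability `p`, the renormalization step, and the fallback when every weight is dropped are all assumptions made for illustration, not details confirmed by the text above.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def drop_attention(weights, p, rng):
    """Randomly zero attention weights in one row, then renormalize.

    weights: attention weights of one query over all keys (sums to 1).
    p: hypothetical per-weight drop probability.
    rng: a random.Random instance, passed in for reproducibility.
    """
    kept = [w if rng.random() >= p else 0.0 for w in weights]
    total = sum(kept)
    if total == 0.0:
        # Every weight was dropped; fall back to the unmodified row.
        # This fallback is an assumption of the sketch.
        return list(weights)
    # Renormalize so the surviving weights still sum to 1.
    return [w / total for w in kept]


rng = random.Random(0)
row = softmax([2.0, 1.0, 0.5, -1.0])      # attention of one query over 4 keys
dropped = drop_attention(row, p=0.3, rng=rng)
```

Renormalizing, rather than scaling by 1/(1 - p) as standard dropout does, keeps each row a valid probability distribution over keys; whether that matches the paper's exact formulation should be checked against the paper itself. As with ordinary dropout, such a step would presumably be applied only during training.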

Lin Zehui, Pengfei Liu, Luyao Huang, Junkun Chen, Xipeng Qiu, Xuanjing Huang • 2019

Related benchmarks

Task	Dataset	Result
Image Classification	CIFAR-10 (test)	Accuracy 85.24	882
Image Classification	CIFAR-100 (test)	Accuracy 58.01	295
Natural Language Understanding	GLUE (test dev)	MRPC Accuracy 90.2	90
Temporal Action Detection	THUMOS14 (test)	mAP 53.89	37
Music Genre Classification	GTZAN (test)	Accuracy 84.84	27
Temporal Action Detection	THUMOS14 Kinetics-400 features (test)	mAP 62.03	12
