Lessons on Parameter Sharing across Layers in Transformers
About
We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes the widely used technique of sharing one layer's parameters with all layers, as in Universal Transformers (Dehghani et al., 2019), to improve computational efficiency. We propose three strategies for assigning parameters to layers: Sequence, Cycle, and Cycle (rev). Experimental results show that the proposed strategies are efficient in both parameter size and computational time. Moreover, we show that the proposed strategies remain effective in settings with large amounts of training data, such as recent WMT competitions.
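The three strategies differ only in which of the M independent parameter sets each of the N layers uses. A minimal sketch of the assignment rules (my reading of the paper, with `cycle_rev` assuming N is a multiple of M; the function name and interface are illustrative, not from the authors' code):

```python
def layer_assignment(num_layers, num_shared, strategy):
    """Return, for each of num_layers layers, the index of the
    shared parameter set it uses (num_shared independent sets)."""
    if strategy == "sequence":
        # Consecutive layers share parameters, e.g. N=6, M=3 -> 0,0,1,1,2,2
        return [i * num_shared // num_layers for i in range(num_layers)]
    if strategy == "cycle":
        # Repeat the cycle of parameter sets, e.g. 0,1,2,0,1,2
        return [i % num_shared for i in range(num_layers)]
    if strategy == "cycle_rev":
        # Like cycle, but the last cycle is reversed, e.g. 0,1,2,2,1,0
        # (assumes num_layers is a multiple of num_shared)
        order = [i % num_shared for i in range(num_layers)]
        order[-num_shared:] = reversed(order[-num_shared:])
        return order
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, with N=6 layers and M=3 parameter sets, `sequence` gives `[0, 0, 1, 1, 2, 2]`, `cycle` gives `[0, 1, 2, 0, 1, 2]`, and `cycle_rev` gives `[0, 1, 2, 2, 1, 0]`.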
Sho Takase, Shun Kiyono • 2021
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 38.7 | 1460 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 7.71 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 3.32 | 833 |
| Question Answering | ARC Challenge | -- | -- | 749 |
| Commonsense Reasoning | PIQA | Accuracy | 67.4 | 647 |
| Language Modeling | WikiText-103 (test) | Perplexity | 18.55 | 524 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER | 7.84 | 411 |
| Question Answering | ARC Easy | Accuracy | 40.6 | 386 |
| Question Answering | SciQ | -- | -- | 226 |
| Language Modeling | LAMBADA | Accuracy | 36.3 | 183 |
Showing 10 of 38 rows