
Lessons on Parameter Sharing across Layers in Transformers

About

We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes a widely used technique that shares the parameters of a single layer across all layers, as in Universal Transformers (Dehghani et al., 2019), in order to improve computational efficiency. We propose three strategies for assigning parameters to layers: Sequence, Cycle, and Cycle (rev). Experimental results show that the proposed strategies are efficient in both parameter size and computational time. Moreover, the proposed strategies remain effective in settings with large amounts of training data, such as the recent WMT competitions.
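The three assignment strategies can be sketched as follows. This is a minimal illustration based on the abstract's description, assuming M independent parameter sets shared among N layers (with N divisible by M); the function name and exact layer orderings are assumptions, not the authors' code:

```python
def assign_params(num_layers: int, num_params: int, strategy: str) -> list[int]:
    """Return, for each layer, the index of the parameter set it uses.

    Hypothetical sketch of the Sequence / Cycle / Cycle (rev) strategies:
    e.g. with 6 layers and 3 parameter sets,
      sequence  -> [0, 0, 1, 1, 2, 2]  (consecutive layers share parameters)
      cycle     -> [0, 1, 2, 0, 1, 2]  (parameters repeat cyclically)
      cycle_rev -> [0, 1, 2, 2, 1, 0]  (last cycle runs in reverse order)
    """
    if strategy == "sequence":
        per_param = num_layers // num_params
        return [min(i // per_param, num_params - 1) for i in range(num_layers)]
    if strategy == "cycle":
        return [i % num_params for i in range(num_layers)]
    if strategy == "cycle_rev":
        head = [i % num_params for i in range(num_layers - num_params)]
        tail = list(range(num_params - 1, -1, -1))  # reversed final cycle
        return head + tail
    raise ValueError(f"unknown strategy: {strategy}")
```

All three strategies use the same number of unique parameter sets; they differ only in which layers share them, which is what the paper's experiments compare.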

Sho Takase, Shun Kiyono · 2021

Related benchmarks

Task                          Dataset                    Result             Rank
Commonsense Reasoning         HellaSwag                  Accuracy 38.7      1891
Automatic Speech Recognition  LibriSpeech clean (test)   WER 3.32           1156
Automatic Speech Recognition  LibriSpeech (test-other)   WER 7.71           1151
Question Answering            ARC Challenge              --                  906
Commonsense Reasoning         PIQA                       Accuracy 67.4       751
Question Answering            ARC Easy                   Accuracy 40.6       597
Language Modeling             WikiText-103 (test)        Perplexity 18.55    579
Automatic Speech Recognition  LibriSpeech (dev-other)    WER 7.84            462
Question Answering            SciQ                       --                  283
Language Modeling             LAMBADA                    Accuracy 36.3       268

Showing 10 of 38 rows
