Lessons on Parameter Sharing across Layers in Transformers
About
We propose a parameter sharing method for Transformers (Vaswani et al., 2017). The proposed approach relaxes the widely used technique of sharing one layer's parameters with all layers, as in Universal Transformers (Dehghani et al., 2019), to improve computational efficiency. We propose three strategies for assigning parameters to layers: Sequence, Cycle, and Cycle (rev). Experimental results show that the proposed strategies are efficient in both parameter size and computational time. Moreover, we show that the proposed strategies remain effective in settings with large amounts of training data, such as recent WMT competitions.
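The three strategies differ only in which of the M independent parameter sets each of the N layers uses. A minimal sketch of the assignment rules (my reading of the paper, with `cycle_rev` assuming N is a multiple of M; the function name and interface are illustrative, not from the authors' code):

```python
def layer_assignment(num_layers, num_shared, strategy):
    """Return, for each of num_layers layers, the index of the
    shared parameter set it uses (num_shared independent sets)."""
    if strategy == "sequence":
        # Consecutive layers share parameters, e.g. N=6, M=3 -> 0,0,1,1,2,2
        return [i * num_shared // num_layers for i in range(num_layers)]
    if strategy == "cycle":
        # Repeat the cycle of parameter sets, e.g. 0,1,2,0,1,2
        return [i % num_shared for i in range(num_layers)]
    if strategy == "cycle_rev":
        # Like cycle, but the last cycle is reversed, e.g. 0,1,2,2,1,0
        # (assumes num_layers is a multiple of num_shared)
        order = [i % num_shared for i in range(num_layers)]
        order[-num_shared:] = reversed(order[-num_shared:])
        return order
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, with N=6 layers and M=3 parameter sets, `sequence` gives `[0, 0, 1, 1, 2, 2]`, `cycle` gives `[0, 1, 2, 0, 1, 2]`, and `cycle_rev` gives `[0, 1, 2, 2, 1, 0]`.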
Sho Takase, Shun Kiyono • 2021
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 38.7 | 1460 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 7.71 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 3.32 | 833 |
| Question Answering | ARC Challenge | -- | -- | 749 |
| Commonsense Reasoning | PIQA | Accuracy | 67.4 | 647 |
| Language Modeling | WikiText-103 (test) | Perplexity | 18.55 | 524 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER | 7.84 | 411 |
| Question Answering | ARC Easy | Accuracy | 40.6 | 386 |
| Question Answering | SciQ | -- | -- | 226 |
| Language Modeling | LAMBADA | Accuracy | 36.3 | 183 |
Showing 10 of 38 rows