
Improving Transformer Models by Reordering their Sublayers

About

Multilayer transformer networks consist of interleaved self-attention and feedforward sublayers. Could ordering the sublayers in a different pattern lead to better performance? We generate randomly ordered transformers and train them with the language modeling objective. We observe that some of these models are able to achieve better performance than the interleaved baseline, and that those successful variants tend to have more self-attention at the bottom and more feedforward sublayers at the top. We propose a new transformer pattern that adheres to this property, the sandwich transformer, and show that it improves perplexity on multiple word-level and character-level language modeling benchmarks, at no cost in parameters, memory, or training time. However, the sandwich reordering pattern does not guarantee performance gains across every task, as we demonstrate on machine translation models. Instead, we suggest that further exploration of task-specific sublayer reorderings is needed in order to unlock additional gains.
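For illustration only, here is a minimal sketch (not the authors' released code) of how the interleaved baseline and the sandwich ordering can be written as strings of "s" (self-attention) and "f" (feedforward) sublayers. The parameter name k, denoting how many extra self-attention sublayers sit at the bottom and extra feedforward sublayers at the top, is an assumption made for this example.

```python
# Minimal sketch: describe transformer sublayer orderings as strings.
# "s" = self-attention sublayer, "f" = feedforward sublayer.
# The baseline interleaves them; the sandwich variant moves extra "s"
# sublayers to the bottom and extra "f" sublayers to the top while keeping
# the same total number of sublayers (so parameter count is unchanged).

def interleaved(n: int) -> str:
    """Baseline transformer with n alternating (s, f) pairs, e.g. 'sfsf...'."""
    return "sf" * n

def sandwich(n: int, k: int) -> str:
    """Hypothetical sandwich ordering: k extra self-attention sublayers at the
    bottom and k extra feedforward sublayers at the top (k is assumed here)."""
    assert 0 <= k <= n, "k cannot exceed the number of sublayer pairs"
    return "s" * k + "sf" * (n - k) + "f" * k

if __name__ == "__main__":
    print(interleaved(8))   # sfsfsfsfsfsfsfsf
    print(sandwich(8, 3))   # ssssfsfsfsfsffff
```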

Ofir Press, Noah A. Smith, Omer Levy • 2019

Related benchmarks

Task | Dataset | Result | Rank
Commonsense Reasoning | HellaSwag | -- | 1891
Question Answering | ARC Easy | Accuracy: 63.43 | 597
Language Modeling | WikiText-103 (test) | Perplexity: 17.96 | 579
Character-level Language Modeling | enwik8 (test) | BPC: 0.968 | 195
Language Modeling | WikiText-103 | Perplexity: 18.2 | 189
Character-level Language Modeling | text8 (test) | BPC: 1.076 | 128
Question Answering | ARC Challenge | Normalized Accuracy: 30.8 | 86
Word-level Language Modeling | WikiText-103 word-level (test) | Perplexity: 17.96 | 65
Language Modeling | Enwiki8 | BPC: 1.1 | 23
Character-level Language Modeling | text8 | BPC: 1.18 | 16

(Showing 10 of 15 rows)

Other info

Code
