GLU Variants Improve Transformer

About

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
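The gating described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the paper's implementation: it assumes a single input vector, bias-free projections, and a final output projection W2, and the helper names (matvec, gated_ffn, ACTIVATIONS) are hypothetical. The variant labels follow the names used in the paper body (Bilinear, ReGLU, GEGLU, SwiGLU), with Swish fixed at beta = 1.

```python
import math


def matvec(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in rows, d_out columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gelu(z):
    # Exact GELU: z times the standard normal CDF evaluated at z.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def swish(z):
    # Swish with beta = 1 (an assumption made for this sketch).
    return z * sigmoid(z)


# Function applied to the gating projection for each variant.
ACTIVATIONS = {
    "glu": sigmoid,                  # original GLU: sigmoid gate
    "bilinear": lambda z: z,         # linear gate, no nonlinearity
    "reglu": lambda z: max(0.0, z),  # ReLU gate
    "geglu": gelu,                   # GELU gate
    "swiglu": swish,                 # Swish gate
}


def gated_ffn(x, W, V, W2, variant="glu"):
    """Feed-forward sublayer with a gated hidden layer: (act(xW) * xV) W2."""
    act = ACTIVATIONS[variant]
    gate = [act(z) for z in matvec(x, W)]          # activated projection
    value = matvec(x, V)                           # plain linear projection
    hidden = [g * v for g, v in zip(gate, value)]  # component-wise product
    return matvec(hidden, W2)                      # project back to model width
```

Calling gated_ffn with the same weights and a different variant string changes only the function applied to the first projection, which is the one degree of freedom the abstract describes. Note that a gated layer uses three weight matrices where a standard ReLU or GELU feed-forward layer uses two, so comparisons at equal parameter count need a narrower hidden layer.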

Noam Shazeer • 2020

Related benchmarks

Task	Dataset	Result
Continual Supervised Learning	CIFAR Random Label	Total Average Online Task Accuracy 83.06	49
Continual Supervised Learning	CIFAR 5+1	Total Average Online Task Accuracy 9.57	49
Continual Supervised Learning	Continual ImageNet	Total Average Online Task Accuracy 63.57	49
Continual Learning	Permuted MNIST	--	32
Continual Learning	MNIST Shuffled Labels	Accuracy (ACC) 31.2	22
Plasticity Measurement	Locomotion Tasks Aggregate (Ant, HalfCheetah, Humanoid) (train)	Plasticity Score (IQM) 7.81	17
ICU Prediction	SIICU	AUPRC (Sample) 37.2	7
Language Modeling	LLM (val)	Loss 1.3374	4
