GLU Variants Improve Transformer

About

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.

Noam Shazeer • 2020
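
The gating described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: bias terms are omitted for brevity, and the names `matvec`, `gated_unit`, `W`, and `V` are chosen here for illustration only. Passing a different `activation` gives a different variant; the identity function corresponds to the linear (bilinear) case and ReLU to a ReLU-gated unit.

```python
import math


def matvec(x, W):
    """Project vector x (length d_in) through W (d_in rows, d_out columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def gated_unit(x, W, V, activation=sigmoid):
    """Component-wise product of two linear projections of x.

    The projection through W is passed through `activation` (the gate);
    the projection through V is left linear. Biases are omitted here,
    which is a simplifying assumption of this sketch.
    """
    gate = [activation(g) for g in matvec(x, W)]
    value = matvec(x, V)
    return [g * v for g, v in zip(gate, value)]


if __name__ == "__main__":
    x = [1.0, -2.0]
    W = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical gate weights
    V = [[2.0, 0.0], [0.0, 3.0]]  # hypothetical value weights
    for name, fn in [("sigmoid", sigmoid), ("relu", relu), ("identity", identity)]:
        print(name, gated_unit(x, W, V, fn))
```

In a Transformer feed-forward sublayer, the output of such a unit would then pass through a further linear projection back to the model dimension; that step is left out of the sketch.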

Related benchmarks

Task | Dataset | Result | Rank
Continual Supervised Learning | CIFAR Random Label | Total Average Online Task Accuracy: 83.06 | 49
Continual Supervised Learning | CIFAR 5+1 | Total Average Online Task Accuracy: 9.57 | 49
Continual Supervised Learning | Continual ImageNet | Total Average Online Task Accuracy: 63.57 | 49
Continual Learning | Permuted MNIST | -- | 32
Continual Learning | MNIST Shuffled Labels | Accuracy (ACC): 31.2 | 22
Plasticity Measurement | Locomotion Tasks Aggregate (Ant, HalfCheetah, Humanoid) (train) | Plasticity Score (IQM): 7.81 | 17
ICU Prediction | SIICU | AUPRC (Sample): 37.2 | 7
Language Modeling | LLM (val) | Loss: 1.3374 | 4