GLU Variants Improve Transformer
About
Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
Noam Shazeer • 2020
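The gating described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a bias-free feed-forward sublayer, operates on a single input vector using plain Python lists, and the helper names (`linear`, `glu_ffn`, `VARIANTS`) and weight names (`w_gate`, `w_value`, `w_out`) are hypothetical. The variant labels (Bilinear, ReGLU, GEGLU, SwiGLU) are the ones commonly associated with replacing the sigmoid by the identity, ReLU, GELU, and Swish respectively.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def gelu(z):
    # Exact GELU: z * Phi(z), with Phi the standard normal CDF.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def swish(z):
    # Swish with beta = 1 (an assumption made for this sketch).
    return z * sigmoid(z)


def identity(z):
    return z


def linear(x, weight):
    """Project vector x (length d_in) with weight given as d_in rows of d_out columns."""
    d_out = len(weight[0])
    return [sum(x[i] * weight[i][j] for i in range(len(x))) for j in range(d_out)]


def glu_ffn(x, w_gate, w_value, w_out, gate_fn=sigmoid):
    """Gated feed-forward sublayer: (gate_fn(x W_gate) * (x W_value)) W_out."""
    gate = [gate_fn(g) for g in linear(x, w_gate)]   # projection passed through gate_fn
    value = linear(x, w_value)                       # second, ungated projection
    hidden = [g * v for g, v in zip(gate, value)]    # component-wise product
    return linear(hidden, w_out)


# Hypothetical lookup table: variant label -> function used in place of sigmoid.
VARIANTS = {
    "GLU": sigmoid,
    "Bilinear": identity,
    "ReGLU": relu,
    "GEGLU": gelu,
    "SwiGLU": swish,
}

if __name__ == "__main__":
    # Toy weights chosen by hand: d_model = 2, d_ff = 3.
    x = [1.0, -2.0]
    w_gate = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
    w_value = [[0.5, 1.0, -1.0], [1.0, 0.0, 0.5]]
    w_out = [[1.0], [1.0], [1.0]]
    for name, fn in VARIANTS.items():
        print(name, glu_ffn(x, w_gate, w_value, w_out, gate_fn=fn))
```

Swapping `gate_fn` is the only change between variants, which is why the gated sublayer carries three weight matrices instead of the two used by a standard ReLU or GELU feed-forward layer.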
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Continual Supervised Learning | CIFAR Random Label | Total Average Online Task Accuracy: 83.06 | 49 |
| Continual Supervised Learning | CIFAR 5+1 | Total Average Online Task Accuracy: 9.57 | 49 |
| Continual Supervised Learning | Continual ImageNet | Total Average Online Task Accuracy: 63.57 | 49 |
| Continual Learning | Permuted MNIST | -- | 32 |
| Continual Learning | MNIST Shuffled Labels | Accuracy (ACC): 31.2 | 22 |
| Plasticity Measurement | Locomotion Tasks Aggregate (Ant, HalfCheetah, Humanoid) (train) | Plasticity Score (IQM): 7.81 | 17 |
| ICU Prediction | SIICU | AUPRC (Sample): 37.2 | 7 |
| Language Modeling | LLM (val) | Loss: 1.3374 | 4 |