Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
About
We introduce softpick, a rectified, non-sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M and 1.8B parameter models demonstrate that softpick consistently achieves a 0% sink rate. Softpick transformers produce hidden states with significantly lower kurtosis and create sparse attention maps. Quantized models using softpick outperform their softmax counterparts on standard benchmarks, with a particularly pronounced advantage at lower bit precisions. Our analysis and discussion show how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Our code: https://github.com/zaydzuhri/softpick-attention
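The abstract does not spell out the function itself. As a minimal sketch, assuming the form softpick(x)_i = ReLU(e^{x_i} - 1) / (Σ_j |e^{x_j} - 1| + ε), the "rectified, non-sum-to-one" behavior can be illustrated in plain Python. The function names, the `eps` parameter, and the stabilizing shift by `m` are illustrative assumptions; the linked repository remains the reference implementation.

```python
import math

def softmax(scores):
    """Standard softmax over a list of attention scores (sums to 1)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softpick(scores, eps=1e-8):
    """Sketch of a rectified softmax, under the assumed form
    relu(exp(x_i) - 1) / (sum_j |exp(x_j) - 1| + eps).
    """
    # Assumption: scale numerator and denominator by exp(-m) with m >= 0,
    # so that no exponential can overflow.
    m = max(max(scores), 0.0)
    shifted = [math.exp(s - m) - math.exp(-m) for s in scores]
    denom = sum(abs(v) for v in shifted) + eps
    # Rectification: scores at or below zero receive exactly zero weight.
    return [max(v, 0.0) / denom for v in shifted]

scores = [2.0, 0.0, -1.0]
print(softmax(scores))   # every entry positive, entries sum to 1
print(softpick(scores))  # non-positive scores map to 0, sum is below 1
```

Because the weights need not sum to one, a query with no relevant key can assign zero weight everywhere rather than placing leftover probability mass on a sink token, and the exact zeros are what make the attention maps sparse.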
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Language Modeling | LAMBADA | Accuracy | 43.61 | 412 |
| Language Modeling | WikiText | Word Perplexity | 17.94 | 234 |
| Word Prediction | LAMBADA | Accuracy | 43.41 | 192 |
| Language Modeling | WikiText | Wikitext PPL | 17.86 | 87 |
| Question Answering | ARC Easy | Normalized Accuracy | 62.04 | 55 |
| Commonsense Reasoning | PIQA | Normalized Accuracy | 70.67 | 41 |
| Question Answering | SciQ | Acc Norm | 80.4 | 32 |
| Question Answering | ARC Easy | Normalized Accuracy | 62.21 | 20 |
| Science Question Answering | SciQ | Accuracy | 86.9 | 16 |
| Question Answering | PIQA | Normalized Accuracy | 70.89 | 8 |