
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax

About

We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M and 1.8B parameter models demonstrate that softpick consistently achieves a 0% sink rate. The softpick transformers produce hidden states with significantly lower kurtosis and create sparse attention maps. Quantized models using softpick outperform their softmax counterparts on standard benchmarks, with a particularly pronounced advantage at lower bit precisions. Our analysis and discussion show how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Our code: https://github.com/zaydzuhri/softpick-attention

Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji • 2025
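
The abstract describes softpick as rectified and not sum-to-one but does not state the formula. The sketch below assumes the formulation softpick(x)_i = ReLU(exp(x_i) - 1) / (sum_j |exp(x_j) - 1| + eps); the function name, the `eps` default, and the max-shift used for numerical stability are illustrative choices, not taken from this page. The linked repository remains the authoritative implementation.

```python
import math


def softpick(scores, eps=1e-6):
    """Sketch of a rectified, not sum-to-one softmax over one row of scores.

    Assumed form: ReLU(exp(x) - 1) / (sum |exp(x) - 1| + eps).
    """
    # Shift by m >= 0 so every exponent is <= 0 and nothing overflows.
    # exp(x) - 1 == exp(m) * (exp(x - m) - exp(-m)); the exp(m) factor
    # cancels between numerator and denominator (up to the eps term).
    m = max(max(scores), 0.0)
    shifted = [math.exp(x - m) - math.exp(-m) for x in scores]

    # Denominator uses absolute values, so negative scores still add
    # weight to the normalizer even though they receive zero attention.
    denom = sum(abs(s) for s in shifted) + eps

    # Rectification: scores at or below zero map to exactly 0.0.
    return [max(s, 0.0) / denom for s in shifted]


weights = softpick([1.0, -1.0])
print(weights, sum(weights))
```

In this example the negative score receives a weight of exactly zero and the weights sum to less than one, which is the behavior the abstract credits for sparse attention maps and for removing the need for an attention sink.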

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Language Modeling | LAMBADA | Accuracy 43.61 | 412 |
| Language Modeling | WikiText | Word Perplexity 17.94 | 234 |
| Word Prediction | LAMBADA | Accuracy 43.41 | 192 |
| Language Modeling | WikiText | Wikitext PPL 17.86 | 87 |
| Question Answering | ARC Easy | Normalized Accuracy 62.04 | 55 |
| Commonsense Reasoning | PIQA | Normalized Accuracy 70.67 | 41 |
| Question Answering | SciQ | Acc Norm 80.4 | 32 |
| Question Answering | ARC Easy | Normalized Accuracy 62.21 | 20 |
| Science Question Answering | SciQ | Accuracy 86.9 | 16 |
| Question Answering | PIQA | Normalized Accuracy 70.89 | 8 |
