Softpick: No Attention Sink, No Massive Activations with Rectified Softmax

About

We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M and 1.8B parameter models demonstrate that softpick consistently achieves a 0% sink rate. Softpick transformers produce hidden states with significantly lower kurtosis and create sparse attention maps. Quantized models using softpick outperform their softmax counterparts on standard benchmarks, with a particularly pronounced advantage at lower bit precisions. Our analysis and discussion show how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Our code: https://github.com/zaydzuhri/softpick-attention

Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji • 2025
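
The abstract describes softpick as rectified and not sum-to-one but does not state the formula. The sketch below assumes the form softpick(x)_i = ReLU(e^{x_i} - 1) / sum_j |e^{x_j} - 1|, which is our reading of the paper; the linked repository is the authoritative implementation. The shift by `m` and the `eps` term are illustrative choices for numerical safety, and the function names and the example row are hypothetical.

```python
import math

def softpick(scores, eps=1e-6):
    """Sketch of softpick over one row of attention scores (assumed formula)."""
    # Multiplying numerator and denominator by exp(-m) leaves the ratio
    # unchanged (up to eps) and keeps every exponential at or below 1.
    m = max(0.0, max(scores))
    shifted = [math.exp(s - m) - math.exp(-m) for s in scores]
    # The absolute value in the denominator lets negative scores take up
    # normalization mass without receiving any attention weight.
    denom = sum(abs(v) for v in shifted) + eps
    # Rectification: scores at or below zero map to exactly zero.
    return [max(v, 0.0) / denom for v in shifted]

def softmax(scores):
    """Standard softmax, for comparison."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

row = [2.0, 0.0, -1.0]  # hypothetical pre-activation scores
pick = softpick(row)
soft = softmax(row)
print("softpick:", pick, "sum =", sum(pick))
print("softmax: ", soft, "sum =", sum(soft))
```

Under this assumed form, a row of all non-positive scores yields all-zero weights, so a head is not forced to place attention anywhere. That is consistent with the sparse attention maps and absent attention sink reported in the abstract.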

Related benchmarks

Task	Dataset	Result
Language Modeling	LAMBADA	Accuracy 43.61	412
Language Modeling	WikiText	Word Perplexity 17.94	331
Word Prediction	LAMBADA	Accuracy 43.41	222
Language Modeling	WikiText	Wikitext PPL 17.86	151
Question Answering	ARC Easy	Normalized Accuracy 62.04	55
Commonsense Reasoning	PIQA	Normalized Accuracy 70.67	41
Question Answering	SciQ	Acc Norm 80.4	32
Question Answering	ARC Easy	Normalized Accuracy 62.21	20
Science Question Answering	SciQ	Accuracy 86.9	16
Question Answering	PIQA	Normalized Accuracy 70.89	8

