Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models

About

Large language models have achieved remarkable success in recent years, primarily due to self-attention. However, traditional Softmax attention suffers from numerical instability and reduced performance as the number of inference tokens increases. This work addresses these issues by proposing a new design principle for attention, viewing it as a two-stage process. The first stage (normalisation) refines standard attention by replacing Softmax with the more numerically stable Softplus followed by $l_{1}$-normalisation. Furthermore, we introduce a dynamic scale factor based on invariance entropy. We show that this novel attention mechanism outperforms both conventional Softmax attention and state-of-the-art Softmax-free alternatives. Our second proposal is a second processing stage (sharpening): a re-weighting mechanism that amplifies significant attention weights while diminishing weaker ones. This enables the model to concentrate more effectively on relevant tokens, mitigating the attention sink phenomenon and fundamentally improving length extrapolation. This novel two-stage replacement for self-attention is shown to ensure numerical stability and dramatically improve length extrapolation, maintaining a nearly constant validation loss at 16$\times$ the training length while achieving superior results on challenging long-context retrieval tasks and downstream benchmarks. Furthermore, symbolic regression experiments demonstrate that our method enables models to recover Newton's gravitational law from orbital trajectory sequences, providing evidence that appropriate attention mechanisms are crucial for foundation models to develop genuine physical world models. Our code is available at https://github.com/iminfine/freeattn.
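
As a rough illustration of the two-stage idea described above, the following Python sketch applies both stages to a single row of raw attention scores. The abstract does not state the formulas, so several details are assumptions: the function names, the fixed `scale` argument (a stand-in for the dynamic scale factor), and the sharpening rule (raising each weight to a power and renormalising) are hypothetical and may differ from the authors' implementation in the linked repository.

```python
import math


def softplus(x: float) -> float:
    # Stable form of log(1 + exp(x)); exp is only called on non-positive values.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def l1_normalise(values):
    # Divide by the sum so the non-negative weights add up to 1.
    total = sum(values)
    return [v / total for v in values]


def normalise_stage(scores, scale=1.0):
    # Stage 1 (normalisation): Softplus instead of exp, then l1-normalisation.
    # `scale` is a hypothetical fixed stand-in for the dynamic scale factor.
    return l1_normalise([softplus(scale * s) for s in scores])


def sharpen_stage(weights, power=3.0):
    # Stage 2 (sharpening), ASSUMED form: raise each weight to `power` > 1
    # and renormalise, which boosts large weights and suppresses small ones.
    return l1_normalise([w ** power for w in weights])


def two_stage_attention(scores, scale=1.0, power=3.0):
    # Apply normalisation, then sharpening, to one row of scores.
    return sharpen_stage(normalise_stage(scores, scale), power)


# Toy scores for one query against four keys (made-up values).
scores = [2.0, 0.5, -1.0, 0.1]
stage1 = normalise_stage(scores)
stage2 = sharpen_stage(stage1)
print("after normalisation:", stage1)
print("after sharpening:   ", stage2)
```

Under these assumptions, any `power` greater than 1 increases the share of the largest weight and shrinks the smallest, while both stages keep the weights non-negative and summing to 1. Causal masking and the value aggregation step are omitted for brevity.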

Bo Gao, Michael W. Spratling, Letizia Gionfrida • 2025

Related benchmarks

Task	Dataset	Result
Sentence Completion	HellaSwag	--	440
Question Answering	ARC Easy	Normalized Accuracy 40.57	420
Question Answering	ARC Challenge	Normalized Accuracy 22.61	105
Question Answering	MMLU	Accuracy 22.97	74
Language Modeling	FineWeb 10B 2K sequence length	Validation Loss 3.193	16
Language Modeling	FineWeb 10B 4K sequence length	Validation Loss 3.2291	16
Language Modeling	FineWeb 10B 8K sequence length	Validation Loss 3.3171	16
Language Modeling	FineWeb 10B 1K sequence length	Validation Loss 3.1782	16
Question Answering	PIQA	Normalized Accuracy 65.34	13
Summarization	SummScreen	ROUGE-1 6.309	2

