Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models
About
Large language models have achieved remarkable success in recent years, primarily due to self-attention. However, traditional Softmax attention suffers from numerical instability and reduced performance as the number of inference tokens increases. This work addresses these issues by proposing a new design principle for attention, viewing it as a two-stage process. The first stage (normalisation) refines standard attention by replacing Softmax with the more numerically stable Softplus followed by $l_{1}$-normalisation. Furthermore, we introduce a dynamic scale factor based on invariance entropy. We show that this novel attention mechanism outperforms both conventional Softmax attention and state-of-the-art Softmax-free alternatives. Our second proposal is a second processing stage (sharpening), which consists of a re-weighting mechanism that amplifies significant attentional weights while diminishing weaker ones. This enables the model to concentrate more effectively on relevant tokens, mitigating the attention sink phenomenon and fundamentally improving length extrapolation. This two-stage replacement for self-attention is shown to ensure numerical stability and dramatically improve length extrapolation, maintaining a nearly constant validation loss at 16$\times$ the training length while achieving superior results on challenging long-context retrieval tasks and downstream benchmarks. Furthermore, symbolic regression experiments demonstrate that our method enables models to recover Newton's gravitational law from orbital trajectory sequences, providing evidence that appropriate attention mechanisms are crucial for foundation models to develop genuine physical world models. Our code is available at https://github.com/iminfine/freeattn.
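The two-stage idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the form of the dynamic scale factor or of the re-weighting function, so the `scale` default (the usual `1/sqrt(d)`) and the power-based sharpening with exponent `power` are placeholder assumptions, and the function name `two_stage_attention` is hypothetical.

```python
import numpy as np


def softplus(x):
    # Numerically stable softplus, log(1 + exp(x)), via logaddexp.
    return np.logaddexp(0.0, x)


def two_stage_attention(q, k, v, scale=None, power=2.0, causal=True):
    """Sketch of Softplus + l1 normalisation followed by a sharpening step.

    q, k, v: arrays of shape (n, d). `scale` and `power` are assumptions;
    the paper's dynamic scale factor and re-weighting rule may differ.
    """
    n, d = q.shape
    if scale is None:
        scale = 1.0 / np.sqrt(d)  # placeholder, not the paper's dynamic scale

    scores = (q @ k.T) * scale
    weights = softplus(scores)  # non-negative, replaces exp() in Softmax

    if causal:
        # Zero out attention to future positions.
        mask = np.tril(np.ones((n, n), dtype=bool))
        weights = np.where(mask, weights, 0.0)

    # Stage 1 (normalisation): l1-normalise each row so it sums to 1.
    normalised = weights / weights.sum(axis=-1, keepdims=True)

    # Stage 2 (sharpening): assumed power re-weighting, then renormalise.
    # Raising to power > 1 amplifies large weights and shrinks small ones.
    sharpened = normalised ** power
    sharpened = sharpened / sharpened.sum(axis=-1, keepdims=True)

    return sharpened @ v, normalised, sharpened


# Usage on random inputs.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 8)) for _ in range(3))
out, stage1, stage2 = two_stage_attention(q, k, v)
```

With any `power` greater than 1, the largest weight in each row of `stage2` is at least as large as in `stage1`, which is the "amplify significant weights, diminish weaker ones" behaviour the abstract describes.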
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Question Answering | ARC Easy | Normalized Accuracy | 40.57 | 391 |
| Sentence Completion | HellaSwag | -- | -- | 364 |
| Question Answering | ARC Challenge | Normalized Accuracy | 22.61 | 105 |
| Question Answering | MMLU | Accuracy | 22.97 | 74 |
| Language Modeling | FineWeb 10B 2K sequence length | Validation Loss | 3.193 | 16 |
| Language Modeling | FineWeb 10B 4K sequence length | Validation Loss | 3.2291 | 16 |
| Language Modeling | FineWeb 10B 8K sequence length | Validation Loss | 3.3171 | 16 |
| Language Modeling | FineWeb 10B 1K sequence length | Validation Loss | 3.1782 | 16 |
| Question Answering | PIQA | Normalized Accuracy | 65.34 | 10 |
| Summarization | SummScreen | ROUGE-1 | 6.309 | 2 |