Spectral Conditioning of Attention Improves Transformer Performance
About
We present a theoretical analysis of the Jacobian of an attention block within a transformer, showing that it is governed by the query, key, and value projections that define the attention mechanism. Leveraging this insight, we introduce a method that systematically alters the spectral properties of each attention layer to reduce the Jacobian's condition number, thereby improving the overall conditioning of the attention layers within a transformer network. We empirically show that this improved Jacobian conditioning translates to enhanced performance in practice. Our approach is simple, broadly applicable, and can be easily integrated as a drop-in replacement for a wide range of existing attention mechanisms. We validate its effectiveness across diverse transformer architectures and tasks, demonstrating consistent improvements in performance.
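The abstract does not spell out the exact spectral modification applied to the query, key, and value projections. As an illustration only, one simple way to reduce a projection matrix's condition number is to clip its smallest singular values so the ratio of largest to smallest singular value stays below a chosen bound. The sketch below (the function name `clip_condition_number` and the `max_cond` parameter are our own, hypothetical choices, not the paper's API) shows this idea in NumPy:

```python
import numpy as np

def clip_condition_number(W, max_cond=10.0):
    """Raise the smallest singular values of W so that its condition
    number (sigma_max / sigma_min) does not exceed max_cond.

    This is a generic singular-value clipping sketch, not necessarily
    the method proposed in the paper.
    """
    # SVD with singular values sorted in descending order
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Floor every singular value at sigma_max / max_cond
    s_clipped = np.maximum(s, s[0] / max_cond)
    return U @ np.diag(s_clipped) @ Vt

# Stand-in for a query projection matrix of an attention head
rng = np.random.default_rng(0)
W_q = rng.normal(size=(64, 64))
W_q_cond = clip_condition_number(W_q, max_cond=10.0)

# The conditioned matrix's condition number is now capped at max_cond
s = np.linalg.svd(W_q_cond, compute_uv=False)
print(s[0] / s[-1])
```

Applied to each attention layer's projections, a transformation of this kind bounds how strongly the layer can stretch some input directions relative to others, which is the intuition behind improving the Jacobian's conditioning.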
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | COCO 2017 (val) | -- | 2643 |
| Instance Segmentation | COCO 2017 (val) | APm: 0.405 | 1201 |
| Natural Language Understanding | GLUE | SST-2: 92.7 | 531 |
| Object Detection | COCO | AP50 (Box): 68.1 | 237 |
| Long-range sequence modeling | Long Range Arena (LRA) | Text Accuracy: 64.8 | 177 |