Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding
About
Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies and adapt to diverse tasks. Additionally, most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. To address this, we propose conTextualized equivariAnt Position Encoding (TAPE), a novel framework that enhances positional embeddings by incorporating sequence content across layers. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. We show that TAPE can provably facilitate LLM reasoning ability by emulating a broader class of algorithms. By enforcing permutation and orthogonal equivariance, TAPE ensures the stability of positional encodings during updates, improving long-context ability. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques. Code is available at https://github.com/VITA-Group/TAPE.
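The two symmetries named in the abstract can be illustrated with a toy example. The sketch below is not taken from the TAPE codebase. It assumes positional encodings are plain 2-d vectors and that position-based scores are their pairwise dot products, and the helper names (`rotate`, `score_matrix`, `max_abs_diff`) are hypothetical. Under those assumptions it checks that applying the same orthogonal transform to every encoding leaves the scores unchanged, and that reordering the tokens reorders the score matrix in the same way.

```python
import math

def rotate(vec, theta):
    """Apply a 2D rotation (an orthogonal transform) to a 2-vector."""
    c, s = math.cos(theta), math.sin(theta)
    x, y = vec
    return (c * x - s * y, s * x + c * y)

def score_matrix(pos):
    """Pairwise dot products between positional encodings (toy attention scores)."""
    return [[p[0] * q[0] + p[1] * q[1] for q in pos] for p in pos]

def max_abs_diff(a, b):
    """Largest entrywise absolute difference between two equally sized matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Toy 2-d positional encodings for three tokens (illustrative values only).
pos = [(1.0, 0.0), (0.5, 2.0), (-1.5, 0.25)]
base = score_matrix(pos)

# Orthogonal symmetry: rotating every encoding by the same angle
# preserves all pairwise dot products (up to floating-point error).
rotated = [rotate(p, 0.7) for p in pos]
orth_gap = max_abs_diff(base, score_matrix(rotated))

# Permutation equivariance: reordering the tokens reorders the rows and
# columns of the score matrix by the same permutation.
perm = [2, 0, 1]
permuted = score_matrix([pos[i] for i in perm])
expected = [[base[i][j] for j in perm] for i in perm]
perm_gap = max_abs_diff(permuted, expected)

print(orth_gap < 1e-9, perm_gap < 1e-9)
```

Both gaps should be zero up to floating-point error. The abstract ties stability of the encodings during updates to these two properties; the actual update rule is in the linked repository and is not reproduced here.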
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Grid Navigation | 1D Grid Navigation Sequence length 128, grid width 64 (IID) | Accuracy | 93 | 12 |
| Grid Navigation | 1D Grid Navigation 64/32/0.2 (OOD-dense D) | Accuracy | 96 | 12 |
| Grid Navigation | 1D Grid Navigation OOD-sparse S 256/128/0.8 | Accuracy | 67 | 12 |
| Grid Navigation | 2D Grid Navigation Sequence length 128, grid width 64 (IID) | Accuracy | 65 | 12 |
| Grid Navigation | 2D Grid Navigation S 256/128/0.8 (OOD-sparse) | Accuracy | 53 | 12 |
| Grid Navigation | 2D Grid Navigation D 64/32/0.2 (OOD-dense) | Accuracy | 68 | 12 |
| Selective-Copy Task | Selective-Copy IID 128/128 blank/non-blank | Accuracy | 100 | 8 |
| Selective-Copy Task | Selective-Copy OOD Dense split: 64/128 blank/non-blank | Accuracy | 88 | 8 |
| Selective-Copy Task | Selective-Copy OOD Sparse 256 128 blank non-blank | Accuracy | 10 | 8 |