Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding

About

Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies and adapt to diverse tasks. Additionally, most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. To address this, we propose conTextualized equivariAnt Position Encoding (TAPE), a novel framework that enhances positional embeddings by incorporating sequence content across layers. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. We show that TAPE can provably facilitate LLM reasoning ability by emulating a broader class of algorithms. By enforcing permutation and orthogonal equivariance, TAPE ensures the stability of positional encodings during updates, improving long-context ability. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques. Code is available at https://github.com/VITA-Group/TAPE.

Jiajun Zhu, Peihao Wang, Ruisi Cai, Jason D. Lee, Pan Li, Zhangyang Wang • 2025
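
The two properties named in the abstract, permutation equivariance and orthogonal equivariance, can be illustrated with a toy positional update. The sketch below is not TAPE's layer and is not taken from the linked repository; `update_positions`, the 2-D positional vectors, and the rotation angle are all hypothetical choices made for illustration. It assumes an update whose mixing weights depend only on inner products between positional vectors, which is one simple way to obtain both properties.

```python
import math

def matmul(a, b):
    # Plain-Python matrix product for small nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(m):
    # Row-wise softmax with max subtraction for numerical stability.
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def update_positions(pos):
    # Hypothetical toy update (an assumption, not the paper's method):
    # the weights depend only on pos @ pos.T, which is unchanged by any
    # orthogonal transform applied to the feature dimension.
    weights = softmax_rows(matmul(pos, transpose(pos)))
    return matmul(weights, pos)

def rotation(theta):
    # A 2-D rotation matrix, used here as an example orthogonal matrix.
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

pos = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [-0.3, 0.8]]  # 4 tokens, 2-D positions
rot = rotation(0.7)
perm = [2, 0, 3, 1]

# Orthogonal equivariance: update(P R) should equal update(P) R.
orth_gap = max_abs_diff(
    update_positions(matmul(pos, rot)),
    matmul(update_positions(pos), rot),
)

# Permutation equivariance: update(Pi P) should equal Pi update(P).
updated = update_positions(pos)
perm_gap = max_abs_diff(
    update_positions([pos[i] for i in perm]),
    [updated[i] for i in perm],
)

print(orth_gap, perm_gap)  # both gaps should be at floating-point noise level
```

Both gaps are zero up to rounding because the softmax weights are built from inner products, so rotating the features or reordering the tokens commutes with the update.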

Related benchmarks

Task	Dataset	Result
Grid Navigation	1D Grid Navigation Sequence length 128, grid width 64 (IID)	Accuracy 93	12
Grid Navigation	1D Grid Navigation 64/32/0.2 (OOD-dense D)	Accuracy 96	12
Grid Navigation	1D Grid Navigation OOD-sparse S 256/128/0.8	Accuracy 67	12
Grid Navigation	2D Grid Navigation Sequence length 128, grid width 64 (IID)	Accuracy 65	12
Grid Navigation	2D Grid Navigation S 256/128/0.8 (OOD-sparse)	Accuracy 53	12
Grid Navigation	2D Grid Navigation D 64/32/0.2 (OOD-dense)	Accuracy 68	12
Selective-Copy Task	Selective-Copy IID 128/128 blank/non-blank	Accuracy 100	8
Selective-Copy Task	Selective-Copy OOD Dense split: 64/128 blank/non-blank	Accuracy 88	8
Selective-Copy Task	Selective-Copy OOD Sparse 256 128 blank non-blank	Accuracy 10	8

