
Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding

About

Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies and adapt to diverse tasks. Additionally, most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. To address this, we propose conTextualized equivAriant Position Encoding (TAPE), a novel framework that enhances positional embeddings by incorporating sequence content across layers. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. We show that TAPE can provably facilitate LLM reasoning ability by emulating a broader class of algorithms. By enforcing permutation and orthogonal equivariance, TAPE ensures the stability of positional encodings during updates, improving long-context ability. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques. Code is available at https://github.com/VITA-Group/TAPE.
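To make the equivariance property mentioned in the abstract concrete, the following is a minimal sketch in plain Python. It is not the TAPE update rule: `toy_update` is a hypothetical function chosen only because it is easy to verify, and the encodings, rotation angle, and permutation are made-up illustration values. The sketch checks that rotating the positional encodings before the update gives the same result as rotating after it (orthogonal equivariance), and likewise for reordering the tokens (permutation equivariance).

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def row_softmax(a):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def toy_update(pos):
    """Hypothetical update (NOT the paper's): mix the rows of `pos` using
    weights computed from the inner products pos @ pos.T."""
    weights = row_softmax(matmul(pos, transpose(pos)))
    return matmul(weights, pos)

def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Three tokens with 2-D positional encodings (illustrative values).
pos = [[1.0, 0.0], [0.5, 0.5], [-0.3, 0.8]]

# An orthogonal matrix: a 2-D rotation by an arbitrary angle.
theta = 0.7
rot = [[math.cos(theta), -math.sin(theta)],
       [math.sin(theta), math.cos(theta)]]

# A permutation matrix that cyclically reorders the three tokens.
perm = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]

# Orthogonal equivariance: update(pos @ R) should equal update(pos) @ R.
orthogonal_gap = max_abs_diff(toy_update(matmul(pos, rot)),
                              matmul(toy_update(pos), rot))

# Permutation equivariance: update(P @ pos) should equal P @ update(pos).
permutation_gap = max_abs_diff(toy_update(matmul(perm, pos)),
                               matmul(perm, toy_update(pos)))

print(orthogonal_gap, permutation_gap)
```

Both gaps come out at zero up to floating-point rounding, because the mixing weights depend only on inner products between encodings, which an orthogonal transform leaves unchanged. An update that applied an arbitrary elementwise nonlinearity to the encodings would generally fail the first check.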

Jiajun Zhu, Peihao Wang, Ruisi Cai, Jason D. Lee, Pan Li, Zhangyang Wang • 2025

Related benchmarks

Task | Dataset | Result | Rank
Grid Navigation | 1D Grid Navigation Sequence length 128, grid width 64 (IID) | Accuracy 93 | 12
Grid Navigation | 1D Grid Navigation 64/32/0.2 (OOD-dense D) | Accuracy 96 | 12
Grid Navigation | 1D Grid Navigation OOD-sparse S 256/128/0.8 | Accuracy 67 | 12
Grid Navigation | 2D Grid Navigation Sequence length 128, grid width 64 (IID) | Accuracy 65 | 12
Grid Navigation | 2D Grid Navigation S 256/128/0.8 (OOD-sparse) | Accuracy 53 | 12
Grid Navigation | 2D Grid Navigation D 64/32/0.2 (OOD-dense) | Accuracy 68 | 12
Selective-Copy Task | Selective-Copy IID 128/128 blank/non-blank | Accuracy 100 | 8
Selective-Copy Task | Selective-Copy OOD Dense split: 64/128 blank/non-blank | Accuracy 88 | 8
Selective-Copy Task | Selective-Copy OOD Sparse 256 128 blank non-blank | Accuracy 10 | 8
