
Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

About

Transformer-based language models rely on positional encoding (PE) to handle token order and to support context length extrapolation. However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims. We propose the Bayesian Attention Mechanism (BAM), a theoretical framework that formulates positional encoding as a prior within a probabilistic model. BAM unifies existing methods (e.g., NoPE and ALiBi) and motivates a new Generalized Gaussian positional prior that substantially improves long-context generalization. Empirically, BAM enables accurate information retrieval at $500\times$ the training context length, outperforming the previous state of the art in long-context retrieval accuracy while maintaining comparable perplexity and introducing minimal additional parameters.
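
To make the framing concrete, the sketch below shows one plausible way a positional prior of this kind can enter attention: as an additive log-prior bias on the attention logits, where a generalized-Gaussian shape parameter interpolates between an ALiBi-like linear distance penalty and a flat (NoPE-like) prior. This is a minimal illustration based only on the abstract; the function names and the parameters `sigma` and `beta` are assumptions, not the authors' implementation.

```python
import torch

def log_positional_prior(seq_len, sigma=64.0, beta=2.0):
    # |i - j| distance matrix over positions 0..seq_len-1.
    pos = torch.arange(seq_len, dtype=torch.float32)
    dist = (pos[:, None] - pos[None, :]).abs()
    # Unnormalized log-density of a generalized Gaussian in the distance:
    # beta = 2 is Gaussian-shaped, beta = 1 is an ALiBi-like linear penalty,
    # and sigma -> infinity flattens the prior toward NoPE.
    return -(dist / sigma) ** beta

def attention_with_prior(q, k, v, sigma=64.0, beta=2.0):
    # Standard scaled dot-product attention, with the log prior added to the
    # logits so the softmax combines content scores and the positional prior.
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5
    logits = logits + log_positional_prior(q.shape[-2], sigma, beta)
    weights = torch.softmax(logits, dim=-1)
    return weights @ v

# Tiny usage example: one head, batch of 1, sequence length 16, dim 32.
q = torch.randn(1, 16, 32)
k = torch.randn(1, 16, 32)
v = torch.randn(1, 16, 32)
print(attention_with_prior(q, k, v).shape)  # torch.Size([1, 16, 32])
```

With `beta = 1` the bias reduces to an ALiBi-style linear penalty (with slope `1/sigma`), and letting `sigma` grow large recovers a near-uniform prior, i.e. NoPE-like behavior; how BAM actually parameterizes and learns the prior is detailed in the paper, not here.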

Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Language Modeling | FineWeb (val) | – | 156 |
| Language Understanding | MMLU (test) | – | 136 |
| Commonsense Reasoning | ARC-E | Accuracy: 57.7 | 62 |
| Needle-in-a-Haystack | Needle-in-a-Haystack | Accuracy: 100 | 44 |
| Needle-in-a-Haystack | Ruler NIAH (Single 2) | Accuracy: 1 | 25 |
| Language Modeling | WikiText (held-out) | Perplexity (PPL): 18.6897 | 25 |
| Long-context Language Understanding | LongBench v2 | Overall Accuracy: 28.6 | 20 |
| Needle-in-a-Haystack | Ruler NIAH (Single 3) | Accuracy: 84 | 13 |
| Long-context retrieval (Single 1) | RULER | Retrieval Accuracy @ 1024 Context: 100 | 8 |
| Commonsense Reasoning | ARC Challenge (test) | Accuracy: 41.32 | 2 |

Showing 10 of 14 rows.
