LUCID: Attention with Preconditioned Representations

About

Softmax-based dot-product attention is a cornerstone of Transformer architectures, enabling remarkable capabilities such as in-context learning. However, as context lengths increase, a fundamental limitation of the softmax function emerges: it tends to diffuse probability mass to irrelevant tokens, degrading performance in long-sequence scenarios. Furthermore, attempts to sharpen focus by lowering the softmax temperature hinder learnability due to vanishing gradients. We introduce LUCID Attention, an architectural modification that applies a preconditioner to the attention probabilities. This preconditioner, derived from exponentiated key-key similarities, minimizes overlap between the keys in a Reproducing Kernel Hilbert Space, allowing the query to focus accurately on the important keys among a large number of keys, with the same computational complexity as standard attention. Additionally, LUCID's preconditioning-based approach to retrieval bypasses the need for a low temperature and the learnability problems associated with it. We validate our approach by training ~1 billion parameter language models and evaluating them on up to 128K tokens. Our results demonstrate significant gains on long-context retrieval tasks, specifically retrieval tasks from BABILong, RULER, SCROLLS, and LongBench. For instance, LUCID achieves up to 18% improvement on BABILong and 14% improvement on RULER multi-needle performance compared to standard attention.

Sai Surya Duvvuri, Nirmal Patel, Nilesh Gupta, Inderjit S. Dhillon • 2026
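
The two softmax limitations named in the abstract can be illustrated with a toy calculation. The sketch below is not LUCID and is not taken from the paper: it only shows standard softmax behavior under assumed conditions, with one relevant key whose logit exceeds all distractor logits by a hypothetical `gap`, and with illustrative function names (`relevant_mass`, `grad_wrt_relevant_logit`). As the number of keys grows, the probability on the relevant key shrinks; lowering the temperature restores the focus, but the gradient with respect to the relevant logit collapses toward zero.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()  # subtract the max so exp() cannot overflow
    e = np.exp(z)
    return e / e.sum()


def relevant_mass(n_keys, gap=5.0, temperature=1.0):
    """Probability on one relevant key (logit = gap) among
    n_keys - 1 distractor keys (logit = 0). `gap` is an assumed value."""
    logits = np.zeros(n_keys)
    logits[0] = gap
    return softmax(logits, temperature)[0]


def grad_wrt_relevant_logit(n_keys, gap=5.0, temperature=1.0):
    """Derivative of the relevant key's probability with respect to its
    own logit: p0 * (1 - p0) / temperature (the softmax Jacobian diagonal)."""
    p0 = relevant_mass(n_keys, gap, temperature)
    return p0 * (1.0 - p0) / temperature


if __name__ == "__main__":
    for n in (128, 8192, 131072):
        print(
            f"n={n:>6}  "
            f"mass(T=1.0)={relevant_mass(n):.4f}  "
            f"mass(T=0.1)={relevant_mass(n, temperature=0.1):.4f}  "
            f"grad(T=1.0)={grad_wrt_relevant_logit(n):.2e}  "
            f"grad(T=0.1)={grad_wrt_relevant_logit(n, temperature=0.1):.2e}"
        )
```

With a fixed logit gap, the mass at temperature 1 is `exp(gap) / (exp(gap) + n - 1)`, so it decays roughly as 1/n. At temperature 0.1 the distribution saturates on the relevant key, which is exactly the regime where `p0 * (1 - p0)` vanishes.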

Related benchmarks

Task	Dataset	Result
Single-Doc Question Answering	LongBench	MultifieldQA Score: 0.149	75
Long-context Question Answering	LongBench (test)	HotpotQA: 8.6	69
Summarization	SCROLLS	ROUGE-1: 14.83	8

