
Krause Synchronization Transformers

About

Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that favor convergence toward a dominant mode, a behavior associated with representation collapse and attention sink phenomena. We introduce Krause Attention, a principled attention mechanism inspired by bounded-confidence consensus dynamics. Krause Attention replaces similarity-based global aggregation with distance-based, localized, and selectively sparse interactions, promoting structured local synchronization instead of global mixing. We relate this behavior to recent theory modeling Transformer dynamics as interacting particle systems, and show how bounded-confidence interactions naturally moderate attention concentration and alleviate attention sinks. Restricting interactions to local neighborhoods also reduces runtime complexity from quadratic to linear in sequence length. Experiments across vision (ViT on CIFAR/ImageNet), autoregressive generation (MNIST/CIFAR-10), and large language models (Llama/Qwen) demonstrate consistent gains with substantially reduced computation, highlighting bounded-confidence dynamics as a scalable and effective inductive bias for attention.
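The core idea — replacing globally normalized softmax attention with distance-based, bounded-confidence averaging — can be illustrated with a minimal sketch. This is an assumption-laden toy version in the style of Hegselmann–Krause consensus dynamics, not the paper's exact formulation: each token averages uniformly over tokens whose representations lie within a confidence radius `eps` (the name `krause_attention` and the uniform weighting are illustrative choices), instead of softmax-weighting all tokens.

```python
import numpy as np

def krause_attention(x, eps=1.0):
    """Toy bounded-confidence (Hegselmann-Krause style) attention sketch.

    x:   (n, d) array of token representations.
    eps: confidence radius; tokens interact only with neighbors
         within this Euclidean distance (self is always included).
    Returns the (n, d) array of locally averaged representations.
    """
    # Pairwise Euclidean distances between all tokens.
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    # Bounded-confidence neighborhood mask: no global competition,
    # only tokens within distance eps influence each other.
    mask = dist <= eps
    # Uniform weights over each token's neighborhood (rows sum to 1).
    w = mask / mask.sum(axis=1, keepdims=True)
    return w @ x
```

With a small `eps`, distant tokens ignore each other and representations stay put; with a large `eps`, every token averages over all others and the update collapses toward the global mean — the synchronization behavior the abstract contrasts with softmax attention. The actual mechanism additionally restricts interactions to local sequence neighborhoods, which is what reduces complexity from quadratic to linear.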

Jingkun Liu, Yisong Yue, Max Welling, Yue Song · 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Commonsense Reasoning | PIQA | Accuracy: 77.77 | 751 |
| Instruction Following | IFEval | -- | 625 |
| Question Answering | BoolQ | Accuracy: 84.78 | 317 |
| Image Classification | Fashion MNIST | Accuracy: 96.1 | 300 |
| Image Classification | CIFAR-10 | Accuracy: 95.35 | 246 |
| Image Classification | ImageNet-1K | Accuracy: 75.69 | 193 |
| Reasoning | PIQA | Accuracy: 73.7 | 145 |
| Language Understanding | MMLU-Pro | Accuracy: 41.67 | 87 |
| Natural Language Inference | MNLI | Accuracy: 83.83 | 22 |
| Image Generation | MNIST (test) | -- | 13 |

Showing 10 of 15 rows.
