Norm$\times$Direction: Restoring the Missing Query Norm in Vision Linear Attention

About

Linear attention mitigates the quadratic complexity of softmax attention but suffers from a critical loss of expressiveness. We identify two primary causes: (1) the normalization operation cancels the query norm, which breaks the correlation that softmax attention exhibits between a query's norm and the spikiness (entropy) of the attention distribution; (2) standard techniques for enforcing non-negativity cause destructive information loss by nullifying valid inner-product interactions. To address these challenges, we introduce NaLaFormer, a novel linear attention mechanism built upon a norm$\times$direction (ND) decomposition of the query and key vectors. We leverage each component to solve a distinct problem. The query norm is injected into our kernel to create a query-norm-aware map that restores the attention distribution's spikiness. The direction vectors are processed by a geometric, cosine-based similarity metric that guarantees non-negativity while preserving the fine-grained information of the inner product. We validate NaLaFormer through a comprehensive multi-modal evaluation, where it sets new state-of-the-art results for linear attention. Our model achieves up to a 7.5% accuracy gain on ImageNet-1K and a 4.7% mIoU improvement on ADE20K over comparable baselines. It is also memory-efficient, reducing peak memory by 92.3% in token-intensive super-resolution tasks (70K+ tokens). NaLaFormer's versatility is further confirmed as it surpasses strong baselines like Mamba on common-sense reasoning and sets a new state of the art on the Long Range Arena (LRA) benchmark. Code is available at https://github.com/ZacharyMeng/NaLaFormer .
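The first cause named in the abstract, the cancellation of the query norm, can be illustrated with a small NumPy sketch. This is not the NaLaFormer kernel: the ReLU feature map, the function names, and the toy dimensions are assumptions chosen only to show the effect. Scaling a query sharpens the softmax attention distribution, while normalized linear attention with a positively homogeneous feature map returns the same weights for any positive rescaling, so only the query's direction matters.

```python
import numpy as np

def softmax_weights(q, K):
    """Softmax attention weights of a single query q over the rows of K."""
    scores = K @ q / np.sqrt(q.shape[0])
    scores = scores - scores.max()  # subtract the max for numerical stability
    w = np.exp(scores)
    return w / w.sum()

def linear_weights(q, K, eps=1e-12):
    """Normalized linear attention weights.

    Assumption: a ReLU feature map, used here only as a common
    non-negative choice; it is not the kernel proposed in the paper.
    """
    phi_q = np.maximum(q, 0.0)
    phi_K = np.maximum(K, 0.0)
    sims = phi_K @ phi_q              # non-negative similarities
    return sims / (sims.sum() + eps)  # the normalization divides out the query scale

def entropy(w):
    """Shannon entropy (in nats) of a probability vector."""
    w = w[w > 0]
    return float(-(w * np.log(w)).sum())

rng = np.random.default_rng(0)
d, n = 16, 64                         # toy sizes, chosen arbitrarily
q = rng.normal(size=d)
K = rng.normal(size=(n, d))

# Norm x direction split of the query.
q_norm = np.linalg.norm(q)
q_dir = q / q_norm

for scale in (1.0, 4.0):
    h_soft = entropy(softmax_weights(scale * q, K))
    h_lin = entropy(linear_weights(scale * q, K))
    print(f"scale={scale}: softmax entropy={h_soft:.4f}, linear entropy={h_lin:.4f}")

# The linear weights computed from the unit direction match those from q itself.
print(np.allclose(linear_weights(q_dir, K), linear_weights(q, K)))
```

In this sketch the softmax entropy drops as the query is scaled up, whereas the linear attention entropy stays the same. That is the missing norm-to-spikiness relationship the abstract proposes to restore by injecting the query norm into the kernel.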

Weikang Meng, Yadan Luo, Liangyu Huo, Yingjian Li, Yaowei Wang, Xin Li, Zheng Zhang • 2025

Related benchmarks

Task	Dataset	Result
Object Detection	COCO 2017 (val)	--	2843
Image Classification	ImageNet-1K 1.0 (val)	Top-1 Accuracy 85.7	2238
Semantic segmentation	ADE20K	mIoU 48.5	1028
Image Super-resolution	Set5	PSNR 34.81	774
Semantic segmentation	Cityscapes	mIoU 83.5	494
Object Detection	COCO	AP50 (Box) 71.2	237
Long-range sequence modeling	Long Range Arena (LRA)	--	177
Image Generation	ImageNet-1k (val)	FID 53.08	106
Semantic segmentation	ADE20K	mIoU 48.5	71
Image Super-resolution	Set14	PSNR 30.71	50

Showing 10 of 14 rows
