Elastic Attention Cores for Scalable Vision Transformers

About

Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch and propagated across layers. Because the $N$ image patches only directly interact with a resolution-invariant set of $C$ learned "core" embeddings, this yields linear complexity $O(N)$ for predetermined $C$, which bypasses quadratic scaling. Compared to prior cross-attention architectures, VECA maintains and iteratively updates the full set of $N$ input tokens, avoiding a small $C$-way bottleneck. Combined with nested training along the core axis, our model can elastically trade off compute and accuracy during inference. Across classification and dense tasks, VECA achieves performance competitive with the latest vision foundation models while reducing computational cost. Our results establish elastic core-periphery attention as a scalable alternative building block for Vision Transformers.
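The core-periphery pattern described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`cross_attend`, `core_periphery_layer`, `score_counts`), the read-then-write ordering, the residual updates, the absence of learned projections, and the prefix-slicing used to mimic elastic inference are all assumptions made for illustration. The sketch shows only that patches never attend to each other, so the number of attention scores grows as $2NC$ rather than $N^2$.

```python
import math
import random


def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, memory):
    # Each query attends over `memory`, which serves as both keys and values.
    # Cost: len(queries) * len(memory) dot products.
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in memory]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, memory))
                    for j in range(d)])
    return out


def add(xs, ys):
    # Element-wise residual sum of two equally shaped lists of vectors.
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def core_periphery_layer(patches, cores, num_active_cores=None):
    # Hypothetical layer: cores read from patches, then patches read from cores.
    # Slicing a prefix of the cores is an assumed stand-in for elastic inference.
    active = cores[:num_active_cores]  # None keeps every core
    active = add(active, cross_attend(active, patches))    # C x N scores
    patches = add(patches, cross_attend(patches, active))  # N x C scores
    return patches, active


def score_counts(n, c):
    # Attention scores per layer: all-to-all vs. the core-mediated sketch above.
    return {"full": n * n, "core": 2 * n * c}


random.seed(0)
N, C, D = 64, 8, 4  # illustrative sizes, not taken from the paper
patches = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
cores = [[random.gauss(0, 1) for _ in range(D)] for _ in range(C)]
new_patches, new_cores = core_periphery_layer(patches, cores, num_active_cores=4)
```

In this sketch all $N$ patch tokens are kept and updated, while the cores act only as the relay between them. Doubling `n` in `score_counts` doubles the `"core"` count but quadruples the `"full"` count, which is the linear-versus-quadratic contrast the abstract refers to.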

Alan Z. Song, Yinjie Chen, Mu Nan, Rui Zhang, Jiahang Cao, Weijian Mai, Muquan Yu, Hossein Adeli, Deva Ramanan, Michael J. Tarr, Andrew F. Luo • 2026

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU 55.54	3089
Image Classification	CIFAR-10	--	973
Image Classification	ImageNet V2	Top-1 Acc 76.91	767
Semantic segmentation	ADE20K	mIoU 50.69	699
Semantic segmentation	Cityscapes	mIoU 69.54	526
Semantic segmentation	COCO Stuff	mIoU 47.92	421
Image Classification	CIFAR-100	--	375
Fine-grained Image Classification	CUB-200 2011	Accuracy 88.02	317
Semantic segmentation	Pascal VOC	mIoU 0.8707	295
Image Classification	ImageNet-ReaL	Precision@1 89.71	287

Showing 10 of 32 rows
