Learning Sparse Visual Representations via Spatial-Semantic Factorization

About

Self-supervised learning (SSL) faces a fundamental conflict between semantic understanding and image reconstruction. High-level semantic SSL (e.g., DINO) relies on global tokens that are forced to be location-invariant for augmentation alignment, a process that inherently discards the spatial coordinates required for reconstruction. Conversely, generative SSL (e.g., MAE) preserves dense feature grids for reconstruction but fails to produce high-level abstractions. We introduce STELLAR, a framework that resolves this tension by factorizing visual features into a low-rank product of semantic concepts and their spatial distributions. This disentanglement allows us to perform DINO-style augmentation alignment on the semantic tokens while maintaining the precise spatial mapping in the localization matrix necessary for pixel-level reconstruction. We demonstrate that as few as 16 sparse tokens under this factorized form are sufficient to simultaneously support high-quality reconstruction (2.60 FID) and match the semantic performance of dense backbones (79.10% ImageNet accuracy). Our results highlight STELLAR as a versatile sparse representation that bridges the gap between discriminative and generative vision by strategically separating semantic identity from spatial geometry. Code available at https://aka.ms/stellar.
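The low-rank factorization described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the random data, the 14x14 patch grid, the feature dimension of 64, and the softmax-normalized localization matrix are all assumptions made for illustration. It shows only the algebraic point that a dense feature grid built as (localization matrix) x (semantic tokens) has rank at most the number of tokens, and that moving patches around changes the localization matrix while leaving the semantic tokens untouched.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compose_dense_features(localization, tokens):
    """Rebuild an (N, D) dense feature grid from an (N, K) localization
    matrix and (K, D) semantic tokens. Hypothetical helper, for illustration."""
    return localization @ tokens


rng = np.random.default_rng(0)

# Assumed sizes: 14x14 = 196 patches, 16 sparse tokens (as in the abstract),
# and an arbitrary feature dimension of 64.
num_patches, num_tokens, dim = 196, 16, 64

# Random stand-ins for learned quantities.
tokens = rng.normal(size=(num_tokens, dim))                     # "what"
localization = softmax(rng.normal(size=(num_patches, num_tokens)), axis=1)  # "where"

dense = compose_dense_features(localization, tokens)
rank = np.linalg.matrix_rank(dense)  # bounded by num_tokens

# A spatial rearrangement of patches (here a random permutation) acts only
# on the rows of the localization matrix; the tokens are reused unchanged.
perm = rng.permutation(num_patches)
dense_moved = compose_dense_features(localization[perm], tokens)

print("dense shape:", dense.shape)
print("rank <= num_tokens:", rank <= num_tokens)
print("permutation absorbed by localization:", np.allclose(dense_moved, dense[perm]))
```

In this toy setting the tokens play the role of the location-invariant part that could be aligned across augmentations, while the localization matrix carries the spatial information a decoder would need for reconstruction.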

Theodore Zhengde Zhao, Sid Kiblawi, Jianwei Yang, Naoto Usuyama, Reuben Tan, Noel C Codella, Tristan Naumann, Hoifung Poon, Mu Wei • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Semantic segmentation	ADE20K (val)	mIoU	41.98	3069
Semantic segmentation	ADE20K	mIoU	36.66	1028
Semantic segmentation	Cityscapes	mIoU	33.3	668
Image Classification	Food-101	Accuracy	77.43	570
Semantic segmentation	Pascal VOC	mIoU	0.859	280
Image Classification	Oxford-IIIT Pet	Accuracy	92.53	219
Image Classification	ImageNet-1k (val)	Accuracy	80.05	199
Text-to-Image Retrieval	MS-COCO	--	--	187
Image Reconstruction	ImageNet1K (val)	FID	2.6	124
Image Classification	ImageNet-1K	Top-1 Acc	51.53	75

Showing 10 of 15 rows
