SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation

About

Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. Code: https://github.com/naomikombol/SPAR
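The cost argument and the distillation objective described above can be illustrated with a minimal sketch. It is not taken from the SPAR code base: the function names, the border-handling rule, and the use of mean squared error as the feature regression loss are all assumptions made for illustration, since the abstract does not specify them.

```python
def window_starts(size: int, window: int, stride: int) -> list[int]:
    """Start offsets of sliding windows along one image axis.

    Assumption: a final window is placed flush with the border when the
    stride would otherwise leave part of the image uncovered.
    """
    if size <= window:
        return [0]
    starts = list(range(0, size - window + 1, stride))
    if starts[-1] + window < size:
        starts.append(size - window)
    return starts


def num_forward_passes(height: int, width: int, window: int, stride: int) -> int:
    """Number of ViT forward passes a sliding-window teacher would need."""
    rows = window_starts(height, window, stride)
    cols = window_starts(width, window, stride)
    return len(rows) * len(cols)


def feature_regression_loss(student: list[list[float]],
                            teacher: list[list[float]]) -> float:
    """Mean squared error between student and teacher feature vectors.

    Assumption: MSE stands in for the paper's feature regression loss.
    Each inner list is one spatial location's feature vector.
    """
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student, teacher):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    # Hypothetical setting: 448x448 image, 224x224 pre-training resolution.
    for stride in (224, 112):
        print(stride, num_forward_passes(448, 448, 224, stride))
```

Under these assumptions, a 448x448 image with a 224-pixel window needs 4 forward passes at stride 224 and 9 at stride 112, whereas a single-pass student always needs one. This growth in passes as the stride shrinks is the cost that distilling into a single-pass model is meant to avoid.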

Naomi Kombol, Ivan Martinović, Siniša Šegvić, Giorgos Tolias • 2026

Related benchmarks

Task	Dataset	Result
Open Vocabulary Semantic Segmentation	Pascal VOC 20	mIoU 91.5	113
Open Vocabulary Semantic Segmentation	Pascal Context PC-59	mIoU 41.5	99
Open Vocabulary Semantic Segmentation	Cityscapes	mIoU 40.1	81
Open Vocabulary Semantic Segmentation	ADE20K	mIoU 26.1	80
Open Vocabulary Semantic Segmentation	Pascal VOC 21	mIoU 51.2	41
Open Vocabulary Semantic Segmentation	Pascal Context 60	mIoU 37.1	38

