MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation

About

Pretrained vision-language models (VLMs), e.g., CLIP, are increasingly used to bridge the gap between open- and closed-vocabulary recognition in open-vocabulary image segmentation. As VLMs are generally pretrained with low-resolution images (e.g., $224\times224$), most previous methods operate only on downscaled images. We question this design, as low-resolution features often fail to preserve fine details. A typical solution is to employ additional image backbones for high-resolution inputs, but this also introduces significant computation overhead. Therefore, we propose MROVSeg, a multi-resolution training framework for open-vocabulary image segmentation with a single pretrained CLIP backbone. MROVSeg uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder. Its key components include a Multi-Res Adapter, which restores the spatial geometry and grasps local-global correspondences across patches by interacting with multi-resolution features. To achieve accurate segmentation, we introduce a Multi-grained Masked Attention scheme that aggregates multi-grained semantics from multi-resolution CLIP features into object queries. Through comprehensive experiments, we demonstrate the superiority of MROVSeg on well-established open-vocabulary image segmentation benchmarks, establishing new standards for open-vocabulary image segmentation.
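The sliding-window slicing mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `slice_into_windows`, the default stride, and the requirement that the image size fit the window grid exactly are all assumptions made for illustration. Only the $224\times224$ window size comes from the text above.

```python
import numpy as np


def slice_into_windows(image, window=224, stride=224):
    """Split an (H, W, C) image into (N, window, window, C) patches.

    Hypothetical helper for illustration only. It assumes the image
    fits the window grid exactly, i.e. (H - window) and (W - window)
    are non-negative multiples of `stride`; padding is not handled.
    """
    height, width, _ = image.shape
    if height < window or width < window:
        raise ValueError("image is smaller than the window")
    if (height - window) % stride or (width - window) % stride:
        raise ValueError("image size does not fit the window grid")

    patches = []
    # Walk the image in row-major order: top to bottom, left to right.
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            patches.append(image[top:top + window, left:left + window])
    return np.stack(patches)


# A 448 x 672 input yields a 2 x 3 grid of encoder-sized patches.
high_res = np.zeros((448, 672, 3), dtype=np.float32)
patches = slice_into_windows(high_res)
print(patches.shape)  # (6, 224, 224, 3)
```

Each patch then has the shape a $224\times224$ image encoder expects. A stride smaller than the window would produce overlapping patches; whether MROVSeg uses overlap is not stated in the abstract.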

Yuanbing Zhu, Bingke Zhu, Yingying Chen, Yunfang Niu, Ming Tang, Jinqiao Wang • 2024

Related benchmarks

Task	Dataset	Result
Open Vocabulary Semantic Segmentation	Pascal VOC 20	mIoU 97.6	113
Open Vocabulary Semantic Segmentation	Pascal Context PC-59	mIoU 64.1	99
Open Vocabulary Semantic Segmentation	ADE20K A-150	mIoU 36.9	79
Open Vocabulary Semantic Segmentation	ADE-847	mIoU 16.1	63
Open Vocabulary Semantic Segmentation	PC-459	mIoU 24.1	47
