MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation

About

Pretrained vision-language models (VLMs), e.g., CLIP, are increasingly used to bridge the gap between open- and closed-vocabulary recognition in open-vocabulary image segmentation. As VLMs are generally pretrained with low-resolution images (e.g., $224\times224$), most previous methods operate only on downscaled images. We question this design, as low-resolution features often fail to preserve fine details. A typical solution is to employ additional image backbones for high-resolution inputs, but this also introduces significant computation overhead. Therefore, we propose MROVSeg, a multi-resolution training framework for open-vocabulary image segmentation with a single pretrained CLIP backbone. It uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder. Its key components include a Multi-Res Adapter, which restores the spatial geometry and grasps local-global correspondences across patches by interacting with multi-resolution features. To achieve accurate segmentation, we introduce a Multi-grained Masked Attention scheme to aggregate multi-grained semantics from multi-resolution CLIP features to object queries. Through comprehensive experiments, we demonstrate the superiority of MROVSeg on well-established open-vocabulary image segmentation benchmarks, establishing new standards for open-vocabulary image segmentation.

Yuanbing Zhu, Bingke Zhu, Yingying Chen, Yunfang Niu, Ming Tang, Jinqiao Wang • 2024
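
The sliding-window slicing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `slice_into_windows` and `merge_windows`, the non-overlapping stride, the zero padding, and the NumPy array layout are all assumptions made for illustration. Only the window size of 224 comes from the abstract.

```python
import numpy as np


def slice_into_windows(image, window=224):
    """Split an (H, W, C) image into non-overlapping window x window tiles.

    Assumption: the image is zero-padded at the bottom and right so that
    both sides become multiples of `window`. Returns (tiles, grid), where
    tiles has shape (rows * cols, window, window, C) in row-major order
    and grid is (rows, cols).
    """
    h, w, c = image.shape
    rows = -(-h // window)  # ceiling division
    cols = -(-w // window)
    padded = np.zeros((rows * window, cols * window, c), dtype=image.dtype)
    padded[:h, :w] = image
    tiles = (
        padded.reshape(rows, window, cols, window, c)
        .transpose(0, 2, 1, 3, 4)  # -> (rows, cols, window, window, C)
        .reshape(rows * cols, window, window, c)
    )
    return tiles, (rows, cols)


def merge_windows(tiles, grid):
    """Inverse of slice_into_windows: put the tiles back on their grid."""
    rows, cols = grid
    _, wh, ww, c = tiles.shape
    return (
        tiles.reshape(rows, cols, wh, ww, c)
        .transpose(0, 2, 1, 3, 4)  # -> (rows, window, cols, window, C)
        .reshape(rows * wh, cols * ww, c)
    )


if __name__ == "__main__":
    high_res = np.random.rand(448, 672, 3).astype(np.float32)
    tiles, grid = slice_into_windows(high_res)
    print(tiles.shape, grid)
```

Each tile has the input size a 224-pixel encoder expects, so a single backbone could process all of them as one batch. Here `merge_windows` only shows the grid bookkeeping on raw pixels; in the abstract, restoring spatial geometry across patches is the role of the Multi-Res Adapter, which operates on features rather than pixels.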

Related benchmarks

Task                                    Dataset               Result (mIoU)   Rank
Open Vocabulary Semantic Segmentation   Pascal VOC 20         97.6            62
Open Vocabulary Semantic Segmentation   ADE-847               16.1            59
Open Vocabulary Semantic Segmentation   Pascal Context PC-59  64.1            57
Open Vocabulary Semantic Segmentation   ADE20K A-150          36.9            54
Open Vocabulary Semantic Segmentation   PC-459                24.1            34