OpenSD: Unified Open-Vocabulary Segmentation and Detection

About

Recently, a few open-vocabulary methods have been proposed that employ a unified architecture to tackle generic segmentation and detection tasks. However, their performance still lags behind that of task-specific models due to the conflict between different tasks, and their open-vocabulary capability is limited due to the inadequate use of CLIP. To address these challenges, we present a universal transformer-based framework, abbreviated as OpenSD, which utilizes the same architecture and network parameters to handle open-vocabulary segmentation and detection tasks. First, we introduce a decoder decoupled learning strategy to alleviate the semantic conflict between thing and stuff categories so that each individual task can be learned more effectively under the same framework. Second, to better leverage CLIP for end-to-end segmentation and detection, we propose dual classifiers to handle the in-vocabulary domain and out-of-vocabulary domain, respectively. The text encoder is further trained to be region-aware for both thing and stuff categories through decoupled prompt learning, enabling it to filter out duplicated and low-quality predictions, which is important for end-to-end segmentation and detection. Extensive experiments are conducted on multiple datasets under various circumstances. The results demonstrate that OpenSD outperforms state-of-the-art open-vocabulary segmentation and detection methods in both closed- and open-vocabulary settings. Code is available at https://github.com/strongwolf/OpenSD
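
To make the dual-classifier idea concrete, here is a minimal sketch of one way to route predictions between an in-vocabulary and an out-of-vocabulary classifier. It is not taken from the OpenSD code. The function `dual_classify`, the hard per-category routing through the `seen` flags, the cosine-similarity scoring, and the `temperature` value are all illustrative assumptions; the abstract only states that the two classifiers handle the two domains, respectively.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dual_classify(query_emb, clip_emb, text_embs, seen, temperature=0.1):
    """Hypothetical routing between two classifiers (an assumption, not the paper's rule).

    query_emb: embedding from the trained decoder (in-vocabulary classifier input).
    clip_emb:  region embedding from a frozen CLIP-like image encoder
               (out-of-vocabulary classifier input).
    text_embs: one text embedding per category name.
    seen:      per-category flag, True if the category appeared in training.
    """
    in_probs = softmax([cosine(query_emb, t) / temperature for t in text_embs])
    out_probs = softmax([cosine(clip_emb, t) / temperature for t in text_embs])
    # Seen categories take the in-vocabulary score, unseen ones the other score.
    routed = [p_in if s else p_out for p_in, p_out, s in zip(in_probs, out_probs, seen)]
    total = sum(routed)
    return [p / total for p in routed]  # renormalize so the scores sum to 1
```

Calling `dual_classify([1.0, 0.0], [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [True, False])` scores the first (seen) category with the decoder embedding and the second (unseen) category with the CLIP-like embedding. A soft ensemble of the two score sets would be an equally plausible design.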

Shuai Li, Minghan Li, Pengfei Wang, Lei Zhang • 2023

Related benchmarks

Task	Dataset	Metric	Result
Semantic segmentation	COCO	mIoU	29.1	110
Open-Vocabulary Segmentation	Cityscapes	mIoU	52.2	49
Panoptic Segmentation	COCO closed-vocabulary	PQ	58.8	18
Instance Segmentation	COCO closed-vocabulary	Mask AP	50.9	16
Semantic segmentation	COCO closed-vocabulary	mIoU	68.3	16
Semantic segmentation	ADE20K open-vocabulary	mIoU	30.8	15
Panoptic Segmentation	ADE20K open-vocabulary	PQ	23.1	14
Instance Segmentation	ADE20K open-vocabulary	Mask AP	15	13
Object Detection	COCO closed-vocabulary	AP (Box)	56.7	13
Panoptic Segmentation	Cityscapes open-vocabulary	PQ	39.6	11

Showing 10 of 20 rows
