EOV-Seg: Efficient Open-Vocabulary Panoptic Segmentation

About

Open-vocabulary panoptic segmentation aims to segment and classify everything in diverse scenes across an unbounded vocabulary. Existing methods typically employ either a two-stage or a single-stage framework. The two-stage framework crops the image multiple times using masks produced by a mask generator and then extracts features from each crop, while the single-stage framework relies on a heavyweight mask decoder to make up for the lack of spatial position information through self-attention and cross-attention in multiple stacked Transformer blocks. Both approaches incur substantial computational overhead, which hinders the efficiency of model inference. To fill this efficiency gap, we propose EOV-Seg, a novel single-stage, shared, efficient, and spatial-aware framework designed for open-vocabulary panoptic segmentation. Specifically, EOV-Seg innovates in two aspects. First, a Vocabulary-Aware Selection (VAS) module is proposed to improve the semantic comprehension of visual aggregated features and alleviate the feature interaction burden on the mask decoder. Second, we introduce Two-way Dynamic Embedding Experts (TDEE), which efficiently utilizes the spatial awareness capabilities of the ViT-based CLIP backbone. To the best of our knowledge, EOV-Seg is the first open-vocabulary panoptic segmentation framework designed for efficiency; it runs faster than state-of-the-art methods while achieving competitive performance. With COCO training only, EOV-Seg achieves 24.5 PQ, 32.1 mIoU, and 11.6 FPS on the ADE20K dataset, and its inference is 4-19 times faster than state-of-the-art methods. In particular, equipped with a ResNet50 backbone, EOV-Seg runs at 23.8 FPS with only 71M parameters on a single RTX 3090 GPU. Code is available at https://github.com/nhw649/EOV-Seg.

Hongwei Niu, Jie Hu, Jianghang Lin, Guannan Jiang, Shengchuan Zhang • 2024

Related benchmarks

Task	Dataset	Result
Open Vocabulary Semantic Segmentation	ADE20K A-150	mIoU 32.1	79
Road Extraction	Massachusetts	mIoU 54.56	67
Open Vocabulary Semantic Segmentation	PASCAL Context 59 (val)	mIoU 56.9	57
Building Extraction	INRIA	mIoU 68.32	50
Building Extraction	xBD pre	IoU 67.37	50
Building Extraction	WHUAerial	IoU 75.71	41
Flood Detection	WBS-SI	mIoU 0.5771	35
Road Segmentation	CHN6-CUG	mIoU 64.81	34
Road Extraction	SpaceNet	mIoU 0.6052	33
Semantic segmentation	OVRSISBench V2	DLRSD mIoU 20.52	31

Showing 10 of 19 rows
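
For readers interpreting the figures above, the following is a minimal sketch of how an mIoU score and a per-image latency can be derived. It is not taken from the EOV-Seg code: the function names `mean_iou` and `latency_ms` are hypothetical, the inputs are flat lists of class labels, and skipping classes that appear in neither the prediction nor the ground truth is an assumed convention. Benchmark implementations usually accumulate counts over a whole dataset rather than over one sample.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union for two flat lists of class labels.

    Assumption: classes absent from both pred and target are skipped
    rather than counted as IoU 0 or 1.
    """
    inter = [0] * num_classes      # pixels where pred == target == c
    pred_cnt = [0] * num_classes   # pixels predicted as c
    tgt_cnt = [0] * num_classes    # pixels labelled c in the ground truth
    for p, t in zip(pred, target):
        pred_cnt[p] += 1
        tgt_cnt[t] += 1
        if p == t:
            inter[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_cnt[c] + tgt_cnt[c] - inter[c]
        if union > 0:
            ious.append(inter[c] / union)
    return sum(ious) / len(ious) if ious else 0.0


def latency_ms(fps):
    """Convert frames per second into milliseconds per frame."""
    return 1000.0 / fps


if __name__ == "__main__":
    # Toy example with two classes and four pixels.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
    # Per-frame latency implied by the FPS figures quoted in the abstract.
    print(latency_ms(11.6), latency_ms(23.8))
```

Under this conversion, the 11.6 FPS and 23.8 FPS figures quoted in the abstract correspond to roughly 86 ms and 42 ms per image, respectively.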
