Aligning Bag of Regions for Open-Vocabulary Object Detection

About

Pre-trained vision-language models (VLMs) learn to align vision and language representations on large-scale datasets, where each image-text pair usually contains a bag of semantic concepts. However, existing open-vocabulary object detectors only align region embeddings individually with the corresponding features extracted from the VLMs. Such a design leaves the compositional structure of semantic concepts in a scene under-exploited, although the structure may be implicitly learned by the VLMs. In this work, we propose to align the embedding of a bag of regions, going beyond individual regions. The proposed method groups contextually interrelated regions as a bag. The embeddings of regions in a bag are treated as embeddings of words in a sentence, and they are sent to the text encoder of a VLM to obtain the bag-of-regions embedding, which is trained to align with the corresponding features extracted by a frozen VLM. Applied to the commonly used Faster R-CNN, our approach surpasses the previous best results by 4.6 box AP50 and 2.8 mask AP on novel categories of the open-vocabulary COCO and LVIS benchmarks, respectively. Code and models are available at https://github.com/wusize/ovdet.

Size Wu, Wenwei Zhang, Sheng Jin, Wentao Liu, Chen Change Loy • 2023
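
The alignment step described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation (see the linked repository for that). Two things here are assumptions: the abstract does not name the loss, so a symmetric InfoNCE-style contrastive loss is used for illustration, and the VLM text encoder is replaced by a hypothetical mean-pooling stand-in called encode_bag. All function names and the toy vectors are invented for this example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def encode_bag(region_embeddings):
    """Hypothetical stand-in for the VLM text encoder.

    The paper feeds region embeddings to the text encoder as if they were
    word embeddings; here they are simply mean-pooled and normalized.
    """
    n = len(region_embeddings)
    dim = len(region_embeddings[0])
    pooled = [sum(r[d] for r in region_embeddings) / n for d in range(dim)]
    return l2_normalize(pooled)


def info_nce(queries, keys, temperature=0.1):
    """Mean cross-entropy where query i should match key i (assumed loss)."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum - logits[i]
    return total / len(queries)


def bag_alignment_loss(bags, teacher_embeddings, temperature=0.1):
    """Align each bag embedding with the frozen teacher's embedding."""
    students = [encode_bag(b) for b in bags]
    teachers = [l2_normalize(t) for t in teacher_embeddings]
    forward = info_nce(students, teachers, temperature)
    backward = info_nce(teachers, students, temperature)
    return 0.5 * (forward + backward)


# Toy data: two bags of two 3-d region embeddings, plus one teacher
# embedding per bag (standing in for frozen-VLM image features).
bags = [
    [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
]
teachers = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.0]]

matched_loss = bag_alignment_loss(bags, teachers)
mismatched_loss = bag_alignment_loss(bags, teachers[::-1])
```

With these toy vectors, pairing each bag with its own teacher embedding gives a lower loss than pairing it with the other bag's teacher, which is the behaviour a contrastive alignment objective is meant to reward.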

Related benchmarks

Task	Dataset	Result
Object Detection	COCO 2017 (val)	--	2843
Object Detection	PASCAL VOC 2007 (test)	--	844
Object Detection	COCO (val)	mAP 36.3	637
Object Detection	LVIS v1.0 (val)	APbbox 29.5	542
Object Detection	MS-COCO 2017 (val)	--	264
Object Detection	COCO	AP50 (Box) 56.1	237
Object Detection	MS-COCO (val)	--	222
Instance Segmentation	LVIS v1.0 (val)	AP (Rare) 27.6	189
Object Detection	LVIS (val)	mAP 29.5	170
Object Detection	OV-COCO	AP50 (Novel) 42.7	168

Showing 10 of 44 rows
