Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

About

We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we propose to randomly crop and resize regions of positional embeddings instead of using the whole-image positional embeddings. This better matches the region-level use of positional embeddings in the detection finetuning phase. In addition, we replace the common softmax cross-entropy loss in contrastive learning with focal loss to better learn the informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and on zero-shot transfer. RO-ViT achieves a state-of-the-art 34.1 $AP_r$ on LVIS, surpassing the best existing approach by 7.8 points, and is competitive on zero-shot transfer detection. Surprisingly, RO-ViT also improves the image-level representation and achieves the state of the art on 9 out of 12 metrics on the COCO and Flickr image-text retrieval benchmarks, outperforming competitive approaches with larger models.
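The two pretraining changes described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The abstract gives neither the crop sampling scheme nor the exact form of the focal loss, so the uniform crop sampling, the sigmoid-based pairwise loss, and every name and default below (`bilinear_resize`, `cropped_positional_embedding`, `focal_contrastive_loss`, `min_size`, `gamma`, `temperature`) are assumptions made for illustration.

```python
import math
import random


def bilinear_resize(grid, out_h, out_w):
    """Resize an [H][W][D] grid of vectors to [out_h][out_w][D] (corner-aligned)."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Map output row i to a fractional source row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            corners = zip(grid[y0][x0], grid[y0][x1], grid[y1][x0], grid[y1][x1])
            row.append([
                (1 - wy) * ((1 - wx) * a + wx * b) + wy * ((1 - wx) * c + wx * d)
                for a, b, c, d in corners
            ])
        out.append(row)
    return out


def cropped_positional_embedding(pos, rng, min_size=2):
    """Crop a random region of the positional-embedding grid, then resize it
    back to the full grid size. Uniform sampling of the crop is an assumption."""
    full_h, full_w = len(pos), len(pos[0])
    h = rng.randint(min_size, full_h)
    w = rng.randint(min_size, full_w)
    top = rng.randint(0, full_h - h)
    left = rng.randint(0, full_w - w)
    crop = [r[left:left + w] for r in pos[top:top + h]]
    return bilinear_resize(crop, full_h, full_w)


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def focal_contrastive_loss(sim, gamma=2.0, temperature=0.1):
    """Sigmoid focal loss over all image-text pairs (assumed formulation).
    sim[i][j] is the similarity of image i and text j; matches lie on the diagonal."""
    batch = len(sim)
    total = 0.0
    for i in range(batch):
        for j in range(batch):
            p = _sigmoid(sim[i][j] / temperature)
            p_true = p if i == j else 1.0 - p
            p_true = max(p_true, 1e-12)  # guard against log(0)
            # Down-weight easy pairs (p_true near 1) by (1 - p_true) ** gamma.
            total += -((1.0 - p_true) ** gamma) * math.log(p_true)
    return total / batch
```

For example, with a 4x4 grid of 2-dimensional embeddings `pos`, `cropped_positional_embedding(pos, random.Random(0))` returns a new 4x4 grid interpolated from a random sub-region. Setting `gamma=0` in `focal_contrastive_loss` removes the focal weighting and leaves a plain sigmoid cross-entropy over all pairs.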

Dahun Kim, Anelia Angelova, Weicheng Kuo • 2023

Related benchmarks

Task	Dataset	Result
Object Detection	COCO 2017 (val)	--	2843
Object Detection	LVIS v1.0 (val)	APbbox 36.2	542
Image-to-Text Retrieval	Flickr30K 1K (test)	R@1 92.1	491
Text-to-Image Retrieval	Flickr30K 1K (test)	R@1 80.7	432
Image-to-Text Retrieval	MS-COCO 5K (test)	R@1 68.9	320
Text-to-Image Retrieval	MS-COCO 5K (test)	R@1 51.8	244
Instance Segmentation	LVIS v1.0 (val)	--	189
Object Detection	OV-COCO	AP50 (Novel) 33	168
Object Detection	Objects365 (val)	mAP 17.7	102
Instance Segmentation	LVIS	mAP (Mask) 32.9	81

Showing 10 of 22 rows
