Scaling Open-Vocabulary Object Detection
About
Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and the OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding a further large improvement: with an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (a 43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.
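The self-training recipe described above has two data-side steps: derive a per-image label space from the paired Web text, then keep only sufficiently confident pseudo-box annotations. The sketch below illustrates those two steps in plain Python; the function names, the n-gram query construction, and the single global score threshold are illustrative assumptions, not the paper's exact implementation.

```python
def caption_to_queries(caption, max_ngram=3):
    """Derive a per-image label space from the caption by enumerating
    word n-grams as detection queries (tokenization here is a
    deliberately simplified assumption)."""
    words = caption.lower().split()
    queries = []
    for n in range(1, max_ngram + 1):
        for i in range(len(words) - n + 1):
            queries.append(" ".join(words[i:i + n]))
    return queries

def filter_pseudo_boxes(pseudo_boxes, score_threshold=0.3):
    """Keep only pseudo-box annotations whose detector confidence
    exceeds a threshold, and drop images left with no boxes.
    A single global threshold is an assumption; the actual OWL-ST
    filtering criteria may differ."""
    kept = {}
    for image_id, boxes in pseudo_boxes.items():
        survivors = [b for b in boxes if b["score"] >= score_threshold]
        if survivors:
            kept[image_id] = survivors
    return kept

# Toy pseudo-annotations produced by an existing detector (made-up data).
pseudo = {
    "img0": [{"label": "a dog", "box": (0.1, 0.2, 0.5, 0.8), "score": 0.9},
             {"label": "grass", "box": (0.0, 0.6, 1.0, 1.0), "score": 0.1}],
    "img1": [{"label": "sky", "box": (0.0, 0.0, 1.0, 0.4), "score": 0.2}],
}
filtered = filter_pseudo_boxes(pseudo)  # only img0's high-confidence box survives
```

At Web scale, the filtered pseudo-annotations then serve as detection training targets, which is what lets the recipe grow the training set to over 1B examples without human box labels.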
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Object Detection | LVIS v1.0 (val) | APbbox | 50.4 | 518 |
| Crowd Counting | ShanghaiTech Part B | MAE | 81.5 | 160 |
| Object Detection | LVIS (val) | mAP | 49.4 | 141 |
| Crowd Counting | ShanghaiTech Part A | MAE | 420.2 | 138 |
| Object Detection | LVIS (minival) | AP | 54.1 | 127 |
| Object Detection | ODinW-13 | AP | 53 | 98 |
| Object Detection | LVIS mini (val) | mAP | 57.2 | 86 |
| Instance Segmentation | LVIS | mAP (Mask) | 50.4 | 68 |
| Object Detection | ODinW-35 | AP | 24.4 | 59 |
| Object Detection | COCO | AP (bbox) | 40.9 | 59 |