
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

About

We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly to further scale up the number of classes contained in existing object detection datasets. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and image regions of object proposals. Then we train a student detector, whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher. We benchmark on LVIS by holding out all rare categories as novel categories that are not seen during training. ViLD obtains 16.1 mask AP$_r$ with a ResNet-50 backbone, even outperforming the supervised counterpart by 3.8. When trained with a stronger teacher model ALIGN, ViLD achieves 26.3 AP$_r$. The model can directly transfer to other datasets without finetuning, achieving 72.2 AP$_{50}$ on PASCAL VOC, 36.6 AP on COCO and 11.8 AP on Objects365. On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP. Code and demo are open-sourced at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild.

Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui • 2021
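
The alignment described in the abstract can be illustrated with a small NumPy sketch. This is not the released implementation. It assumes, for illustration only, that alignment with text embeddings is a softmax cross-entropy over cosine similarities, and that alignment with the teacher's image embeddings is an L1 distance. The abstract states neither loss form. The function names, the temperature of 0.01, the 0.5 loss weight, and the toy data are all hypothetical, and a background category is omitted for brevity.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def text_alignment_loss(region_emb, text_emb, labels, temperature=0.01):
    """Cross-entropy over cosine similarities between regions and category texts.

    Assumed loss form; the temperature value is illustrative.
    """
    logits = l2_normalize(region_emb) @ l2_normalize(text_emb).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def image_alignment_loss(region_emb, teacher_image_emb):
    """Mean L1 distance between normalized student and teacher embeddings (assumed form)."""
    diff = l2_normalize(region_emb) - l2_normalize(teacher_image_emb)
    return float(np.abs(diff).sum(axis=1).mean())

def classify_regions(region_emb, text_emb):
    """Open-vocabulary inference: choose the category text with the highest cosine similarity."""
    sims = l2_normalize(region_emb) @ l2_normalize(text_emb).T
    return sims.argmax(axis=1)

# Toy stand-ins for teacher outputs; a real teacher would be a model such as CLIP or ALIGN.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 8))            # teacher text embeddings for 3 base categories
labels = np.array([0, 2, 1, 0])               # ground-truth category of each proposal
teacher_image_emb = text_emb[labels] + 0.1 * rng.normal(size=(4, 8))
region_emb = rng.normal(size=(4, 8))          # student region embeddings before training

# Combined objective with an illustrative weight on the image term.
total_loss = (text_alignment_loss(region_emb, text_emb, labels)
              + 0.5 * image_alignment_loss(region_emb, teacher_image_emb))
print("total loss:", total_loss)
```

Because `classify_regions` depends only on the text embeddings passed in, novel categories can be scored at inference time by embedding their names with the teacher's text encoder, without retraining the student.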

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP 36.6 | 2643 |
| Instance Segmentation | COCO 2017 (val) | -- | 1201 |
| Semantic segmentation | ADE20K | mIoU 26.4 | 1024 |
| Object Detection | PASCAL VOC 2007 (test) | mAP 36.6 | 844 |
| Object Detection | COCO (val) | mAP 51.3 | 633 |
| Object Detection | LVIS v1.0 (val) | APbbox 29.3 | 529 |
| Object Detection | COCO 2017 | AP (Box) 39.1 | 321 |
| Object Detection | DOTA 1.0 (test) | -- | 256 |
| Object Detection | COCO | AP50 (Box) 55.6 | 237 |
| Object Detection | MS-COCO 2017 (val) | -- | 237 |
Showing 10 of 61 rows
