F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

About

We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models. F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1) retains the locality-sensitive features necessary for detection, and 2) is a strong region classifier. We finetune only the detector head and combine the detector and VLM outputs for each region at inference time. F-VLM shows compelling scaling behavior and achieves a +6.5 mask AP improvement over the previous state of the art on the novel categories of the LVIS open-vocabulary detection benchmark. We also demonstrate very competitive results on the COCO open-vocabulary detection benchmark and on cross-dataset transfer detection, along with significant training speed-up and compute savings. Code will be released at https://sites.google.com/view/f-vlm/home
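The abstract says the detector and VLM outputs are combined per region at inference time but does not spell out the rule. As a minimal sketch, assuming a weighted geometric mean of the two per-category probabilities with separate weights for base and novel categories, the combination could look like the following. The function name `combine_scores` and the default weights `alpha` and `beta` are hypothetical placeholders for illustration, not values taken from this page.

```python
from typing import List, Sequence


def combine_scores(
    det_scores: Sequence[float],
    vlm_scores: Sequence[float],
    is_base: Sequence[bool],
    alpha: float = 0.35,  # assumed VLM weight for base categories (illustrative)
    beta: float = 0.65,   # assumed VLM weight for novel categories (illustrative)
) -> List[float]:
    """Combine detector and VLM probabilities for one region.

    Each category score is a weighted geometric mean:
        det ** (1 - w) * vlm ** w
    where w is alpha for base categories and beta for novel ones.
    """
    if not (len(det_scores) == len(vlm_scores) == len(is_base)):
        raise ValueError("all inputs must have one entry per category")

    combined = []
    for det, vlm, base in zip(det_scores, vlm_scores, is_base):
        w = alpha if base else beta
        combined.append((det ** (1.0 - w)) * (vlm ** w))
    return combined


# Example: one region scored against one base and one novel category.
scores = combine_scores(
    det_scores=[0.9, 0.2],
    vlm_scores=[0.8, 0.7],
    is_base=[True, False],
)
```

Under this assumption, a larger weight on the VLM term for novel categories lets the frozen VLM dominate where the detector head saw no training labels, while the detector dominates on base categories.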

Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova • 2022

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Detection | COCO 2017 (val) | -- | 2454 |
| Object Detection | COCO (val) | mAP 32.5 | 613 |
| Object Detection | LVIS v1.0 (val) | APbbox 34.9 | 518 |
| Object Detection | COCO | AP50 (Box) 53.1 | 190 |
| Instance Segmentation | LVIS v1.0 (val) | -- | 189 |
| Object Detection | OV-COCO | AP50 (Novel) 28 | 97 |
| Instance Segmentation | LVIS | mAP (Mask) 34.9 | 68 |
| Object Detection | LVIS | APr 32.8 | 59 |
| Open-vocabulary object detection | LVIS v1 (val) | AP_r^b 32.8 | 54 |
| Object Detection | Objects365 (val) | mAP 16.2 | 48 |
Showing 10 of 29 rows
