F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models
About
We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models. F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1) retains the locality-sensitive features necessary for detection, and 2) is a strong region classifier. We finetune only the detector head and combine the detector and VLM outputs for each region at inference time. F-VLM shows compelling scaling behavior and achieves a +6.5 mask AP improvement over the previous state of the art on the novel categories of the LVIS open-vocabulary detection benchmark. We also demonstrate very competitive results on the COCO open-vocabulary detection benchmark and on cross-dataset transfer detection, along with significant training speed-up and compute savings. Code will be released at https://sites.google.com/view/f-vlm/home
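The abstract says the detector and VLM outputs are combined per region at inference time but does not give the combination rule. As a minimal sketch of one plausible rule, the snippet below fuses the two per-category probabilities with a weighted geometric mean, using a different weight for base (seen during training) and novel categories. The function name `fuse_scores`, the weights `alpha` and `beta`, and their default values are assumptions made for illustration and are not taken from the text above.

```python
def fuse_scores(detector_probs, vlm_probs, is_base, alpha=0.35, beta=0.65):
    """Fuse per-category scores for one region (illustrative sketch only).

    detector_probs: probabilities from the finetuned detector head.
    vlm_probs:      probabilities from the frozen VLM region classifier.
    is_base:        True where the category was seen during detector training.
    alpha, beta:    hypothetical VLM weights for base and novel categories.
    """
    if not (len(detector_probs) == len(vlm_probs) == len(is_base)):
        raise ValueError("all inputs must have one entry per category")

    fused = []
    for det, vlm, base in zip(detector_probs, vlm_probs, is_base):
        w = alpha if base else beta  # lean on the VLM more for novel categories
        # Weighted geometric mean: det^(1 - w) * vlm^w
        fused.append((det ** (1.0 - w)) * (vlm ** w))
    return fused


if __name__ == "__main__":
    # One region, three candidate categories (two base, one novel).
    scores = fuse_scores(
        detector_probs=[0.70, 0.20, 0.10],
        vlm_probs=[0.60, 0.10, 0.30],
        is_base=[True, True, False],
    )
    print(scores)
```

Giving novel categories a larger VLM weight reflects the intuition that the detector head never saw them during finetuning, so the frozen VLM is the more reliable signal there. The actual weighting used by the authors may differ.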
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | COCO 2017 (val) | -- | 2454 |
| Object Detection | COCO (val) | mAP 32.5 | 613 |
| Object Detection | LVIS v1.0 (val) | APbbox 34.9 | 518 |
| Object Detection | COCO | AP50 (Box) 53.1 | 190 |
| Instance Segmentation | LVIS v1.0 (val) | -- | 189 |
| Object Detection | OV-COCO | AP50 (Novel) 28 | 97 |
| Instance Segmentation | LVIS | mAP (Mask) 34.9 | 68 |
| Object Detection | LVIS | APr 32.8 | 59 |
| Open-vocabulary object detection | LVIS v1 (val) | AP_r^b 32.8 | 54 |
| Object Detection | Objects365 (val) | mAP 16.2 | 48 |