LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
About
Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that co-training an open-vocabulary detector with a large language model, by generating an image-level detailed caption for each image, can further improve performance. To this end, we first collect a dataset, GroundingCap-1M, in which each image is accompanied by its grounding labels and an image-level detailed caption. With this dataset, we finetune an open-vocabulary detector with training objectives that include a standard grounding loss and a caption generation loss. We use a large language model to generate both region-level short captions for each region of interest and image-level long captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear margin and shows superior open-vocabulary ability. Further, we show that the improved LLMDet can in turn build a stronger large multi-modal model, achieving mutual benefits. The code, model, and dataset are available at https://github.com/iSEE-Laboratory/LLMDet.
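The training objective described above combines a grounding loss with caption generation losses. A minimal sketch of one way such terms might be combined is shown below. It is an illustration only: the function name `combined_loss`, the `LossWeights` container, and the weight values are hypothetical and are not taken from the paper or the LLMDet repository, which may weight or schedule the terms differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LossWeights:
    """Hypothetical per-term weights; the abstract does not state any values."""
    grounding: float = 1.0
    image_caption: float = 1.0
    region_caption: float = 1.0


def combined_loss(
    grounding_loss: float,
    image_caption_loss: float,
    region_caption_loss: float,
    weights: Optional[LossWeights] = None,
) -> float:
    """Weighted sum of the grounding loss and the two caption generation losses.

    Assumption: the terms are simply added. The abstract only says the
    objectives "include" a grounding loss and a caption generation loss.
    """
    w = weights if weights is not None else LossWeights()
    return (
        w.grounding * grounding_loss
        + w.image_caption * image_caption_loss
        + w.region_caption * region_caption_loss
    )


if __name__ == "__main__":
    # Scalar stand-ins for per-batch loss values.
    print(combined_loss(1.0, 0.5, 0.25))
```

In a real training loop the three inputs would be tensors produced by the detector head and the language model, and the weighted sum would be passed to the optimizer's backward step.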
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP 55.6 | 2454 |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy 55.3 | 345 |
| Referring Expression Comprehension | RefCOCO (val) | -- | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | -- | 333 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy 72.5 | 291 |
| Referring Expression Comprehension | RefCOCOg (val) | -- | 291 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy 49.3 | 235 |
| Referring Expression Comprehension | RefCOCO+ (testA) | -- | 207 |
| Referring Expression Comprehension | RefCOCO (testB) | Accuracy 64 | 196 |
| Object Detection | LVIS (val) | mAP 42 | 141 |