
LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models

About

Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that an open-vocabulary detector co-trained with a large language model, by generating image-level detailed captions for each image, can further improve performance. To this end, we first collect a dataset, GroundingCap-1M, in which each image is accompanied by associated grounding labels and an image-level detailed caption. With this dataset, we finetune an open-vocabulary detector with training objectives that include a standard grounding loss and a caption generation loss. We use a large language model to generate both region-level short captions for each region of interest and image-level long captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear margin and shows superior open-vocabulary ability. Further, we show that the improved LLMDet can in turn build a stronger large multi-modal model, achieving mutual benefits. The code, model, and dataset are available at https://github.com/iSEE-Laboratory/LLMDet.

Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, Wei-Shi Zheng • 2025
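
The abstract describes a training objective that combines a standard grounding loss with a caption generation loss, where captions are produced at both the region level and the image level. The following is a minimal sketch of how such terms might be combined, not the authors' implementation: the additive form, the `caption_weight` parameter, and the function names `caption_nll` and `total_loss` are all assumptions made for illustration, and the grounding loss is treated as an already-computed number.

```python
import math


def caption_nll(token_log_probs):
    """Mean negative log-likelihood of the reference caption tokens.

    token_log_probs: log-probabilities the language model assigns to
    each ground-truth caption token (hypothetical input format).
    """
    return -sum(token_log_probs) / len(token_log_probs)


def total_loss(grounding_loss, region_log_probs, image_log_probs,
               caption_weight=1.0):
    """Combine a grounding loss with region- and image-level caption losses.

    The simple weighted sum and the caption_weight parameter are
    assumptions; the abstract only states that both kinds of objective
    are used during finetuning.
    """
    region_loss = caption_nll(region_log_probs)  # short region captions
    image_loss = caption_nll(image_log_probs)    # long image caption
    return grounding_loss + caption_weight * (region_loss + image_loss)


if __name__ == "__main__":
    # Toy numbers only: a grounding loss of 1.0 and a few token log-probs.
    loss = total_loss(
        grounding_loss=1.0,
        region_log_probs=[math.log(0.5), math.log(0.5)],
        image_log_probs=[math.log(0.25)],
    )
    print(f"total loss: {loss:.4f}")
```

In this sketch, setting `caption_weight` to zero reduces the objective to the grounding loss alone, which corresponds to a detector trained without caption supervision.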

Related benchmarks

Task | Dataset | Result | Rank
Object Detection | COCO 2017 (val) | AP 55.6 | 2643
Referring Expression Comprehension | RefCOCO+ (val) | Accuracy 55.3 | 354
Referring Expression Comprehension | RefCOCO (val) | -- | 344
Referring Expression Comprehension | RefCOCO (testA) | -- | 342
Referring Expression Comprehension | RefCOCOg (test) | Accuracy 72.5 | 300
Referring Expression Comprehension | RefCOCOg (val) | -- | 300
Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy 49.3 | 244
Referring Expression Comprehension | RefCOCO+ (testA) | -- | 216
Referring Expression Comprehension | RefCOCO (testB) | Accuracy 64 | 205
Object Detection | LVIS (minival) | AP 51.1 | 141
Showing 10 of 42 rows
