Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection

About

In pursuit of detecting unrestricted objects that extend beyond predefined categories, prior work on open-vocabulary object detection (OVD) typically resorts to pretrained vision-language models (VLMs) for base-to-novel category generalization. However, to mitigate the misalignment between upstream image-text pretraining and downstream region-level perception, additional supervision is indispensable, e.g., image-text pairs or pseudo annotations generated via self-training strategies. In this work, we propose CCKT-Det, which is trained without any extra supervision. The proposed framework constructs a cyclic and dynamic knowledge transfer from language queries and visual region features extracted from VLMs, which forces the detector to closely align with the visual-semantic space of VLMs. Specifically, we 1) prefilter and inject semantic priors to guide the learning of queries, and 2) introduce a regional contrastive loss to improve the awareness of queries on novel objects. CCKT-Det consistently improves performance as the scale of VLMs increases, while keeping the detector's computational overhead moderate. Comprehensive experimental results demonstrate that our method achieves performance gains of +2.9% and +10.2% AP50 over the previous state of the art on the challenging COCO benchmark, without and with a stronger teacher model, respectively.
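The abstract names a regional contrastive loss but does not give its form. The following is a minimal sketch of one common choice, an InfoNCE-style loss, under the assumption that each detector query embedding has exactly one matched VLM region feature and that all other regions in the batch act as negatives. The function name `regional_contrastive_loss`, the one-to-one pairing, and the temperature value 0.07 are illustrative assumptions, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def regional_contrastive_loss(queries, regions, temperature=0.07):
    """InfoNCE-style loss (assumed form, not the paper's exact definition).

    queries[i] is assumed to be matched to regions[i]; every other region
    serves as a negative for that query.
    """
    if len(queries) != len(regions) or not queries:
        raise ValueError("need equal, non-zero numbers of queries and regions")
    q = [l2_normalize(v) for v in queries]
    r = [l2_normalize(v) for v in regions]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarities to all regions, scaled by the temperature.
        logits = [dot(qi, rj) / temperature for rj in r]
        # Numerically stable log-sum-exp over the logits.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matched region as the target class.
        total += log_denom - logits[i]
    return total / len(q)


# Toy check: matched pairs give a lower loss than mismatched pairs.
aligned = regional_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                    [[1.0, 0.0], [0.0, 1.0]])
swapped = regional_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                    [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is near zero when each query points at its own region and large when the pairs are swapped, which is the behavior a loss of this kind relies on to pull queries toward the VLM's visual-semantic space.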

Chuhan Zhang, Chaoyang Zhu, Pingcheng Dong, Long Chen, Dong Zhang • 2025

Related benchmarks

Task	Dataset	Result
Object Detection	COCO	AP50 (Box) 53.2	237
Object Detection	LVIS	APr 18.2	59
Open-vocabulary object detection	OV-COCO	AP@50 (Novel) 41.9	31
Open-vocabulary object detection	OV-COCO (test)	--	28
Object Detection	OV-LVIS	AP (Rare) 18.2	21
Object Detection	Object365	AP 13.4	17
Object Detection	COCO Novel Base All 2017	AP Novel 46	9
