
CoT-PL: Chain-of-Thought Pseudo-Labeling for Open-Vocabulary Object Detection

About

Open-vocabulary object detection (OVD) aims to recognize and localize object categories beyond the training set. Recent approaches leverage vision-language models to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on single-step image-text matching, neglecting the intermediate reasoning steps crucial for interpreting semantically complex visual contexts, such as crowding or occlusion. In this paper, we introduce CoT-PL, a framework that incorporates visual chain-of-thought reasoning into the pseudo-labeling process for OVD. It decomposes complex scene understanding into three interpretable steps (object localization, category recognition, and background grounding), and the intermediate reasoning states from these steps serve as rich supervision sources. Extensive experiments on standard OVD evaluation protocols demonstrate that CoT-PL achieves state-of-the-art performance with superior pseudo-labeling efficiency, outperforming the strong baseline by 9.4 AP50 for novel classes on OV-COCO and improving box and mask APr by 3.2 and 2.2, respectively, on OV-LVIS. Code and models are available at https://github.com/hchoi256/cotpl.

Hojun Choi, Youngsun Lim, Jaeyo Shin, Hyunjung Shim • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | MS-COCO 2017 (val) | -- | 237 |
| Open-vocabulary object detection | OV-COCO | AP@50 (Novel): 47.8 | 31 |
| Instance Segmentation | OV-LVIS | AP (Rare): 24.8 | 23 |
| Object Detection | OV-LVIS | AP (Rare): 26.4 | 21 |
| Object Detection | Objects365 v2 (val) | AP50: 22.7 | 16 |
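
For readers unfamiliar with the AP50 figures reported here: under the standard COCO-style criterion, a predicted box counts as a correct detection when its intersection-over-union (IoU) with a ground-truth box of the same class is at least 0.5. The sketch below illustrates only that matching criterion. The function names are illustrative assumptions, not taken from the CoT-PL codebase, and the full metric additionally ranks detections by confidence and integrates precision over recall.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(pred_box, gt_box, threshold=0.5):
    """Hypothetical helper: True if the prediction matches at IoU >= 0.5."""
    return box_iou(pred_box, gt_box) >= threshold
```

For example, `is_match_at_50((0, 0, 10, 10), (0, 0, 10, 6))` is a match (IoU 0.6), while the same prediction against `(0, 0, 10, 4)` is not (IoU 0.4).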
