CoT-PL: Chain-of-Thought Pseudo-Labeling for Open-Vocabulary Object Detection
About
Open-vocabulary object detection (OVD) aims to recognize and localize object categories beyond the training set. Recent approaches leverage vision-language models to generate pseudo-labels via image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on single-step image-text matching, neglecting the intermediate reasoning steps crucial for interpreting semantically complex visual contexts such as crowding or occlusion. In this paper, we introduce CoT-PL, a framework that incorporates visual chain-of-thought reasoning into the pseudo-labeling process for OVD. It decomposes complex scene understanding into three interpretable steps: object localization, category recognition, and background grounding. These intermediate reasoning states serve as rich supervision sources. Extensive experiments on standard OVD evaluation protocols demonstrate that CoT-PL achieves state-of-the-art performance with superior pseudo-labeling efficiency, outperforming the strong baseline by 9.4 AP50 for novel classes on OV-COCO and improving box and mask APr by 3.2 and 2.2, respectively, on OV-LVIS. Code and models are available at https://github.com/hchoi256/cotpl.
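The three-step decomposition above can be sketched as a simple pseudo-labeling loop. This is a minimal illustration, not the paper's implementation: the function names (`cot_pseudo_label`, `score_fn`, `bg_fn`), the `PseudoLabel` structure, and the threshold are all hypothetical stand-ins for the VLM-based components described in the abstract.

```python
# Hedged sketch of chain-of-thought pseudo-labeling for OVD.
# All names here are illustrative assumptions, not the authors' API.
from dataclasses import dataclass, field

@dataclass
class PseudoLabel:
    box: tuple                 # (x1, y1, x2, y2) region proposal
    category: str              # open-vocabulary class name
    trace: list = field(default_factory=list)  # intermediate reasoning states

def cot_pseudo_label(proposals, vocabulary, score_fn, bg_fn, thresh=0.5):
    """Decompose pseudo-labeling into three interpretable steps:
    (1) object localization, (2) category recognition, (3) background
    grounding. `score_fn(box, name)` and `bg_fn(box)` stand in for
    VLM-based region-text scorers (hypothetical)."""
    labels = []
    for box in proposals:                       # step 1: localized regions
        scores = {name: score_fn(box, name) for name in vocabulary}
        best = max(scores, key=scores.get)      # step 2: category recognition
        if bg_fn(box):                          # step 3: background grounding
            continue                            # reject background regions
        if scores[best] >= thresh:
            labels.append(PseudoLabel(box, best,
                                      trace=["localized",
                                             f"recognized:{best}",
                                             "foreground"]))
    return labels
```

A detector trained on such labels would consume the `box`/`category` pairs, while the `trace` field illustrates how intermediate reasoning states could be kept as additional supervision.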
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | MS-COCO 2017 (val) | -- | -- | 237 |
| Open-vocabulary object detection | OV-COCO | AP@50 (Novel) | 47.8 | 31 |
| Instance Segmentation | OV-LVIS | AP (Rare) | 24.8 | 23 |
| Object Detection | OV-LVIS | AP (Rare) | 26.4 | 21 |
| Object Detection | Objects365 v2 (val) | AP50 | 22.7 | 16 |