CoT-PL: Chain-of-Thought Pseudo-Labeling for Open-Vocabulary Object Detection
About
Open-vocabulary object detection (OVD) aims to recognize and localize object categories beyond the training set. Recent approaches leverage vision-language models to generate pseudo-labels via image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on single-step image-text matching, neglecting the intermediate reasoning steps crucial for interpreting semantically complex visual contexts such as crowding or occlusion. In this paper, we introduce CoT-PL, a framework that incorporates visual chain-of-thought reasoning into the pseudo-labeling process for OVD. It decomposes complex scene understanding into three interpretable steps: object localization, category recognition, and background grounding. These intermediate reasoning states serve as rich supervision sources. Extensive experiments on standard OVD evaluation protocols demonstrate that CoT-PL achieves state-of-the-art performance with superior pseudo-labeling efficiency, outperforming the strong baseline by 9.4 AP50 for novel classes on OV-COCO and improving box and mask APr by 3.2 and 2.2, respectively, on OV-LVIS. Code and models are available at https://github.com/hchoi256/cotpl.
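The three-step decomposition above can be sketched as a simple pseudo-labeling loop. This is a minimal illustration, not the paper's implementation: the function names (`cot_pseudo_label`, `score_fn`, `bg_fn`), the `PseudoLabel` structure, and the threshold are all hypothetical stand-ins for the VLM-based components described in the abstract.

```python
# Hedged sketch of chain-of-thought pseudo-labeling for OVD.
# All names here are illustrative assumptions, not the authors' API.
from dataclasses import dataclass, field

@dataclass
class PseudoLabel:
    box: tuple                 # (x1, y1, x2, y2) region proposal
    category: str              # open-vocabulary class name
    trace: list = field(default_factory=list)  # intermediate reasoning states

def cot_pseudo_label(proposals, vocabulary, score_fn, bg_fn, thresh=0.5):
    """Decompose pseudo-labeling into three interpretable steps:
    (1) object localization, (2) category recognition, (3) background
    grounding. `score_fn(box, name)` and `bg_fn(box)` stand in for
    VLM-based region-text scorers (hypothetical)."""
    labels = []
    for box in proposals:                       # step 1: localized regions
        scores = {name: score_fn(box, name) for name in vocabulary}
        best = max(scores, key=scores.get)      # step 2: category recognition
        if bg_fn(box):                          # step 3: background grounding
            continue                            # reject background regions
        if scores[best] >= thresh:
            labels.append(PseudoLabel(box, best,
                                      trace=["localized",
                                             f"recognized:{best}",
                                             "foreground"]))
    return labels
```

A detector trained on such labels would consume the `box`/`category` pairs, while the `trace` field illustrates how intermediate reasoning states could be kept as additional supervision.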
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | MS-COCO 2017 (val) | -- | -- | 237 |
| Open-vocabulary object detection | OV-COCO | AP@50 (Novel) | 47.8 | 31 |
| Instance Segmentation | OV-LVIS | AP (Rare) | 24.8 | 23 |
| Object Detection | OV-LVIS | AP (Rare) | 26.4 | 21 |
| Object Detection | Objects365 v2 (val) | AP50 | 22.7 | 16 |