# CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks

## About
Large Vision-Language Models (LVLMs) have demonstrated remarkable success across a broad range of vision-language tasks, such as general visual question answering and optical character recognition (OCR). However, their performance on perception-centric tasks, such as object detection, semantic segmentation, and depth estimation, remains significantly inferior to that of task-specific expert models. For example, Qwen2.5-VL-7B-Instruct achieves only 19% mAP on COCO2017 val, struggling particularly with dense scenes and small-object recall. In this work, we introduce Chain-of-Thought for Detection (CoT4Det), a simple yet effective strategy that reformulates perception tasks into three interpretable steps: classification, counting, and grounding, each more naturally aligned with the reasoning capabilities of LVLMs. Extensive experiments demonstrate that our method significantly improves perception performance without compromising general vision-language capabilities. With a standard Qwen2.5-VL-7B-Instruct, CoT4Det boosts mAP from 19.0% to 33.0% on COCO2017 val and achieves competitive results across a variety of perception benchmarks, outperforming baselines by +2% on the RefCOCO series and +19% on Flickr30k Entities.
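The three-step reformulation above can be sketched as a prompting pipeline. This is a minimal illustrative sketch, not the paper's implementation: `query_lvlm` is a hypothetical stand-in for whatever chat interface the LVLM exposes, and the prompt wording and output parsing are assumptions.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2


def cot4det_pipeline(image, query_lvlm: Callable[[object, str], str]) -> Dict[str, List[Box]]:
    """Reformulate detection as classification -> counting -> grounding.

    `query_lvlm(image, prompt)` is a hypothetical LVLM call returning text.
    """
    # Step 1: classification -- which object categories are present?
    cats_text = query_lvlm(
        image, "List the object categories visible in this image, comma-separated."
    )
    categories = [c.strip() for c in cats_text.split(",") if c.strip()]

    results: Dict[str, List[Box]] = {}
    for cat in categories:
        # Step 2: counting -- how many instances of this category?
        count_text = query_lvlm(
            image, f"How many instances of '{cat}' are in the image? Answer with a number."
        )
        try:
            n = int(count_text.strip())
        except ValueError:
            n = 1  # fall back if the model answers in prose

        # Step 3: grounding -- one bounding box per instance
        boxes_text = query_lvlm(
            image,
            f"Give the bounding boxes of the {n} '{cat}' instance(s), "
            "one per line as x1,y1,x2,y2.",
        )
        boxes: List[Box] = []
        for line in boxes_text.strip().splitlines():
            parts = [float(p) for p in line.split(",") if p.strip()]
            if len(parts) == 4:
                boxes.append((parts[0], parts[1], parts[2], parts[3]))
        results[cat] = boxes[:n]  # keep at most the counted number of boxes
    return results
```

Decomposing the task this way lets each sub-question stay in a format LVLMs answer reliably (category lists, small integers, per-instance boxes), rather than asking for a full dense detection output in one shot.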
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO (val) | mAP | 34.6 | 613 |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 86.5 | 345 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 91.6 | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 94.2 | 333 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy | 88.9 | 291 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy | 88.6 | 291 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy | 80.1 | 235 |
| Referring Expression Comprehension | RefCOCO+ (testA) | Accuracy | 91.8 | 207 |
| Referring Expression Comprehension | RefCOCO (testB) | Accuracy | 88.1 | 196 |
| Multimodal Understanding | MMBench CN | Accuracy | 80.7 | 162 |