
CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks

About

Large Vision-Language Models (LVLMs) have demonstrated remarkable success across a broad range of vision-language tasks, such as general visual question answering and optical character recognition (OCR). However, their performance on perception-centric tasks, such as object detection, semantic segmentation, and depth estimation, remains significantly inferior to that of task-specific expert models. For example, Qwen2.5-VL-7B-Instruct achieves only 19.0% mAP on COCO2017 val, struggling particularly with dense scenes and small-object recall. In this work, we introduce Chain-of-Thought for Detection (CoT4Det), a simple but efficient strategy that reformulates perception tasks into three interpretable steps: classification, counting, and grounding, each more naturally aligned with the reasoning capabilities of LVLMs. Extensive experiments demonstrate that our method significantly improves perception performance without compromising general vision-language capabilities. With a standard Qwen2.5-VL-7B-Instruct, CoT4Det boosts mAP from 19.0% to 33.0% on COCO2017 val and achieves competitive results across a variety of perception benchmarks, outperforming baselines by +2% on the RefCOCO series and +19% on Flickr30k Entities.
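To make the three steps concrete, here is a minimal Python sketch of a classification, counting, and grounding loop. It is an illustration under stated assumptions, not the authors' implementation: the `ask` callable (a stand-in for an LVLM that already has the image attached), the prompt wording, the JSON reply format, and the `cot_detect` function name are all hypothetical and do not come from the paper.

```python
import json
from typing import Callable, Dict, List

# Hypothetical interface: takes a text prompt (the image is assumed to be
# bound to the model elsewhere) and returns the model's text reply.
AskFn = Callable[[str], str]


def cot_detect(ask: AskFn, candidate_classes: List[str]) -> Dict[str, List[List[float]]]:
    """Sketch of a three-step detection query: classify, count, then ground."""
    # Step 1: classification. Ask which candidate categories are present.
    reply = ask(
        "Which of these categories appear in the image? "
        f"Answer as a JSON list. Categories: {candidate_classes}"
    )
    present = [c for c in json.loads(reply) if c in candidate_classes]

    detections: Dict[str, List[List[float]]] = {}
    for cls in present:
        # Step 2: counting. Ask how many instances of this category exist.
        count = int(ask(f"How many instances of '{cls}' are in the image? "
                        "Answer with one integer."))
        if count <= 0:
            continue
        # Step 3: grounding. Ask for one box per counted instance.
        boxes = json.loads(ask(
            f"Return exactly {count} bounding boxes for '{cls}' "
            "as a JSON list of [x1, y1, x2, y2]."
        ))
        # Keep at most `count` boxes (an assumption of this sketch).
        detections[cls] = boxes[:count]
    return detections
```

In this sketch the count from step 2 is used to cap the number of boxes accepted in step 3; that is a design choice made for the illustration, and the paper may couple the steps differently.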

Yu Qi, Yumeng Zhang, Chenting Gong, Xiao Tan, Weiming Zhang, Wei Zhang, Jingdong Wang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO (val) | mAP | 34.6 | 637 |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 86.5 | 354 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 91.6 | 348 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 94.2 | 346 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy | 88.9 | 300 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy | 88.6 | 300 |
| Multimodal Understanding | MMBench CN | Accuracy | 80.7 | 254 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy | 80.1 | 244 |
| Referring Expression Comprehension | RefCOCO+ (testA) | Accuracy | 91.8 | 216 |
| Referring Expression Comprehension | RefCOCO (testB) | Accuracy | 88.1 | 213 |
Showing 10 of 14 rows
