
CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks

About

Large Vision-Language Models (LVLMs) have demonstrated remarkable success on a broad range of vision-language tasks, such as general visual question answering and optical character recognition (OCR). However, their performance on perception-centric tasks -- such as object detection, semantic segmentation, and depth estimation -- remains significantly inferior to that of task-specific expert models. For example, Qwen2.5-VL-7B-Instruct achieves only 19% mAP on COCO2017 val, struggling particularly with dense scenes and small-object recall. In this work, we introduce Chain-of-Thought for Detection (CoT4Det), a simple yet effective strategy that reformulates perception tasks into three interpretable steps: classification, counting, and grounding -- each more naturally aligned with the reasoning capabilities of LVLMs. Extensive experiments demonstrate that our method significantly improves perception performance without compromising general vision-language capabilities. With a standard Qwen2.5-VL-7B-Instruct, CoT4Det boosts mAP from 19.0% to 33.0% on COCO2017 val and achieves competitive results across a variety of perception benchmarks, outperforming baselines by +2% on the RefCOCO series and +19% on Flickr30k Entities.
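The three-step decomposition (classify, then count, then ground) can be sketched as a chained prompting pipeline. This is a minimal illustration only: the prompt wording and the box-output format below are assumptions, not the authors' exact prompts, and the model call is left abstract.

```python
import re

def build_cot4det_prompts() -> list[str]:
    """Three chained prompts implementing the classify -> count -> ground
    decomposition described in the abstract. Wording is hypothetical."""
    return [
        # Step 1: classification -- enumerate the categories present.
        "Step 1 (classification): List every object category visible in the image.",
        # Step 2: counting -- instance counts per category.
        "Step 2 (counting): For each category you listed, state how many instances appear.",
        # Step 3: grounding -- one bounding box per instance.
        "Step 3 (grounding): For each instance, output a line 'label: [x1, y1, x2, y2]'.",
    ]

def parse_grounding(response: str) -> list[tuple[str, list[float]]]:
    """Parse grounding lines like 'person: [10, 20, 110, 220]' into
    (label, box) pairs. Assumes the box format requested in step 3."""
    detections = []
    for line in response.strip().splitlines():
        m = re.match(r"\s*(\w[\w ]*?)\s*:\s*\[([^\]]+)\]", line)
        if m:
            label = m.group(1)
            box = [float(v) for v in m.group(2).split(",")]
            detections.append((label, box))
    return detections
```

In use, each prompt would be sent to the LVLM in turn, with the previous answers kept in the conversation context so that the grounding step is conditioned on the categories and counts already produced.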

Yu Qi, Yumeng Zhang, Chenting Gong, Xiao Tan, Weiming Zhang, Wei Zhang, Jingdong Wang • 2025

Related benchmarks

Task                                 Dataset           Metric    Result  Rank
Object Detection                     COCO (val)        mAP       34.6    613
Referring Expression Comprehension   RefCOCO+ (val)    Accuracy  86.5    345
Referring Expression Comprehension   RefCOCO (val)     Accuracy  91.6    335
Referring Expression Comprehension   RefCOCO (testA)   Accuracy  94.2    333
Referring Expression Comprehension   RefCOCOg (test)   Accuracy  88.9    291
Referring Expression Comprehension   RefCOCOg (val)    Accuracy  88.6    291
Referring Expression Comprehension   RefCOCO+ (testB)  Accuracy  80.1    235
Referring Expression Comprehension   RefCOCO+ (testA)  Accuracy  91.8    207
Referring Expression Comprehension   RefCOCO (testB)   Accuracy  88.1    196
Multimodal Understanding             MMBench CN        Accuracy  80.7    162
Showing 10 of 14 benchmark rows.
