# CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks

## About
Large Vision-Language Models (LVLMs) have demonstrated remarkable success across a broad range of vision-language tasks, such as general visual question answering and optical character recognition (OCR). However, their performance on perception-centric tasks, such as object detection, semantic segmentation, and depth estimation, remains significantly inferior to that of task-specific expert models. For example, Qwen2.5-VL-7B-Instruct achieves only 19% mAP on COCO2017 val, struggling particularly with dense scenes and small-object recall. In this work, we introduce Chain-of-Thought for Detection (CoT4Det), a simple yet effective strategy that reformulates perception tasks into three interpretable steps: classification, counting, and grounding, each more naturally aligned with the reasoning capabilities of LVLMs. Extensive experiments demonstrate that our method significantly improves perception performance without compromising general vision-language capabilities. With a standard Qwen2.5-VL-7B-Instruct, CoT4Det boosts mAP from 19.0% to 33.0% on COCO2017 val and achieves competitive results across a variety of perception benchmarks, outperforming baselines by +2% on the RefCOCO series and +19% on Flickr30k Entities.
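The three-step reformulation above can be sketched as a prompting pipeline. This is a minimal illustrative sketch, not the paper's implementation: `query_lvlm` is a hypothetical stand-in for whatever chat interface the LVLM exposes, and the prompt wording and output parsing are assumptions.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2


def cot4det_pipeline(image, query_lvlm: Callable[[object, str], str]) -> Dict[str, List[Box]]:
    """Reformulate detection as classification -> counting -> grounding.

    `query_lvlm(image, prompt)` is a hypothetical LVLM call returning text.
    """
    # Step 1: classification -- which object categories are present?
    cats_text = query_lvlm(
        image, "List the object categories visible in this image, comma-separated."
    )
    categories = [c.strip() for c in cats_text.split(",") if c.strip()]

    results: Dict[str, List[Box]] = {}
    for cat in categories:
        # Step 2: counting -- how many instances of this category?
        count_text = query_lvlm(
            image, f"How many instances of '{cat}' are in the image? Answer with a number."
        )
        try:
            n = int(count_text.strip())
        except ValueError:
            n = 1  # fall back if the model answers in prose

        # Step 3: grounding -- one bounding box per instance
        boxes_text = query_lvlm(
            image,
            f"Give the bounding boxes of the {n} '{cat}' instance(s), "
            "one per line as x1,y1,x2,y2.",
        )
        boxes: List[Box] = []
        for line in boxes_text.strip().splitlines():
            parts = [float(p) for p in line.split(",") if p.strip()]
            if len(parts) == 4:
                boxes.append((parts[0], parts[1], parts[2], parts[3]))
        results[cat] = boxes[:n]  # keep at most the counted number of boxes
    return results
```

Decomposing the task this way lets each sub-question stay in a format LVLMs answer reliably (category lists, small integers, per-instance boxes), rather than asking for a full dense detection output in one shot.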
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO (val) | mAP | 34.6 | 613 |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 86.5 | 345 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 91.6 | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 94.2 | 333 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy | 88.9 | 291 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy | 88.6 | 291 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy | 80.1 | 235 |
| Referring Expression Comprehension | RefCOCO+ (testA) | Accuracy | 91.8 | 207 |
| Referring Expression Comprehension | RefCOCO (testB) | Accuracy | 88.1 | 196 |
| Multimodal Understanding | MMBench CN | Accuracy | 80.7 | 162 |