Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

About

Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. However, their multimodal reasoning capability has not been fully explored. In this paper, we propose a Chain-of-Focus (CoF) method that allows VLMs to adaptively focus and zoom in on key image regions based on the visual cues obtained so far and the given question, achieving efficient multimodal reasoning. To enable this CoF capability, we present a two-stage training pipeline consisting of supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct the MM-CoF dataset, comprising 3K samples derived from a visual agent designed to adaptively identify key regions to solve visual tasks with different image resolutions and questions. We use MM-CoF to fine-tune the Qwen2.5-VL model for cold start. In the RL stage, we use outcome accuracy and output format as rewards to update the Qwen2.5-VL model, further refining the model's search and reasoning strategy without human priors. Our model achieves significant improvements on multiple benchmarks. On the V* benchmark, which requires strong visual reasoning capability, our model outperforms existing VLMs by 5% across 8 image resolutions ranging from 224 to 4K, demonstrating the effectiveness of the proposed CoF method and facilitating more efficient deployment of VLMs in practical applications.
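The RL stage described above rewards both outcome accuracy and output format. The following is a minimal sketch of how such a combined reward might be computed, not the authors' implementation. The `<think>`/`<answer>` tag template, the exact-match comparison, the function names, and the `format_weight` value are all assumptions made for illustration; the abstract does not specify them.

```python
import re

# Hypothetical output template (an assumption, not taken from the paper):
# reasoning inside <think>...</think>, final answer inside <answer>...</answer>.
_TEMPLATE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)


def format_reward(response: str) -> float:
    """Return 1.0 if the whole response follows the assumed template, else 0.0."""
    return 1.0 if _TEMPLATE.fullmatch(response) else 0.0


def accuracy_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the extracted answer matches the ground truth, else 0.0.

    Uses a case-insensitive exact match, which is an illustrative choice.
    """
    match = _TEMPLATE.fullmatch(response)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == ground_truth.strip().lower() else 0.0


def total_reward(response: str, ground_truth: str, format_weight: float = 0.5) -> float:
    """Combine the two signals; the weighting scheme is hypothetical."""
    return accuracy_reward(response, ground_truth) + format_weight * format_reward(response)


if __name__ == "__main__":
    good = "<think>Zoom in on the street sign.</think><answer>B</answer>"
    print(total_reward(good, "b"))
    print(total_reward("B", "B"))
```

A response that is well formed but wrong still earns the format term, while an unformatted response earns nothing even if it contains the right answer. This is one common way to encourage a policy to keep a parseable output structure during RL.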

Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li • 2025

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 89.3	2056
Multimodal Reasoning	MM-Vet	MM-Vet Score 66.21	551
Optical Character Recognition	OCRBench	Score 632	486
Multi-discipline Multimodal Understanding	MMMU	Accuracy 46.1	422
Mathematical Reasoning	AIME 2024	Accuracy 1.3	370
Visual Mathematical Reasoning	MathVision	Accuracy 22.9	298
Mathematical Reasoning	Minerva Math	Accuracy 0.74	228
Mathematical Reasoning	AIME 2025	Accuracy 0.00	227
Mathematical Reasoning	AMC	Accuracy 0.00	221
Visual Grounded Reasoning	TreeBench	Overall Score 39.75	162

Showing 10 of 62 rows
