VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

About

Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While performance drops significantly on a small subset of OCR-related tasks, models still perform accurately on most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and decides whether that image is sufficient for solving the problem. If it is not, the model can output a special token to request the higher-resolution image. Compared to existing efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, while saving substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.
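
The two-stage flow described in the abstract (answer from a downsampled image first, request the full-resolution image only when needed) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `RESIZE_REQUEST` string, the `Image` placeholder, the `downsample` helper, and the `model` callable are all hypothetical names, and halving each side is only one assumed reading of "1/4 resolution".

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical marker: the abstract mentions a "special token" but does not name it.
RESIZE_REQUEST = "<request_high_res>"


@dataclass
class Image:
    """Placeholder for an image; only the dimensions matter for this sketch."""
    width: int
    height: int


def downsample(image: Image, factor: int = 2) -> Image:
    """Shrink each side by `factor`.

    Assumption: factor=2 keeps 1/4 of the pixels, which is one way to
    read "1/4 resolution" in the abstract.
    """
    return Image(max(1, image.width // factor), max(1, image.height // factor))


def answer(
    model: Callable[[Image, str], str],
    image: Image,
    question: str,
) -> Tuple[str, bool]:
    """Return (answer_text, used_high_res).

    `model` is a hypothetical callable standing in for the VLM: it takes an
    image and a question and returns text, which may contain RESIZE_REQUEST.
    """
    # First pass: cheap, low-resolution attempt.
    first = model(downsample(image), question)
    if RESIZE_REQUEST not in first:
        return first, False
    # Second pass: the model asked for more detail, so send the original image.
    return model(image, question), True
```

In this sketch, the token saving comes from the first branch: whenever the model answers without emitting the request marker, the full-resolution image is never encoded.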

Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia • 2025

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 87.65	2019
Visual Question Answering	ChartQA	Accuracy 79.9	519
Optical Character Recognition	OCRBench	--	433
Chart Question Answering	ChartQA (test)	Accuracy 73.9	190
Document Visual Question Answering	DocVQA (val)	Accuracy 93.7	166
Visual Grounded Reasoning	TreeBench	Overall Score 41	153
Mathematical Visual Question Answering	MathVista	Accuracy 23.8	87
Multi-modal Question Answering	MMBench	Accuracy 82.73	84
Multi-modal Question Answering	MMMU	Accuracy 51	83
OCR & Document Understanding	OCRBench	Score 808	47

Showing 10 of 26 rows
