# VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
## About
Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While performance drops significantly on a small subset of OCR-related tasks, models still answer most other general VQA tasks accurately at only 1/4 resolution. Therefore, we propose to dynamically process distinct samples at different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether that is sufficient for solving the problem. If not, the model outputs a special token to request the higher-resolution image. Compared to existing efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding on OCR-related tasks, while saving substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.
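The two-pass decision loop described above can be sketched in a few lines. This is a minimal illustration, not the repository's actual API: `generate_fn`, `RESIZE_TOKEN`, and the tuple-based `downsample` are hypothetical stand-ins for the real model call, special token, and image preprocessing.

```python
RESIZE_TOKEN = "<resize>"  # hypothetical special token requesting the full-resolution image

def downsample(image):
    """Stand-in for 2x spatial downsampling (roughly 1/4 of the visual tokens).
    Here `image` is just a (width, height) pair for illustration."""
    w, h = image
    return (max(1, w // 2), max(1, h // 2))

def visionthink_infer(generate_fn, image, question):
    """Two-pass inference: try the cheap image first, upscale only on demand.
    `generate_fn(image, question) -> str` stands in for the VLM forward pass."""
    # First pass: answer from the downsampled image.
    answer = generate_fn(downsample(image), question)
    if RESIZE_TOKEN in answer:
        # The model judged the low-resolution input insufficient
        # (e.g. fine-grained OCR), so pay the full visual-token cost
        # only for this sample.
        answer = generate_fn(image, question)
    return answer
```

Because the resize request is an ordinary output token, the case-by-case compression decision is learned end-to-end with RL rather than set by a fixed pruning ratio.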
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 87.65 | 1455 |
| Visual Question Answering | ChartQA | Accuracy | 79.9 | 371 |
| Optical Character Recognition | OCRBench | -- | -- | 232 |
| Chart Question Answering | ChartQA (test) | Accuracy | 73.9 | 176 |
| Document Visual Question Answering | DocVQA (val) | Accuracy | 93.7 | 157 |
| Visual Grounded Reasoning | TreeBench | Overall Score | 41 | 128 |
| Multi-modal Question Answering | MMBench | Accuracy | 82.73 | 55 |
| Mathematical Visual Question Answering | MathVista | Accuracy | 23.8 | 47 |
| OCR & Document Understanding | OCRBench | Score | 808 | 28 |
| Visual Perception Reasoning | V*Bench | Score | 73.8 | 28 |