Thyme: Think Beyond Images

About

Following OpenAI's introduction of the "thinking with images" concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as that of proprietary models (o3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm that enables MLLMs to go beyond existing "think with images" approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach not only facilitates a rich, on-the-fly set of image manipulations (e.g., cropping, rotation, contrast enhancement) but also allows for mathematical computations, all while maintaining high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: an initial supervised fine-tuning (SFT) stage on a curated dataset of 500K samples to teach code generation, followed by a reinforcement learning (RL) phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to increase the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. We conduct extensive experimental analysis and ablation studies. Comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks.
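The abstract states only that GRPO-ATS "applies distinct temperatures to text and code generation". As a rough illustration of that sampling idea, and not the authors' implementation, the sketch below switches the sampling temperature depending on whether the decoder is currently inside a code span. The marker tokens `<code>` / `</code>`, the temperature values (1.0 for text, 0.0 for code), and all function names are assumptions made for this example; the GRPO policy update itself is not shown.

```python
import math
import random

# Hypothetical marker tokens delimiting a code span (an assumption for this sketch).
CODE_OPEN = "<code>"
CODE_CLOSE = "</code>"


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a temperature of 0 means greedy (one-hot)."""
    if temperature <= 0.0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_temperature(tokens_so_far, text_temp=1.0, code_temp=0.0):
    """Return code_temp while inside an unclosed code span, otherwise text_temp."""
    inside_code = False
    for tok in tokens_so_far:
        if tok == CODE_OPEN:
            inside_code = True
        elif tok == CODE_CLOSE:
            inside_code = False
    return code_temp if inside_code else text_temp


def sample_next(vocab, logits, tokens_so_far, rng):
    """Sample one token using the temperature that matches the current context."""
    temperature = pick_temperature(tokens_so_far)
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    vocab = ["crop", "rotate", "think"]
    logits = [0.1, 5.0, 0.2]
    # Inside a code span the assumed code temperature is 0, so decoding is greedy.
    print(sample_next(vocab, logits, ["Let", "me", "zoom", CODE_OPEN], rng))
```

Under these assumptions, text tokens are sampled with more diversity to encourage exploration of reasoning paths, while code tokens are decoded deterministically so that small sampling errors are less likely to break execution.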

Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Haonan Fan, Kaibing Chen, Jiankang Chen, Haojie Ding, Kaiyu Tang, Zhang Zhang, Liang Wang, Fan Yang, Tingting Gao, Guorui Zhou • 2025

Related benchmarks

Task                                       Dataset       Result         Rank
Mathematical Reasoning                     MathVista     Score 68.8     385
Multi-discipline Multimodal Understanding  MMMU          --             317
Visual Mathematical Reasoning              MathVista     Accuracy 70    278
Object Hallucination                       POPE Popular  --             273
Mathematical Reasoning                     MathVista     Accuracy 70    257
Optical Character Recognition              OCRBench      Score 863      232
Mathematical Multimodal Reasoning          MathVista     Accuracy 70    218
Multimodal Math Reasoning                  MathVision    Accuracy 27.6  183
Multimodal Math Reasoning                  WeMath        Accuracy 39.3  168
Mathematical Reasoning                     WeMath        Accuracy 39.3  161
Showing 10 of 105 rows